This is the second in a three-part Archive-It Workshop series for new Archive-It users. It was initially developed and presented to new members of the Community Webs cohort. Other workshops in this series include Selection, Scoping, and Crawling, and How to find and use web archives.
On this page:
- Introduction
- Objective
- Training Recording
- Materials
- Learn: Review
- Where can I find my crawl report?
- What's inside a crawl report?
- What should I check first in my crawl report?
- Tips
- Learn: Quality Assurance
- Patch Crawl via the Hosts Report
- Use the Wayback QA Tool to run Patch Crawls
- Crawl Subpage as Its Own Seed
- Open a Subpage in a New Tab
- Upload WARCs
- Tips
- To Do
- Additional Resources
Introduction
Once your crawl jobs have completed, but before you provide access to the general public, you may want to review the results for completeness and quality. While many websites archive fully on the first try, the same openness, dynamism, and innovation that we love about the web can sometimes make it challenging to archive. As a result, it’s important to review your crawls and conduct quality assurance (QA) to check that your Wayback links replay according to your expectations.
While we are always working to improve our service and adapt to new web archiving challenges, there are steps you can take to improve the capture and replay of your archived websites. If anything was not captured, it is better to catch the issue early, while a solution can still be found, than months later, when the content may have already changed on the live web. Using Archive-It’s instructional documentation and supplementary video curriculum, this training will guide you through the review and QA process for your archived sites and, if necessary, the steps to improve your captures. You will integrate the knowledge and tools provided into your own web archiving workflow and develop a process that best suits your organization.
Objective
This training will provide a quality assurance overview and help you to identify the different areas of a post-crawl report, determine what was captured, and evaluate the quality of your capture by browsing it in Wayback. You will also learn next steps to improve capture and replay if your archived websites don’t look quite right.
After completing this training, you will understand the process of reviewing your completed crawls and using our quality assurance tools, including potential places for individualization and implications for specific institutions. You will begin to reflect on developing a Quality Assurance workflow that best suits the needs, resources, and priorities of your organization. Workflows that are appropriate for a large public library with multiple staff members who can contribute to Quality Assurance may look very different from those employed by smaller libraries with fewer staff or volunteer hours to contribute.
Training recording
This workshop was presented as part of a training series for new Community Webs partners.
Watch the recording (recorded May 18, 2021).
Materials
- The Archive-It Help Center
- Your Archive-It account
- One crawl report from a Test Crawl, Saved Test Crawl, or Production Crawl (One-Time or Scheduled Crawl) that completed over 24 hours ago (for the in-person workshop, please have one that is not a social media seed; frequent updates to social media platforms make them moving targets to archive)
Learn: Review
Before looking at your crawl in Wayback, it’s important to check your crawl report for any issues or incompleteness. For this section you will need a crawl report to review, either a Test Crawl or Production Crawl. You may have some crawl reports already from completing the previous training on Selection, Scoping, and Crawling.
Before beginning review of your crawl report, read the Quality Assurance Overview to familiarize yourself with the most important steps involved in the review and QA process. This lesson will introduce the concepts below in more detail to begin guiding you through reviewing your reports.
Where can I find my crawl report?
There are a couple of ways to access your Crawl Reports. You can find all of them through the Crawls link in the navigation bar of the web application, under the “Crawl Reports” tab. You can also find them through the Crawls tab in a given collection. Every crawl has a unique identifier called a Crawl ID, and reports are listed by Crawl ID (found to the left of each crawl) by default. Click the Crawl ID link associated with a crawl to access its report. For a visualization, see our article on where to find your crawl report.
What’s inside a crawl report?
Each crawl report comprises four separate parts: the Crawl Overview, Seeds Report, Hosts Report, and File Types Report. While the Crawl Overview provides a high-level summary of how the crawl was conducted, the other three tabs include more specialized information on the seeds, hosts, and file types included in your crawl. Watch this approximately 7-minute training video on getting the most from your crawl reports before we dive into the specifics of each sub-report.
Reading your crawl report
Crawl Overview
Checking the Crawl Overview tab is the first step in the review process. This tab provides a general summary of the crawl and is a good place to check first when a crawl finishes. This tab includes:
- The crawl status: This indicates why a crawl finished and whether it finished because of a limit in place. Common crawl statuses include Finished, Document Limit, Data Limit, and Time Limit.
- Summary data: This indicates how much total data was crawled and how much new data was added to your collection and data budget. You can learn more about why crawled data might not count toward your data budget by reading about our data de-duplication process.
- Crawl frequency: This indicates what type of crawl you ran, such as a Test Crawl, One-Time Crawl, or Scheduled Crawl. For a Scheduled Crawl, it will also indicate the frequency (Daily, Monthly, Annually, etc.).
- Scoping rules: This tab also lists the scoping rules in place for the crawl, including any crawl limits set, collection scope rules, and seed scope rules.
Seeds Report
This report can be found next to the “Overview” tab and it displays information specific to how each seed in your crawl was archived. This tab includes:
- Seed status: This indicates whether each seed was successfully crawled, and if not, what happened. Common seed statuses include Crawled, Redirected, Not Crawled, Queued, and Blocked. Read this article on How to interpret crawl status codes in your Seeds Report.
- Documents and data: This indicates the number of both total and new documents and the amount of data archived during your crawl. You can find this information broken down by seed in the table at the bottom of this tab. The figures are also presented graphically so you can easily compare how much data was archived from each seed.
- Hosts, by seed: If you click on any of the seed URLs listed in this tab, this narrows your view even more by bringing you to a report breaking down every host crawled through that specific seed.
- Wayback View: Each captured seed listed in this report is accompanied by a "Wayback" link in the far-right column of the seeds table. You can click this link to view how each seed renders in Wayback; these links follow a predictable URL pattern, shown below. Remember that this view only becomes available approximately 24 hours after the crawl completes.
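If you want to keep notes on the captures you review, it helps to know that Archive-It Wayback URLs generally follow a consistent pattern (the collection ID, timestamp, and seed URL below are illustrative):

  https://wayback.archive-it.org/1234/20210518123456/https://www.example.org/

Here 1234 is the collection ID, the 14-digit number is the capture timestamp (YYYYMMDDhhmmss), and the final portion is the original seed URL. Replacing the timestamp with an asterisk (*) typically brings up a calendar of all captures of that URL in the collection.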
Hosts Report
This report includes information on every distinct host site that your crawl was led to. This can include the hosts of your seed URLs as well as all other sites considered or directed to be in scope.
Use the graphic at the top of the report to see how much data was collected from each host site and compare it to the others. The table at the bottom of the report can be used to browse precise figures on a host-by-host basis, including complete listings of all total and new documents. You can review the URLs in any of the table's categories by clicking the hyperlinked number in that cell.
The three columns on the right-hand side of the table give you important information about what documents weren’t archived in your crawl, why, and what you can do:
- Blocked: This column indicates that these URLs were not archived because they were excluded from crawling by the robots.txt protocol (a short example robots.txt file appears after this list). In the next section, you’ll learn how to patch crawl documents blocked by robots.txt directly from this tab.
- Queued: This column indicates that these URLs were not archived because our crawler was not able to reach them before hitting a predetermined time, data, or document limit. A high number of queued URLs may indicate a crawler trap, such as an online calendar that generates an endless series of date-based URLs. We recommend learning about crawler traps and how to look for them in queued URLs in our Help Center.
- Out of Scope: This column indicates that the crawler did not archive these URLs because it deemed them to be outside the scope of your collection. You can read about how to modify the scope of your collection's crawls to explicitly include URLs of this type in the future.
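For context, a robots.txt file is a plain-text file at the root of a website that tells crawlers which paths they may or may not visit. A minimal, hypothetical example (the paths shown are illustrative):

  User-agent: *
  Disallow: /calendar/
  Disallow: /private/

Documents under the disallowed paths would appear in the Blocked column of your Hosts Report; the patch crawl options described in the Quality Assurance section below let you capture them anyway by ignoring these rules.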
The Hosts Report can be overwhelming, but it includes a lot of important data about your crawl. We recommend watching our approximately 8-minute training video on Understanding your Hosts Report to gain a deeper understanding of the different ways it can be used.
File Types Report
This report includes data specific to each type of file archived during the course of a crawl.
The report organizes and provides access to all of the crawled URLs, grouped by type, and the graphics at the top show the most common file types encountered during your crawl. You can drill down to the documents archived for each type by clicking directly on the hyperlinked file type. From here you can also play captured videos in our Video Player by clicking “View” in the right-hand column of the video/MP4 file type.
What should I check first in my crawl report?
Crawl reports can contain quite a bit of information, especially if you are crawling multiple seeds at once. We recommend reading this step-by-step list of what areas of a crawl report you should check first when your crawl finishes.
Tips
Don’t wait too long to review and conduct QA. Live websites are constantly changing and being updated. If there is something missing from the crawl, you might not be able to fix it before the content changes or is removed.
If a production crawl finishes due to a limit, you can always resume the crawl within seven days of it stopping in order to pick it up right where it left off.
If you find any issues with a website or crawl, note them down so that you can remember later why a crawl was resumed, patch crawled, or needed extra QA.
Learn: Quality Assurance
After reviewing your crawl reports, but before saving a test crawl, it's always a good idea to check the replay of your archived pages in Wayback to make sure they meet your expectations and will serve future users' needs.
You can load your archived pages through the "Wayback" links to the right of each seed in the Seeds Report. Note that the Seeds Report is the only place where you can access test crawl results in Wayback.
Look over your Wayback page and compare its look and feel to its counterpart on the live web to make sure you captured everything you want. Look for missing elements on the archived page (images, menus, headers, footers, grids, etc.). Also look for elements that don't work (links, embedded video and audio players). Identify links that lead to a "Not in Archive" page and take note of their URLs for further QA steps. Keep notes of unexpected replay issues and cross-check them with the System Status page.
If things look similar to what you see on the live web, wonderful! If it's a test crawl, save it. If things don't quite look right, don't worry! There are a few steps you can try to improve the quality of archived web pages on Saved Test or Production crawls.
Patch Crawl via the Hosts Report
Missing documents that were blocked by robots.txt files can easily be patch crawled through the Hosts Report for Saved Test, One-Time, and Production crawls.
Return to the crawl's Hosts Report and check the "Blocked" column to see whether the missing URLs belong to any of those hosts. If there are URLs there, start a patch crawl by clicking the check box to select the host and then clicking the "Run Patch Crawl" button. Be sure to also check the resulting box to "Ignore Robots.txt." Check your email for a message confirming that the patch crawl has completed.
Use the Wayback QA Tool to run Patch Crawls
Wayback QA is a tool that scans the Wayback page you're viewing and identifies documents that were not captured initially by the crawler. It allows you to patch those documents back into your Wayback page through a Patch Crawl. Missing style elements or embedded elements may be improved by Patch Crawls with Wayback QA.
Whenever you are logged into your Archive-It account, you can enable QA from the yellow Wayback banner (note that this tool is not available for unsaved Test Crawls, which display a blue Wayback banner). Keep QA enabled as you look over your Wayback page.
The Wayback QA tool detects missing documents as you scroll down the page and click, hover over, or otherwise activate features on it. When you are done looking it over, click the "View Missing URLs" link in the banner to open the Wayback QA tool in the partner application.
"Wayback QA" opens to the "Missing Documents" sub-tab so that you can review the list and identify documents that you would like to patch crawl. You can filter the list for specific missing URLs by using the search bar at the top.
From there, you can run patch crawls on the missing URLs. Use the check boxes to the left of each missing document listed in the Wayback QA tab to select it for patch crawling. Next, click the "Patch Crawl Selected" button at the top left of the table. This opens a dialog box that confirms your choice to launch a patch crawl and offers the option to "Ignore Robots.txt" for any URLs that were blocked by robots.txt. Click the "Crawl" button to start your patch crawl.
Crawl Subpage as Its Own Seed
If the missing content is only on a specific page, identify whether that page was crawled as a seed or as a subpage of another seed. If it was a subpage, try crawling the subpage as its own seed. This can focus the crawler on picking up the dynamic elements embedded on that specific page. Set this seed to the One Page seed type and its access to "private," and then crawl it together with the main seed it came from.
Open a Subpage in a New Tab
Sometimes links don't seem to work within the archived page, but you can still open them in a new tab. The easiest way is to right-click the link and select the "Open Link in New Tab" option. This method is most useful for links whose URLs contain #, links built with JavaScript, or when attempting to open videos in YouTube channels or playlists.
Upload WARCs
If, after trying all of these steps, your page still doesn't capture or look good in Wayback, try using another capture mechanism, like Conifer, and uploading your WARCs. Conifer is a free web recorder that allows you to record websites as you browse them and create WARC files that can later be integrated into your Archive-It account using the Upload WARCs feature. The Upload WARCs feature can be enabled in your Archive-It account by submitting a support ticket.
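For reference, a WARC (Web ARChive) file is the standard container format for archived web content: a sequence of records, each consisting of a short header followed by the captured payload. A single response record header looks roughly like the sketch below (the target URI, date, record ID, and length are illustrative):

  WARC/1.0
  WARC-Type: response
  WARC-Target-URI: https://www.example.org/
  WARC-Date: 2021-05-18T16:00:00Z
  WARC-Record-ID: <urn:uuid:0f8e2c3a-...>
  Content-Type: application/http; msgtype=response
  Content-Length: 2345

Because tools like Conifer write this same format, the files they produce can be ingested alongside your Archive-It crawls once the Upload WARCs feature is enabled on your account.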
Tips
- Wait 24 hours after crawls complete to avoid the "Not In Archive" page.
- Keep notes of any remaining "Not In Archive" pages' URLs after the 24-hour indexing period.
- Patch crawl missing documents soon, before the website changes on the live web.
- Clear your browser's cache and cookies before loading Wayback pages.
- Keep notes of unexpected replay issues and cross-check issues with the System Status page.
- Troubleshoot blank pages with these steps before submitting a support ticket.
- Keep track of how long the QA process takes you.
Additional Resources
Read
Learn about comparing two separate crawls to evaluate how crawls of a specific seed vary over time or to evaluate the effectiveness of any new scoping rules you may have added.
Reflect on common challenges confronted by partners during the Review and QA process by reviewing their thoughts from our last Archive-It Partner Meeting (2020), which included a Quality Assurance Discussion.
Watch
What can you do if your archived websites don't look quite right in Wayback? Watch our post-crawl analysis training videos on Quality Assurance:
- Quality Assurance [8:17]
- Web Archiving Quality Assurance [43:21]
Discuss
Explore our Collection Building Resources and Guidelines Community Forum to see how other partners are doing QA or to share your workflow more widely.
To Do
Using the information you've learned in this training, please try out the following steps. You will need a Saved Test Crawl or Production Crawl (One-Time or Scheduled Crawl) that completed over 24 hours ago to try these steps.
- Using this Review Checklist, go over your Crawl Report to determine what was captured, not captured, and next steps.
- Using this QA Checklist, browse your archived seeds in Wayback mode to check for completeness and to determine additional QA steps.
Keep track of the time this process takes, any actions taken or considered, and any changes to be made. Later, these can be incorporated into your own organizational QA process in order to personalize the checklist to work for your institution.