Why perform QA?
Once your crawl has completed, it's important to conduct Quality Assurance (QA) in order to check that your collection replays completely and accurately according to your expectations. If anything was not archived fully or correctly, it is best to catch these issues sooner (when a solution may be found) rather than later (when it may be too late because the content you are looking for has already changed on the live web).
How to do it
QA can be performed quickly and easily if you remember to look for a few simple things. You can get a feel for the quality of your archived collections by simply reviewing a few key indicators in your crawl reports and by quickly browsing through your archived seed sites as they appear in Wayback. Use the checklist below to guide you through this process and, if need be, to improve the completeness and accuracy of any crawl.
A. Review crawl reports
- Review the information in your crawl's "Overview" tab in order to make sure that the crawl completed successfully (see "Status") and that the volume of material archived meets your expectations. Errant or obstructed crawls may require that you modify the scope of your collection in order to target your desired content more accurately.
- Review the information in your crawl's Seeds report, particularly each "Seed Status," in order to make sure that all of your seeds were crawled successfully, or alternatively whether any robots.txt exclusions or other errors prevented seeds from being crawled.
- Review the information in your crawl's Hosts report in order to determine whether any valuable hosts were blocked from crawling by robots.txt exclusions or deemed by our crawler to be outside of the scope of your collection. You can address such omissions directly from the Hosts report by ignoring robots.txt exclusions, patch crawling, or modifying the scope of your collection's crawls. Queued documents that do not point to a crawler trap may indicate that the time, document, or data limit on the crawl should be extended in order to capture missing elements.
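If you want to understand why a particular host was blocked, the robots.txt logic can be reproduced locally. The sketch below uses Python's standard urllib.robotparser to test whether a crawler may fetch a URL; the host, paths, and user-agent string here are illustrative assumptions, not Archive-It's actual configuration.

```python
from urllib import robotparser

# Parse a robots.txt body directly. These rules are invented for the
# example; a real check would fetch the site's own /robots.txt file.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# A URL under the disallowed path is blocked for any user-agent,
# while other paths remain fetchable.
print(parser.can_fetch("example-bot", "https://example.com/private/report.pdf"))  # False
print(parser.can_fetch("example-bot", "https://example.com/public/index.html"))   # True
```

This mirrors the decision the crawler makes before requesting each document, which is why ignoring robots.txt in a patch crawl can recover the blocked material.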
B. Browse archived sites in Wayback
- Begin by accessing one or more of your seed sites in Wayback mode.
- Browse through your archived site(s), clicking links and activating dynamic media players in order to make sure that they were archived in accordance with your expectations. Whenever possible, and especially when assuring the quality of dynamic components like video and interactive applets, be sure to double-check your archives by browsing in Wayback's Proxy Mode. (If you have an especially extensive site or collection of sites in your archives, we recommend prioritizing those pages and elements that are most valuable to or representative of your collection in order to optimize the time you commit to this process).
- For production and saved test crawls only, if you notice links leading to pages with "Not in Archive" messages or embedded elements that do not render, you can identify and capture the relevant URLs with the Wayback QA and patch crawling mechanisms:
- To detect missing URLs in Wayback view, click on the "Enable QA" link in the Wayback banner. This feature scans your view for missing elements, which you may then review by following the "View Missing URLs (# Detected)" link in the banner.
- From the "Wayback QA" tab in our web application, you can select the URLs that you wish to capture and click Patch Crawl in order to add the missing elements back to each page.
- Note that you must be logged in to the Archive-It web application in order to see the Enable QA link in the Wayback banner. If you are logged in and still cannot see the link, quickly log out of the web application and log back in, then refresh your Wayback view.
- When conducting thorough patch crawls, spanning many web pages or multiple websites, we recommend that you consolidate your patch crawls in order to make them most efficient. To do so, detect the missing URLs in the manner described above, but do not patch crawl any URLs until you have reviewed all of the material that you wish to QA. Once browsing is complete, navigate to the Wayback QA tab for the relevant collection in the web application and patch crawl some or all of the URLs missing from the pages visited across the entire collection.
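Conceptually, the "Enable QA" scan walks the rendered page and collects the URLs of its links and embedded elements so that missing ones can be reported. As a rough, unofficial illustration of that idea (not Archive-It's actual implementation), the sketch below uses Python's html.parser to gather src and href attributes from a page's HTML; the sample markup is invented for the example.

```python
from html.parser import HTMLParser

class ResourceCollector(HTMLParser):
    """Collect URLs referenced by links and embedded elements on a page."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Links use href; images, scripts, and media players use src.
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.urls.append(value)

# Invented sample markup standing in for an archived page.
page = """
<html><body>
  <img src="/images/banner.gif">
  <a href="/about.html">About</a>
  <script src="/js/player.js"></script>
</body></html>
"""

collector = ResourceCollector()
collector.feed(page)
print(collector.urls)  # ['/images/banner.gif', '/about.html', '/js/player.js']
```

Each collected URL would then be checked against the archive; any that resolve to "Not in Archive" are the candidates that surface in the "View Missing URLs" list.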
Understanding the Wayback QA tab
Each collection's management page includes a tab for "Wayback QA," which lists all the URLs detected as missing by enabling Wayback QA while reviewing production and saved test crawl captures.
By default, this tab opens to a sub-tab called "Missing Documents," from which you may filter specific missing URLs by way of the search bar at the top or else browse them all in the table below. These URLs may in turn be organized by the column headings in the table:
- Missing Document: The specific URL of the resource detected as missing from Wayback
- Doc. Missing From: The URL for the web page on which Wayback detected the link, embedded element, or similar resource to be missing.
- Possible Reason: A diagnosis of why the Missing Document was not archived during the initial crawl. In cases where this value reads "Not Crawled," it may simply be patch crawled for inclusion. In cases where this value reads "Blocked by Robots.txt," it can be patch crawled if you elect to ignore the robots.txt exclusion. In cases where this value reads "Invalid URL," it is not necessarily a live URL on the web, but rather a dynamically generated URL string for a resource that may not exist, and so might not patch crawl successfully.
- File Type: The MIME type of the Missing Document, such as image/gif, text/html, application/pdf, etc.
- Size: The data volume of the Missing Document
- Capture Date: The date of capture for the archived web page from which the given resource is missing
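The columns above can be thought of as fields on each row of the Missing Documents table. The hypothetical sketch below models such rows and shows one way the "Possible Reason" values drive the workflow: "Not Crawled" rows can be patch crawled directly, "Blocked by Robots.txt" rows only if you elect to ignore the exclusion, and "Invalid URL" rows may be left alone. The field names and selection logic are illustrative assumptions, not Archive-It's implementation.

```python
from dataclasses import dataclass

@dataclass
class MissingDocument:
    url: str           # Missing Document
    missing_from: str  # Doc. Missing From
    reason: str        # Possible Reason
    file_type: str     # File Type (MIME)

def select_for_patch_crawl(rows, ignore_robots=False):
    """Return the URLs worth submitting to a patch crawl."""
    allowed = {"Not Crawled"}
    if ignore_robots:
        allowed.add("Blocked by Robots.txt")
    return [row.url for row in rows if row.reason in allowed]

# Invented example rows.
rows = [
    MissingDocument("http://example.com/a.gif", "http://example.com/", "Not Crawled", "image/gif"),
    MissingDocument("http://example.com/b.css", "http://example.com/", "Blocked by Robots.txt", "text/css"),
    MissingDocument("http://example.com/x?id=9999", "http://example.com/", "Invalid URL", "text/html"),
]

print(select_for_patch_crawl(rows))
# ['http://example.com/a.gif']
print(select_for_patch_crawl(rows, ignore_robots=True))
# ['http://example.com/a.gif', 'http://example.com/b.css']
```

Note that the "Invalid URL" row is excluded in both cases, matching the caveat that dynamically generated URL strings might not patch crawl successfully.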
Running patch crawls
All Missing Documents listed in the Wayback QA tab may be patch crawled immediately and in bulk by clicking on the Patch Crawl All button at the top-right of the table. If you desire, you may also specify select URLs to patch crawl by clicking on the check box next to each in the table, then clicking on the Patch Crawl Selected button at the top-left of the table. Regardless of how you choose to select your URLs, clicking on either of these buttons will open a dialog box that confirms your choice to launch a patch crawl and gives you the option of ignoring the robots.txt for any URLs that were specifically blocked by that protocol.
Click on the Crawl button to launch your patch crawl of any Missing Documents. Once launched, it will be added to your list of current crawls. When completed, you may reference its report under the Wayback QA tab's "Patch Crawl Reports" sub-tab.
Ignoring undesired documents
In some cases, Missing Documents present no significant value to your collections and need not be patch crawled: invalid URLs, URLs for undesired content such as advertising and customer tracking elements, or any other resources too voluminous or out of scope to warrant inclusion. If you choose, you may always 'ignore' these URLs by clicking on their respective check boxes in the table, then on the Ignore Selected URLs button at the top-left of the table. This action moves the selected URLs from the table to the Wayback QA tab's "Ignored Documents" sub-tab. You may navigate to that sub-tab and select the URLs again at any time in order to move them back into your list of Missing Documents.
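The ignore-and-restore behavior described above amounts to moving URLs between two lists, with no URL ever discarded outright. A minimal toy model of that bookkeeping, with invented names and URLs:

```python
class QATab:
    """Toy model of the Missing Documents / Ignored Documents sub-tabs."""
    def __init__(self, missing):
        self.missing = set(missing)
        self.ignored = set()

    def ignore(self, urls):
        """Move URLs from Missing Documents to Ignored Documents."""
        for url in urls:
            if url in self.missing:
                self.missing.discard(url)
                self.ignored.add(url)

    def restore(self, urls):
        """Move URLs from Ignored Documents back to Missing Documents."""
        for url in urls:
            if url in self.ignored:
                self.ignored.discard(url)
                self.missing.add(url)

tab = QATab({"http://example.com/ad.js", "http://example.com/page.html"})
tab.ignore({"http://example.com/ad.js"})
print(sorted(tab.missing))  # ['http://example.com/page.html']
tab.restore({"http://example.com/ad.js"})
print(len(tab.missing))     # 2
```

Because ignoring is just a move between sub-tabs, the decision is always reversible, as the restore step shows.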