What is Wayback QA?
Wayback QA is an automated quality assurance tool that scans the Wayback page you're viewing and identifies documents that were not captured during the initial crawl (e.g., blocked by robots.txt or out of scope), giving you the option to patch those documents back into your Wayback page via a Patch Crawl. If the page you are viewing in Wayback is missing embedded elements like stylesheets, images, or other functionality, Wayback QA and patch crawling may be able to improve capture and replay.
Please note that Wayback QA will only look for content missing from the page you are currently viewing. Links out to missing pages will not generally be picked up using Wayback QA and in most cases should be scoped in using scoping rules or additional seed URLs.
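Conceptually, the detection step scans the rendered page for embedded resource URLs and checks whether each one was captured. The following is a simplified sketch of that idea (not Archive-It's actual implementation; the page markup and capture set here are invented for illustration):

```python
from html.parser import HTMLParser

class EmbeddedResourceExtractor(HTMLParser):
    """Collect URLs of embedded resources (images, stylesheets, scripts)."""
    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.resources.append(attrs["href"])

def find_missing(html, is_captured):
    """Return embedded resource URLs for which is_captured(url) is False."""
    parser = EmbeddedResourceExtractor()
    parser.feed(html)
    return [url for url in parser.resources if not is_captured(url)]

# Example: pretend only style.css was captured during the initial crawl.
page = ('<html><head><link rel="stylesheet" href="style.css"></head>'
        '<body><img src="logo.gif"><script src="app.js"></script></body></html>')
captured = {"style.css"}
print(find_missing(page, lambda u: u in captured))  # ['logo.gif', 'app.js']
```

In the real tool, the capture check happens against your collection's archived holdings rather than an in-memory set, but the principle — walk the page, test each embedded URL, report the gaps — is the same.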
Enable Wayback QA
- Begin by accessing one or more of your seed sites in Wayback mode.
- Click on the "Enable QA" link in the Wayback banner. This feature scans your view for missing elements. Note that you must be logged in to the Archive-It web application in order to see the Enable QA link in the Wayback banner. If you are logged in and still cannot see the link, log out of the web application and log back in, and refresh your Wayback view.
- After the Wayback page has finished reloading, click the "View Missing URLs (# Detected)" link in the banner.
- This will open the "Wayback QA" tab in our web application which lists all the URLs detected as missing by Wayback QA.
Select Documents
The "Wayback QA" tab opens to the "Missing Documents" sub-tab so that you can review the list and identify documents that you would like to patch crawl. You can filter the list for specific missing URLs using the search bar at the top. In the table below, URLs are organized under the following sortable columns:
Column Name | Description
--- | ---
Missing Document | The specific URL of the resource detected as missing from Wayback.
Doc. Missing From | The URL of the web page on which Wayback detected the missing link, embedded element, or similar resource.
Possible Reason | A diagnosis of why the missing document was not archived during the initial crawl. "Not Crawled" documents may simply be patch crawled for inclusion. "Blocked by Robots.txt" documents can be patch crawled if you elect to ignore the robots.txt exclusion. "Invalid" documents may not be live URLs on the web, but rather dynamically generated URL strings for non-existent resources; these may not patch crawl successfully.
File Type | The MIME type of the missing document, such as image/gif, text/html, or application/pdf.
Size | The data volume of the missing document.
Crawl Date | The date of capture for the archived web page from which the given resource is missing.
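The "Blocked by Robots.txt" diagnosis means the site's robots.txt rules disallowed the crawler from fetching that URL. You can check such a rule yourself with Python's standard `urllib.robotparser`; the rules, user agent string, and URLs below are invented for illustration:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks a /media/ directory for all agents.
robots_txt = """\
User-agent: *
Disallow: /media/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# URLs under /media/ would surface in Wayback QA as "Blocked by Robots.txt".
print(rp.can_fetch("examplebot", "https://example.com/media/logo.gif"))  # False
print(rp.can_fetch("examplebot", "https://example.com/index.html"))      # True
```

Patch crawling with the "ignore robots.txt" option simply tells the crawler to fetch these URLs despite such a rule.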
In some cases, missing documents will not add value to your collection and do not need to be patch crawled. This is true of most invalid URLs, as well as URLs for undesired content like advertising, customer tracking, and out-of-scope resources.
You can exclude such URLs from the "Wayback QA" tab by clicking on the check boxes to their left, and then on the "Ignore Selected URLs" button at the top of the table.
This action moves the selected URLs to the "Ignored Documents" sub-tab. You may navigate to that sub-tab and select the URLs again at any time in order to move them back into your list of missing documents.
Run patch crawls
Use the check boxes to the left of the missing documents listed in the "Wayback QA" tab to select them for patch crawling. Next, click the "Patch Crawl Selected" button at the top left of the table. This will open a dialog box that confirms your choice to launch a patch crawl and gives you the option of ignoring robots.txt for any URLs that were blocked by it.
Please note that a patch crawl will only capture the individual documents in your patch crawl list. Unlike One-Time, Test, or Scheduled crawls, patch crawls will not follow links from documents or capture embedded content within them, and they do not adhere to collection- or seed-level scoping rules.
Click the "Crawl" button to start your patch crawl. Once launched, it will be added to your list of current crawls. When it completes, you can view its report under the Wayback QA tab's "Patch Crawl Reports" sub-tab. Like all other crawls, patch crawls may take up to 24 hours after completion to become available in Wayback.
Pro Tip: When patch crawling many web pages or multiple websites, we recommend consolidating your patch crawls to make them more efficient. To do so, detect the missing URLs by enabling QA on all of the Wayback pages that you wish to patch, but do not move on to the patch crawl step until you have finished this process. Then, once the detection of missing documents is complete, navigate in the web application to the "Wayback QA" tab in the relevant collection to select and patch crawl URLs that were detected across the entire collection.