Overview
Wayback QA is an automated quality assurance tool. It scans the Wayback page you're viewing and identifies documents that the crawler did not capture initially (for example, because they were blocked by robots.txt or were out of scope). It then gives you the option to patch those documents back into your Wayback page with a patch crawl. This article provides step-by-step instructions for using the Wayback QA tool to start patch crawls.
Prerequisites
You need Wayback pages that display the yellow banner, from saved test crawls or production crawls (One-Time or Scheduled). You must also be logged in to your Archive-It account in the same browser.
Enable Wayback QA
Browse your Wayback pages until you find content missing from the page you are currently viewing. If the page is missing embedded elements like images, stylesheets, or other functionality, patch crawls can help. Activate the Wayback QA tool to detect the missing URLs for the patch crawl:
- Click the Enable QA link in the yellow banner.
- The page reloads with the tool activated, and the link changes to Disable QA.
- Browse the Wayback page(s) with missing content again to detect the Missing URLs.
View Missing URLs
When you finish browsing Wayback pages with missing content:
- Click the View Missing URLs (# detected) link in the banner.
- A new browser tab opens in your Archive-It account, on the Wayback QA tab for that collection.
- The Wayback QA tab lists the Missing URLs detected from the last URL viewed.
Select Documents
The Wayback QA tab opens on the Missing Documents subtab and presents the Missing Documents List from the last Wayback URL viewed.
You can clear the filter with the X to see a list of all Missing Documents in the collection. You can then use this filter to find specific URLs or file extensions.
You can select the Missing Documents from the list with the check boxes next to them.
The table's columns, listed here from left to right, can help you decide which URLs to select:
- Missing Documents – URLs detected as missing by the Wayback QA tool.
- Doc Missing From – URL for the live web page the document is from.
- Possible Reason – Why the missing document was not archived during the initial crawl.
- File Type – The MIME type of the Missing Document, such as image/gif, text/html, etc.
- Size – The data volume of each missing document.
- Crawl Date – The date for the crawl that collected the Wayback page.
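If you keep your own notes on the list outside the web application, the columns above map naturally onto a simple record. The Python sketch below is illustrative only: the field names, the `filter_documents` helper, and the sample rows are assumptions made for this example and are not part of Archive-It. The helper imitates filtering the list by URL text or file extension.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class MissingDocument:
    """One row of the Missing Documents table (field names are hypothetical)."""
    url: str              # Missing Documents column
    missing_from: str     # Doc Missing From column
    possible_reason: str  # Possible Reason column
    file_type: str        # File Type column (MIME type)
    size_bytes: int       # Size column
    crawl_date: str       # Crawl Date column


def filter_documents(docs, text="", extension=""):
    """Return rows whose URL contains `text` and whose path ends with `extension`."""
    matches = []
    for doc in docs:
        path = urlparse(doc.url).path.lower()
        if text.lower() in doc.url.lower() and path.endswith(extension.lower()):
            matches.append(doc)
    return matches


# Made-up sample rows, for illustration only.
rows = [
    MissingDocument("https://example.org/img/logo.gif", "https://example.org/",
                    "Blocked by robots.txt", "image/gif", 2048, "2020-01-15"),
    MissingDocument("https://example.org/css/site.css", "https://example.org/",
                    "Out of scope", "text/css", 512, "2020-01-15"),
]

gif_rows = filter_documents(rows, extension=".gif")
```

Here `gif_rows` holds only the row for `logo.gif`. Calling `filter_documents(rows)` with no arguments returns every row, much like clearing the filter.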
Some missing documents will not add value and don't need to be patch crawled. This includes invalid URLs, advertising, and customer tracking, as well as out-of-scope URLs.
You can exclude such URLs by clicking on their check boxes, and then on the Ignore Selected URLs button. This will move the selected URLs to the Ignored Documents sub-tab.
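The same triage can be sketched in code for readers who review long lists of URLs offline. This Python example is a rough sketch under stated assumptions: the `SKIP_HOST_HINTS` values and the `should_ignore` and `triage` helpers are hypothetical, and the rules are deliberately simple. It cannot tell whether a URL is out of scope for your collection, so that judgment still has to be made by hand.

```python
from urllib.parse import urlparse

# Hypothetical host prefixes that suggest advertising or tracking services.
SKIP_HOST_HINTS = ("ads.", "adservice.", "tracker.", "analytics.")


def should_ignore(url):
    """Return True for URLs that are invalid or look like ads or tracking."""
    try:
        parsed = urlparse(url)
    except ValueError:
        return True  # Malformed URL, so not worth patch crawling
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return True  # Not a fetchable web URL
    host = parsed.hostname.lower()
    # Match the hint at the start of the host or at the start of any later label.
    return any(host.startswith(hint) or ("." + hint) in host
               for hint in SKIP_HOST_HINTS)


def triage(urls):
    """Split URLs into (to_patch, to_ignore) lists, preserving order."""
    to_patch, to_ignore = [], []
    for url in urls:
        (to_ignore if should_ignore(url) else to_patch).append(url)
    return to_patch, to_ignore


# Made-up sample URLs, for illustration only.
to_patch, to_ignore = triage([
    "https://example.org/img/logo.gif",
    "https://ads.example.net/banner.js",
    "javascript:void(0)",
])
```

In this sample, only the `logo.gif` URL lands in `to_patch`. The other two would be candidates for the Ignore Selected URLs button.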
Run patch crawls
When you select the Missing Documents with their check boxes, the Run Patch Crawls button is activated.
- Click the Run Patch Crawls button.
- A box opens to launch the patch crawl.
- Click the Ignore Robots.txt check box.
- Click the Crawl button to start your patch crawl.
Outcome
The patch crawl will be added to your list of current crawls. It collects only the individual documents in your patch crawl list. Patch crawls do not follow links from the documents they collect and do not adhere to scoping rules.
When completed, you can see the crawl report in the Wayback QA tab under the Patch Crawl Reports sub-tab.
Patch-crawled documents can take up to 24 hours after the crawl completes to become available in Wayback. After that, view your Wayback page again to check for improved replay.
Related content
How to change scope and run patch crawls from your Hosts report