Overview
Check your archived Wayback pages soon after crawls complete to be sure they meet expectations. Allow at least an extra 24 hours for the Wayback pages to index before viewing them. If in doubt, compare a Wayback page with the same page on the live web. If your Wayback pages don't look right, this article describes extra quality assurance steps you can try.
On this page:
- Access Wayback pages
- Review Crawl Reports
- Use Wayback QA
- Patch crawl blocked documents
- Crawl additional 'helper seeds'
- Open linked Wayback pages in new tabs
Access Wayback pages
What it is
Wayback pages are archived versions of URLs that your crawls have collected. Each Wayback URL ends with the collected URL and begins with a prefix that marks it as an archived copy. For example:
https://wayback.archive-it.org/17968/20230531175508/https://bacaa.art/
How it works
Accessing your Wayback pages depends on whether your crawl was a test crawl or a production crawl (One-Time, Scheduled, or Saved test crawl).
You can only access unsaved test crawls' Wayback pages through the test crawl's Seeds report, using the seed URLs' Wayback > links.
Unsaved test crawls' Wayback URLs will have -test after the collection number. For example:
https://wayback.archive-it.org/17968-test/20230531175508/https://bacaa.art/
Unsaved test crawls' Wayback pages have a blue banner at the top, and they can't be patch crawled until the test crawl is saved.
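The Wayback URL shapes shown in this section can be taken apart or rebuilt with a few lines of Python, which may help when checking many pages at once. This is a minimal sketch inferred only from the two example URLs here: the function and field names are hypothetical, and real Wayback URLs may carry variations that this pattern does not cover.

```python
import re
from typing import NamedTuple

WAYBACK_HOST = "https://wayback.archive-it.org"


class WaybackUrl(NamedTuple):
    collection: str  # collection number, e.g. "17968"
    is_test: bool    # True when the collection segment ends in "-test"
    timestamp: str   # capture timestamp digits, e.g. "20230531175508"
    original: str    # the collected URL at the end of the Wayback URL


# Assumed layout: <host>/<collection>[-test]/<timestamp>/<collected URL>
_PATTERN = re.compile(
    r"^https://wayback\.archive-it\.org/(\d+)(-test)?/(\d+)/(.+)$"
)


def parse_wayback_url(url: str) -> WaybackUrl:
    """Split a Wayback URL into its parts, or raise ValueError."""
    match = _PATTERN.match(url)
    if match is None:
        raise ValueError(f"not a recognized Wayback URL: {url}")
    collection, test_suffix, timestamp, original = match.groups()
    return WaybackUrl(collection, test_suffix is not None, timestamp, original)


def build_wayback_url(collection: str, timestamp: str, original: str,
                      is_test: bool = False) -> str:
    """Assemble a Wayback URL in the same layout as the examples."""
    suffix = "-test" if is_test else ""
    return f"{WAYBACK_HOST}/{collection}{suffix}/{timestamp}/{original}"
```

Parsing a URL and reading `is_test` is a quick way to tell whether a page came from an unsaved test crawl without opening it to look at the banner color.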
Production crawls' Wayback pages can be accessed a few ways:
- Wayback > links in the crawls' Seeds reports (same as test crawls above)
- Wayback > links in the Collections' Seeds lists
- Archives tab search inside your Archive-It account
- Public collections' URL access points
These Wayback pages will have a yellow banner across the top and they can be patch crawled.
Once your Wayback pages are open, browse them by scrolling through them and following in-scope links. Try hovering over or clicking on any dynamic and interactive content. If content seems to be missing, review your crawl's report to see if it was collected.
Review Crawl Reports
What it is
Crawl reports include detailed lists of what documents the crawler saw and collected during crawl time. Reviewing the crawl reports' lists helps determine if certain documents were collected or not.
How it works
A crawl status other than "Finished" (e.g., "Finished: Time Limit" or "Finished: Data Limit") can indicate an incomplete crawl.
Check the Hosts report for numbers in the 3 right-hand columns: Blocked, Queued, and Out of Scope. These sometimes reveal missing documents that may help improve replay of Wayback pages. Click directly on the numbers to see the detailed lists of documents from that row's host.
Numbers in the Blocked column show there was a robots.txt exclusion on those documents from that row's host:
- You can add a rule to "Ignore Robots.txt" for future crawls.
- You can also patch crawl these documents directly from the Hosts report (see below).
Numbers in the Queued column show documents that the crawler saw and lined up for crawling but did not reach. You'll need a longer crawl or a higher data or document limit to collect them all. Keep in mind that the crawler may not yet have determined whether these documents were in scope. A very high number (tens or hundreds of thousands) could indicate a crawler trap. If the queued documents look useful for your collection:
- You can resume the production crawl if it's still within 7 days of completing.
- You can try a new test crawl with a longer duration.
Numbers in the Out of Scope column show the documents the crawler saw as out of bounds for the crawl. Possible reasons may include (but are not limited to):
If these documents look important, you can try expanding scope for the common strings found in them. Then test crawl the seed with your new scoping rule and adjust if necessary.
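The column-by-column checks above can be summarized as a small triage routine. The sketch below is illustrative only: it assumes you have copied one host's counts into a dictionary with the hypothetical keys `blocked`, `queued`, and `out_of_scope`, and the 10,000 cutoff is an assumed stand-in for "tens or hundreds of thousands", not a documented threshold.

```python
from typing import Dict, List

# Assumed cutoff for a "very high" queued count; the article gives no
# exact number, so adjust this for your own crawls.
QUEUED_TRAP_THRESHOLD = 10_000


def triage_host(row: Dict[str, int]) -> List[str]:
    """Return follow-up suggestions for one Hosts report row."""
    suggestions = []
    if row.get("blocked", 0) > 0:
        # Blocked documents were excluded by robots.txt.
        suggestions.append("blocked: patch crawl or add an Ignore Robots.txt rule")
    queued = row.get("queued", 0)
    if queued >= QUEUED_TRAP_THRESHOLD:
        suggestions.append("queued: check for a crawler trap before extending the crawl")
    elif queued > 0:
        suggestions.append("queued: resume within 7 days or run a longer test crawl")
    if row.get("out_of_scope", 0) > 0:
        suggestions.append("out of scope: review and consider expanding scope")
    return suggestions


if __name__ == "__main__":
    # Hypothetical counts for a single host.
    for note in triage_host({"blocked": 12, "queued": 250_000, "out_of_scope": 0}):
        print(note)
```

The suggestions are only prompts for a manual review of the detailed document lists; the counts alone don't show whether the documents matter to your collection.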
Use Wayback QA
What it is
Wayback QA is a tool that can detect missing documents from Wayback pages with yellow banners. It then allows you to patch crawl those documents. It's best practice to patch crawl Wayback pages sooner rather than later, before the live web pages have changed too much and the Missing URLs' files are no longer available.
How it works
When a Wayback page seems incomplete, and you're logged into your Archive-It account in the same browser, you can click Enable QA in the yellow banner. This activates the Wayback QA tool to search for the missing documents on the page. You can then follow the instructions here to launch the patch crawls for Missing URLs it detects.
Patch crawl blocked documents
What it is
When documents appear in the Blocked column of your production crawl's Hosts Report, you can patch crawl those documents directly from the Hosts report.
How it works
After checking the list and determining these are important documents, select the host with the checkbox to its left. This activates the blue bar above the table of hosts, where you can click the Run Patch Crawl button. When the patch crawl completes, a medical kit icon will show up next to the host.
Crawl additional 'helper seeds'
What it is
For Wayback pages with dynamic content that don't look right, try adding the URL for the subpage as its own individual seed. This helps focus the crawler on the content it may have missed in a larger crawl of the main seed. These seeds will feed the content they collect into the main seed once the collection has had a chance to index.
How it works
When adding a "helper seed", set its access to private. This ensures it won't become an access point on your public collection page. Also set its seed type to One Page, which keeps the amount of data collected small.
Crawl the "helper seed" together with the main seed, if possible. After the crawl has indexed, the replay of the main seed's Wayback page should include the content from the helper seed.
If the content has embedded video or audio, consider adding scoping rules for the platform to the helper seed:
- Check to see if it's compliant with youtube-dl.
- If it's not compliant, collection of the video file or audio file is still possible, but replay on the Wayback page or through the banner isn't likely.
- If you have a previous crawl report for the seed, you can check if it's compliant:
  - Enter "youtube-dl" in the filter box of the Hosts report.
  - Find the matching youtube-dl file containing the page's URL (e.g., youtube-dl:https://www.jmkac.org/exhibition/magic-city/).
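If you have copied the youtube-dl entries from a crawl report into a list, matching them against a page URL can be sketched as follows. The only detail taken from this article is the `youtube-dl:` prefix followed by the page URL; the function names and the idea of working from a copied list are assumptions for illustration.

```python
from typing import Iterable, List

# Prefix seen in the example entry in this article.
YOUTUBE_DL_PREFIX = "youtube-dl:"


def youtube_dl_entry(page_url: str) -> str:
    """Build the entry text to look for, e.g. for pasting into a filter box."""
    return YOUTUBE_DL_PREFIX + page_url


def find_youtube_dl_entries(entries: Iterable[str], page_url: str) -> List[str]:
    """Return the copied report entries that are youtube-dl records for page_url."""
    return [
        entry for entry in entries
        if entry.startswith(YOUTUBE_DL_PREFIX) and page_url in entry
    ]
```

An empty result suggests the page had no youtube-dl record in that crawl, which is worth confirming in the report itself before drawing conclusions about replay.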
Open linked Wayback pages in new tabs
What it is
When a Wayback link won't work, try opening the linked page in a new tab. Some links have hash marks (#) in their URLs. Others are built with JavaScript. Still others are complex, like links to YouTube videos' watch pages from YouTube channels or playlists.
How it works
If a link on a Wayback page doesn't open when you click it, right-click the link and select the option to open it in a new tab.
Note that this is the only way to open YouTube video watch pages from YouTube channels or playlists.
Related content
Reading your crawl's hosts report