Once your crawl has completed, it's important to conduct quality assurance (QA) to check that your Wayback links replay according to your expectations. If anything was not captured, it is better to catch the problem while a solution may still be found, rather than months later when the content may have already changed on the live web. Use the steps below to guide you through the QA process and, if necessary, improve the capture of your crawl.
Review Crawl Reports
- Review the information in your crawl's "Overview" tab in order to make sure that the crawl completed successfully (see "Status") and that the volume of material archived meets your expectations. Errant or obstructed crawls may require that you modify the scope of your collection in order to more accurately target our crawler.
- Review the information in your crawl's Seeds report, particularly each "Seed Status", to confirm that all of your seeds were crawled successfully, or to find out whether robots.txt exclusions or other errors prevented any seeds from being crawled.
- Review the information in your crawl's Hosts report to determine whether any valuable hosts were blocked from crawling by robots.txt exclusions or deemed outside the scope of your crawl. Queued documents that do not point to a crawler trap may indicate that the time, document, or data limit on the crawl should be extended in order to capture missing elements. Some scoping changes can be made directly from the Hosts report; see "Modify scope and run patch crawls from your report".
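If a seed or host appears to have been blocked by robots.txt, it can help to confirm the exclusion yourself against the live site's rules. The following is a minimal sketch using Python's standard `urllib.robotparser`. The helper name `blocked_seeds`, the sample rules, and the example URLs are hypothetical, and the user-agent string `archive.org_bot` is an assumption that you should check against what your crawl actually sends.

```python
from urllib.robotparser import RobotFileParser


def blocked_seeds(robots_txt, seed_urls, user_agent="archive.org_bot"):
    """Return the seed URLs that the given robots.txt text disallows.

    The default user agent is an assumption; substitute the string
    your crawler actually identifies itself with.
    """
    parser = RobotFileParser()
    # Parse the rules from text so the check does not need network access.
    parser.parse(robots_txt.splitlines())
    return [url for url in seed_urls if not parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt content and seed list, for illustration only.
sample_robots = """\
User-agent: *
Disallow: /private/
"""

seeds = [
    "https://example.org/",
    "https://example.org/private/report.html",
]

print(blocked_seeds(sample_robots, seeds))
```

In practice you would paste in the text of the site's own robots.txt file and your collection's seed list. Any URL the function returns is one that the rules exclude for that user agent.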
Browse in Wayback
- Begin by accessing one or more of your seed sites in Wayback mode.
- Browse through your archived site(s), clicking links and activating dynamic media players to make sure that they were archived in accordance with your expectations. Whenever possible, and especially when assuring the quality of dynamic components like video and interactive applets, be sure to double-check your archives by browsing in Wayback's Proxy Mode. (If you have an especially extensive site or collection of sites in your archives, we recommend prioritizing the pages and elements that are most valuable to or representative of your collection in order to optimize the time you commit to this process.)
- For production and saved test crawls only, if you notice links leading to pages with "Not in Archive" messages or embedded elements that do not render, you can identify and capture the relevant URLs with the Wayback QA and patch crawling mechanisms, which are explained below.
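When you prioritize a handful of pages for review, it can be convenient to prepare the list of replay links in advance. The sketch below assumes the replay URL pattern `https://wayback.archive-it.org/<collection id>/<timestamp>/<original URL>`; confirm the pattern against a link copied from your own collection before relying on it. The function name, the collection ID `1234`, and the page URLs are hypothetical.

```python
def replay_url(collection_id, original_url, timestamp="*"):
    """Build a Wayback replay link for one page.

    Assumes the URL pattern described in the text above. A timestamp
    of "*" is assumed to request the list of all captures rather than
    one specific capture.
    """
    return f"https://wayback.archive-it.org/{collection_id}/{timestamp}/{original_url}"


# Hypothetical collection ID and priority pages, for illustration only.
priority_pages = [
    "https://example.org/",
    "https://example.org/videos/",
]

checklist = [replay_url(1234, page) for page in priority_pages]
for link in checklist:
    print(link)
```

Opening each link in turn gives you a fixed QA checklist, so the same representative pages are reviewed after every crawl.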
Further Actions
If you see something unexpected, there are steps you can take:
Patch Crawl via Host Report
Your crawl report can help you decide how your crawl might be better scoped to capture more or better content in the future, and identify specific missing elements to add to your completed crawl right away. You can modify your collection's crawl scope and run patch crawls directly from your crawl reports.
Use Wayback QA
Use Wayback QA on the pages with missing content. If the tool is able to find necessary missing URLs, run a patch crawl on them.
Crawl as a Seed
If the missing content is only on a specific page, identify whether that page was crawled as a seed or whether it was a subpage of another seed. If it was a subpage, consider crawling it as its own seed. This will allow it to be crawled by Umbra, which can often capture embedded or dynamic content better than Heritrix alone.
Open Links in a New Tab
If you encounter links in an archived page that don't prompt an action in your browser, you may have luck opening them in a new tab. This is a common issue for links in Wayback that are built using JavaScript or that contain a #.
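To anticipate which links on a page are likely to behave this way, you can scan the page's HTML for anchors whose `href` contains a # or starts with `javascript:`, or that carry an `onclick` handler. This is a rough sketch using Python's standard `html.parser`. The class name `TrickyLinkFinder` and the sample markup are hypothetical, and the three checks are a heuristic that will not catch every JavaScript-built link.

```python
from html.parser import HTMLParser


class TrickyLinkFinder(HTMLParser):
    """Collect href values of links that may not respond to a normal click."""

    def __init__(self):
        super().__init__()
        self.tricky = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = (attr_map.get("href") or "").strip()
        # Heuristic: fragment links, javascript: URLs, and onclick handlers.
        if (
            "#" in href
            or href.lower().startswith("javascript:")
            or "onclick" in attr_map
        ):
            self.tricky.append(href)


# Hypothetical page markup, for illustration only.
sample_html = (
    '<a href="/about">About</a>'
    '<a href="#tab-2">Second tab</a>'
    '<a href="javascript:void(0)">Menu</a>'
)

finder = TrickyLinkFinder()
finder.feed(sample_html)
print(finder.tricky)
```

Links that the scan reports are good candidates for the open-in-a-new-tab approach during QA.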