Overview
We always recommend running a test crawl first when adding new seeds to a collection. This article provides instructions for running, monitoring, and reviewing test crawls. It also provides directions for saving or deleting test crawl data, best practices, and an overview of the test crawl data de-duplication process.
On this page:
- Why use test crawls
- Instructions
- How to run a test crawl
- How to monitor a test crawl in progress
- How to review the results of a test crawl
- How to save or discard test crawl data
- Why should I delete a test crawl?
- Data de-duplication in test crawls
- Video tutorial on test crawls
- Related content
Why use test crawls
Unlike production crawls, test crawls do not automatically add data or Wayback captures to your account; the data must be saved manually before it is applied. For this reason, whenever you create a new collection or add new seeds to an existing collection, we highly recommend running and evaluating a test crawl before you permanently add any new data to your Archive-It account.
Test crawls are an easy way to get an accurate indication of how your seeds will crawl, archive, and replay before you expend any of your account's data budget. They allow you to review and plan for the crawl duration, scoping rules, and other factors that can make your future "full production" crawls as successful and efficient as possible.
Instructions
How to run a test crawl
Once you have added one or more new seeds to a collection, you can run a test crawl to observe how they archive and replay and, if necessary, take steps to improve that process. To do so, follow the directions provided in our full guidance on launching one-time and test crawls from a seeds list. Remember to select the "Test crawl" radio button next to "Crawl Type" in the dialog box.
Outcome
After you select "Test crawl" as the crawl type and adjust any other crawl parameters, your test crawl will begin once you click the "Crawl" button.
How to monitor a test crawl in progress
Once your new test crawl has been launched, you can monitor it from the "Crawls" section of your account, or from the "Current Crawls" or "Test Crawls" lists under the given collection's Crawls tab.
Click the Crawl ID link in the left-hand column of any of these lists to see a full crawl report for your test crawl as it runs.
Note: The one difference in monitoring capabilities between test and production crawls is that you cannot resume a stopped test crawl. That functionality exists only for full production crawls.
How to review the results of a test crawl
How to read test crawl reports
Once your test crawl is complete, its report can be found in the "Crawl Reports" or "Test Crawls" lists in the given collection's Crawls tab.
Click the Crawl ID link in the left-hand column of any of these lists to see your test crawl's reports. To read and interpret the information in these reports, consult our complete guidance on how to read your crawl's report.
How to browse test crawl results in Wayback
You can browse your crawl results in Wayback approximately 24 hours after your crawl completes. To do so, click the Wayback > link corresponding to each seed in your test crawl's Seeds report.
Note: Test crawl captures are stored separately from your permanent collection, giving you the chance to assess them before either saving them permanently or deleting them. The only place you can view unsaved test crawl results in Wayback is via the Seeds tab of your test crawl report.
Wayback captures from unsaved test crawls will have a blue banner at the top.
How to save or discard test crawl data
Test crawl data is stored for 60 days. After reviewing a completed test crawl, you can decide to save the crawl, which will permanently archive it in your collection, or delete the crawl if you want to refine your scoping and run a new test crawl. If you take neither action, your data will automatically expire and be deleted from Wayback 60 days after the crawl is completed; the crawl report will remain accessible. Once a crawl is saved, it cannot be deleted.
When the crawl is finished, you will see a banner at the top of the crawl report with options to save or delete the captures from this test crawl. The banner also indicates how long you have left to make your decision before the crawl automatically expires.
Tip: While the WARC data for any expired/deleted test crawl will no longer be available to browse in Wayback, the test crawl report data will remain available for your future reference.
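If you want to track the retention window yourself, the 60-day rule described above can be worked out with simple date arithmetic. The following is a minimal sketch only: the function names are hypothetical, and the countdown shown in the crawl report remains the authoritative source for when a test crawl expires.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 60  # retention period for unsaved test crawl data, per this article


def expiration_date(completed_at: datetime) -> datetime:
    """Return the date on which unsaved test crawl data would expire."""
    return completed_at + timedelta(days=RETENTION_DAYS)


def days_remaining(completed_at: datetime, now: datetime) -> int:
    """Return whole days left to save or delete the crawl (never negative)."""
    remaining = expiration_date(completed_at) - now
    return max(remaining.days, 0)


# Example: a test crawl that completed on 1 March 2024 (UTC).
completed = datetime(2024, 3, 1, tzinfo=timezone.utc)
print(expiration_date(completed).date())  # 2024-04-30
```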
Why should I delete a test crawl rather than let it expire?
If you aren't satisfied with the results of a test crawl, we recommend deleting it and trying again rather than just letting the crawl expire on its own. This is because data captured in previous crawls can impact replay of content captured in future crawls and vice versa.
The goal of Wayback replay (both test and production) is to load complete pages whenever possible. It uses the timestamps and checksum values of archived documents in the CDX index to determine what to load, rather than looking at which crawl a document came from. This means that a document that was missing from test crawl A, but was captured in test crawl B, could potentially replay in captures from both crawls.
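The behavior described above can be illustrated with a simplified model. This sketch assumes that replay picks, for each requested URL, the indexed capture closest in time to the page being viewed, without consulting the crawl ID. The record fields, the 14-digit timestamp format, and the function name are all assumptions made for illustration, and the checksum is carried but not used in the selection; this is not Archive-It's actual replay code.

```python
from dataclasses import dataclass
from datetime import datetime

TS_FORMAT = "%Y%m%d%H%M%S"  # 14-digit timestamp form assumed for this sketch


@dataclass(frozen=True)
class IndexRecord:
    url: str
    timestamp: str
    digest: str
    crawl_id: str  # stored for reference, but never used to choose a capture


def closest_capture(index, url, requested_ts):
    """Return the record for `url` nearest in time to `requested_ts`, or None."""
    wanted = datetime.strptime(requested_ts, TS_FORMAT)
    candidates = [r for r in index if r.url == url]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: abs(datetime.strptime(r.timestamp, TS_FORMAT) - wanted),
    )


# Test crawl A captured the page but missed its logo; test crawl B captured both.
index = [
    IndexRecord("http://example.org/", "20240301120000", "sha1:AAA", "test-crawl-A"),
    IndexRecord("http://example.org/", "20240305120000", "sha1:AAA", "test-crawl-B"),
    IndexRecord("http://example.org/logo.png", "20240305120005", "sha1:BBB", "test-crawl-B"),
]

page = closest_capture(index, "http://example.org/", "20240301120000")
image = closest_capture(index, "http://example.org/logo.png", page.timestamp)
print(page.crawl_id, image.crawl_id)  # test-crawl-A test-crawl-B
```

In this model, the page from test crawl A replays with the logo from test crawl B, which is why leftover data from an unsatisfactory test crawl can blur your assessment of a later one.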
Data de-duplication in test crawls
Test crawls will de-duplicate against permanent data in your account, meaning any data captured via a production crawl (recurring or one-time) or saved test crawl. Test crawls do not de-duplicate against data captured in any unsaved test crawls.
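The rule above can be sketched as a comparison against a baseline built only from permanent data. This is a simplified illustration: keying records on a (URL, checksum) pair, the dictionary layout, and the function names are all assumptions, not a description of how the crawler implements de-duplication.

```python
def dedup_baseline(crawls):
    """Collect (url, digest) pairs from permanent crawls only.

    Production crawls and saved test crawls are flagged as permanent here;
    unsaved test crawls are skipped.
    """
    baseline = set()
    for crawl in crawls:
        if crawl["permanent"]:
            baseline.update(crawl["records"])
    return baseline


def split_new_and_duplicate(fetched, baseline):
    """Split fetched records into those that are new and those already held."""
    new = [rec for rec in fetched if rec not in baseline]
    duplicates = [rec for rec in fetched if rec in baseline]
    return new, duplicates


crawls = [
    {"id": "production-1", "permanent": True,
     "records": {("http://example.org/", "sha1:AAA")}},
    {"id": "unsaved-test-1", "permanent": False,
     "records": {("http://example.org/about", "sha1:CCC")}},
]

# A new test crawl fetches one document held permanently and one that
# exists only in an unsaved test crawl.
fetched = [
    ("http://example.org/", "sha1:AAA"),
    ("http://example.org/about", "sha1:CCC"),
]

baseline = dedup_baseline(crawls)
new, duplicates = split_new_and_duplicate(fetched, baseline)
print(len(new), len(duplicates))  # 1 1
```

Here the "about" page counts as new data even though an earlier unsaved test crawl already captured it, because unsaved test crawls are not part of the baseline.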
Video tutorial on running test crawls
Related Content
How to manually start test and one-time crawls
How to crawl new seeds immediately with InstaCrawl
Comments
2 comments
When I go to the "Crawl Reports" and "Test Crawls" tab I do not have a column for "View"- the last one I see is "Docs." I'm having this issue on multiple browsers and wondering if the system has been updated or if I'm missing something.
Hi Amanda,
Thanks for catching this! The "View" column no longer exists in these crawl tables. You can instead click on the Crawl ID (in the left-hand column) directly to view both in-progress and completed crawl reports. I've updated the article accordingly.