On this page:
- Why use test crawls
- How to do it
- Data De-duplication in test crawls
- Video tutorial on running test crawls
Why use test crawls
Whenever you create a new collection or add new seeds to an existing collection, we highly recommend running and evaluating a "test crawl" before you permanently add any new data to your Archive-It account. Production crawls will automatically add data to an account, so test crawls are an easy way to get an accurate indication of how your seeds will crawl, archive, and replay before you expend any of your account's data budget. They allow you to review and plan for the crawl duration, scoping rules, and other factors that can make your future "full production" crawls as successful and efficient as possible.
How to do it
How to run a test crawl
Once you have added a new seed or new seeds to a collection, you can run a test crawl in order to observe how they archive and replay, and, if necessary, take steps to improve that process. To do so, simply follow the directions provided in our full guidance on launching one-time and test crawls from a seeds list, remembering to click the radio button in the dialog box that indicates the crawl is indeed a "Test Crawl."
As with other one-time crawls, you can specify any data, document, or time limits that you wish to apply.
How to monitor a test crawl in progress
Once your new test crawl has been launched, you can monitor it from the "Crawls" section of the web application, or from the "Current Crawls" or "Test Crawls" lists under the given collection's Crawls tab.
Click the View > link in any of these lists in order to see a full crawl report for your test crawl as it runs. To understand and take action upon any aspect of this crawl, as you would any full production crawl, see our complete guidance on how to monitor your crawls. *Note: The one difference in monitoring capabilities between test and production crawls is that you cannot resume a stopped test crawl. That functionality exists only for full production crawls.
How to review the results of a test crawl
How to read test crawl reports
Once your test crawl has completed, you can find its reports listed either in the "Crawls" section of the web application, as you would any full production crawl, or in the "Crawl Reports" or "Test Crawls" lists in the given collection's Crawls tab.
Click the View > link in any of these lists in order to see your test crawl's full and completed reports. To read and interpret the information in these reports, consult our complete guidance on how to read your crawl's report.
How to browse test crawl results in Wayback
As with your full production crawls, you can browse the appearance of your test crawl results in Wayback, starting approximately 24 hours after your crawl completes. To do so, simply click the Wayback > link corresponding to each seed in your test crawl's Seeds report. The only place you can view unsaved test crawl results in Wayback is via the Seeds tab of your test crawl report.
As the blue banner on each page indicates, these are the results of a test crawl and must be either saved permanently, deleted manually, or allowed to automatically expire after 60 days. Directions for these options are provided below.
How to save or discard test crawl data
Test crawl data is temporary by default, but can be saved permanently at your discretion. After reviewing a completed test crawl you can decide to save the crawl, which will permanently archive it in your collection, or delete the crawl, if you want to refine your scoping and run a new test crawl. If you take neither action, your data will automatically expire and be deleted from Wayback 60 days after the crawl completed; report data will remain accessible. Saving test crawls includes saving all of the seeds in the test crawl.
Like regular "one-time" or recurring crawls, saved test crawls will require approximately 24 hours after the save process completes to index and appear in Wayback. The amount of time it takes a crawl to save will vary depending on its size and on how many other crawls are in the process of saving.
As soon as your test crawl completes, its report provides both options (save or delete) as buttons at the top of your view, along with a banner reminding you how long you have left to make your decision before the data automatically expires.
*Note: While the WARC data for any expired/deleted test crawl will no longer be available to browse in Wayback, the test crawl report data will remain available for your future reference.
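If you want to track the decision deadline outside the web application, the 60-day window can be worked out with simple date arithmetic. The following is a minimal sketch only: the function names are hypothetical, and it assumes you have noted the crawl's completion date yourself. The banner in the test crawl report remains the authoritative reminder.

```python
from datetime import date, timedelta

# From the guidance above: unsaved test crawl data expires 60 days
# after the crawl completed.
RETENTION_DAYS = 60


def expiry_date(completed_on: date) -> date:
    """Date on which an unsaved test crawl's data would expire."""
    return completed_on + timedelta(days=RETENTION_DAYS)


def days_left(completed_on: date, today: date) -> int:
    """Whole days remaining to save or delete the crawl (never negative)."""
    return max(0, (expiry_date(completed_on) - today).days)


# Hypothetical example: a test crawl that completed on 1 January 2024.
completed = date(2024, 1, 1)
print(expiry_date(completed))
print(days_left(completed, today=date(2024, 2, 1)))
```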
Why should I delete a test crawl rather than let it expire?
If you aren't satisfied with the results of a test crawl, we recommend deleting it and trying again rather than just letting the crawl expire on its own. This is because data captured in previous crawls can impact replay of content captured in future crawls and vice versa.
The goal of Wayback replay (both test and production) is to load complete pages whenever possible. It uses the timestamps and checksum values of archived documents in the CDX index to determine what to load, rather than looking at which crawl a document came from. This means that a document that was missing from test crawl A, but was captured in test crawl B, could potentially replay in captures from both crawls.
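To illustrate why captures from different crawls can mix during replay, here is a deliberately simplified model. It is not Wayback's actual implementation: the Capture record and the resolve function are hypothetical, and the sketch selects by timestamp only, ignoring checksums and everything else a real index lookup would consider. The point it shows is that the lookup never asks which crawl a document came from.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Capture:
    url: str
    timestamp: datetime
    digest: str  # checksum of the archived document (unused in this sketch)
    crawl: str   # kept for illustration only; resolve() never consults it


def resolve(index: List[Capture], url: str, requested: datetime) -> Optional[Capture]:
    """Return the capture of `url` closest in time to `requested`, from any crawl."""
    candidates = [c for c in index if c.url == url]
    if not candidates:
        return None
    return min(candidates, key=lambda c: abs(c.timestamp - requested))


# Hypothetical index: test crawl A missed the stylesheet, test crawl B captured it.
index = [
    Capture("https://example.org/", datetime(2024, 1, 1, 12, 0), "sha1:AAA", "test crawl A"),
    Capture("https://example.org/", datetime(2024, 1, 8, 12, 0), "sha1:AAA", "test crawl B"),
    Capture("https://example.org/style.css", datetime(2024, 1, 8, 12, 1), "sha1:BBB", "test crawl B"),
]

# Replaying the page at test crawl A's timestamp still finds the stylesheet,
# because the only capture of it lives in test crawl B.
stylesheet = resolve(index, "https://example.org/style.css", datetime(2024, 1, 1, 12, 0))
print(stylesheet.crawl)
```

In this model, deleting test crawl B would remove its stylesheet capture from the index, which is why deleting an unsatisfactory test crawl gives a cleaner picture of what the next crawl actually captured.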
Data De-duplication in test crawls
Test crawls will de-duplicate against permanent data in your account, meaning any data captured via a production crawl (recurring or one-time) or saved test crawl. Test crawls do not de-duplicate against data captured in any unsaved test crawls.
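The rule above can be sketched as a set lookup. This is an illustrative model under stated assumptions, not a description of the crawler's internals: the function name is hypothetical, and keying documents by a (URL, checksum) pair is an assumption made for the example.

```python
from typing import List, Set, Tuple

Document = Tuple[str, str]  # (url, checksum); an assumed key for illustration


def new_documents(crawled: List[Document], permanent: Set[Document]) -> List[Document]:
    """Documents a test crawl would count as new data.

    Only permanent data (production crawls and saved test crawls) is consulted.
    Data from unsaved test crawls is deliberately not part of the lookup.
    """
    return [doc for doc in crawled if doc not in permanent]


# Hypothetical account state.
permanent = {("https://example.org/", "sha1:AAA")}             # from a production crawl
unsaved_test = {("https://example.org/about", "sha1:BBB")}     # from an earlier, unsaved test crawl

# A new test crawl captures both documents again, unchanged.
crawled = [
    ("https://example.org/", "sha1:AAA"),
    ("https://example.org/about", "sha1:BBB"),
]

# The home page is de-duplicated; the "about" page is counted as new,
# because the earlier capture of it exists only in an unsaved test crawl.
print(new_documents(crawled, permanent))
```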
Video tutorial on running test crawls