We always recommend running a test crawl first when adding new seeds to a collection. This article provides instructions for how to run, monitor, and review test crawls. It also provides directions for saving or deleting test crawl data, best practices, and an overview of the test crawl data de-duplication process.
On this page:
- Why use test crawls
- How to run a test crawl
- How to monitor a test crawl in progress
- How to review the results of a test crawl
- How to save or discard test crawl data
- Why should I delete a test crawl rather than let it expire?
- Data de-duplication in test crawls
- Video tutorial on running test crawls
- Related content
Why use test crawls
Unlike production crawls, test crawls do not automatically add data to your account; they must be manually saved for their data to be added. Because of this, whenever you create a new collection or add new seeds to an existing collection, we highly recommend running and evaluating a test crawl before you permanently add any new data to your Archive-It account.
Test crawls are an easy way to get an accurate indication of how your seeds will crawl, archive, and replay before you expend any of your account's data budget. They allow you to review and plan for the crawl duration, scoping rules, and other factors that can make your future "full production" crawls as successful and efficient as possible.
How to run a test crawl
Once you have added one or more new seeds to a collection, you can run a test crawl to observe how they archive and replay and, if necessary, take steps to improve that process. To do so, follow the directions provided in our full guidance on launching one-time and test crawls from a seeds list. Remember to select the "Test Crawl" radio button next to "Crawl Type" in the dialog box.
As with other one-time crawls, you can specify any data, document, or time limits that you wish to apply.
After you select "Test Crawl" as the crawl type and adjust any other crawl parameters, click the "Crawl" button to begin your test crawl.
How to monitor a test crawl in progress
Once your new test crawl has been launched, you can monitor it from the "Crawls" section of the web application, or from the "Current Crawls" or "Test Crawls" lists under the given collection's Crawls tab.
Click the Crawl ID link in the left-hand column of any of these lists to see a full crawl report for your test crawl as it runs. To understand and act on any aspect of this crawl, as you would for any full production crawl, see our complete guidance on how to monitor your crawls.
|Note: The one difference in monitoring capabilities between test and production crawls is that you cannot resume a stopped test crawl. That functionality exists only for full production crawls.|
How to review the results of a test crawl
How to read test crawl reports
Once your test crawl has completed, you can find its report either in the "Crawls" section of the web application, as you would for any full production crawl, or in the "Crawl Reports" or "Test Crawls" lists in the given collection's Crawls tab.
Click the Crawl ID link in the left-hand column of any of these lists to see your test crawl's full, completed reports. To read and interpret the information in these reports, consult our complete guidance on how to read your crawl's report.
How to browse test crawl results in Wayback
As with your full production crawls, you can browse the appearance of your test crawl results in Wayback, starting approximately 24 hours after your crawl completes. To do so, simply click the Wayback > link corresponding to each seed in your test crawl's Seeds report.
|Note: The only place you can view unsaved test crawl results in Wayback is via the Seeds tab of your test crawl report.|
As the blue banner on each page indicates, these are the results of a test crawl and must be either saved permanently, deleted manually, or allowed to automatically expire after 60 days. Directions for these options are provided below.
How to save or discard test crawl data
Test crawl data is temporary by default, but can be saved permanently at your discretion. After reviewing a completed test crawl, you can either save the crawl, which permanently archives it in your collection, or delete it, if you want to refine your scoping and run a new test crawl. If you take neither action, your data will automatically expire and be deleted from Wayback 60 days after the crawl completes; report data will remain accessible. Saving a test crawl saves all of the seeds in that crawl. Once a crawl is saved, it cannot be deleted.
Like regular one-time or recurring crawls, saved test crawls require approximately 24 hours after the save process completes to index and appear in Wayback. The time it takes to save a crawl varies depending on the crawl's size and on how many other crawls are in the process of saving.
As soon as your test crawl completes, its report provides both options as buttons at the top of your view. It also includes a banner reminding you how long you have left to make your decision before the data automatically expires.
|Tip: While the WARC data for any expired/deleted test crawl will no longer be available to browse in Wayback, the test crawl report data will remain available for your future reference.|
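To make the 60-day window concrete, here is a minimal Python sketch of the date arithmetic. It is an illustration only, not part of the Archive-It web application: the function names are hypothetical, and it assumes the window is counted from the moment the crawl completes, as described above.

```python
from datetime import datetime, timedelta, timezone

# Assumption: unsaved test crawl data expires 60 days after crawl completion.
TEST_CRAWL_RETENTION = timedelta(days=60)


def expiry_time(completed_at: datetime) -> datetime:
    """Return the moment unsaved test crawl data would expire."""
    return completed_at + TEST_CRAWL_RETENTION


def days_remaining(completed_at: datetime, now: datetime) -> int:
    """Return whole days left to save or delete the crawl (never negative)."""
    remaining = expiry_time(completed_at) - now
    return max(remaining.days, 0)


# Hypothetical example: a test crawl that completed on 15 January 2024.
completed = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
today = datetime(2024, 2, 1, 12, 0, tzinfo=timezone.utc)

print(expiry_time(completed).date())     # 2024-03-15
print(days_remaining(completed, today))  # 43
```

The banner in your test crawl report remains the authoritative source for the actual deadline.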
Why should I delete a test crawl rather than let it expire?
If you aren't satisfied with the results of a test crawl, we recommend deleting it and trying again rather than just letting the crawl expire on its own. This is because data captured in previous crawls can impact replay of content captured in future crawls and vice versa.
The goal of Wayback replay (both test and production) is to load complete pages whenever possible. It uses the timestamps and checksum values of archived documents in the CDX index to determine what to load, rather than looking at which crawl a document came from. This means that a document that was missing from test crawl A, but was captured in test crawl B, could potentially replay in captures from both crawls.
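The cross-crawl effect described above can be illustrated with a simplified Python sketch. This is not Wayback's actual implementation: the `Capture` record, the `resolve` function, and the "closest timestamp wins" rule are assumptions made for illustration. The point is only that the lookup never consults the crawl a document came from.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Capture:
    """Hypothetical, simplified stand-in for one index entry."""
    url: str
    timestamp: datetime
    digest: str    # checksum of the archived document
    crawl_id: str  # stored for illustration; resolve() never looks at it


def resolve(index: list[Capture], url: str, requested: datetime) -> Optional[Capture]:
    """Pick the capture of `url` closest in time to `requested` (assumed rule)."""
    candidates = [c for c in index if c.url == url]
    if not candidates:
        return None
    return min(candidates, key=lambda c: abs(c.timestamp - requested))


# Test crawl A captured the page but missed its logo; test crawl B captured both.
index = [
    Capture("https://example.org/", datetime(2024, 1, 10), "sha1:AAA", "A"),
    Capture("https://example.org/", datetime(2024, 1, 20), "sha1:AAA", "B"),
    Capture("https://example.org/logo.png", datetime(2024, 1, 20), "sha1:BBB", "B"),
]

# Replaying crawl A's page (10 January) still finds the logo from crawl B.
logo = resolve(index, "https://example.org/logo.png", datetime(2024, 1, 10))
print(logo.crawl_id)  # B
```

In this sketch, deleting crawl B would remove the only logo capture, so crawl A's page would replay without it. That is why leftover test crawl data can make a later crawl look more complete than it really is.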
Data de-duplication in test crawls
Test crawls will de-duplicate against permanent data in your account, meaning any data captured via a production crawl (recurring or one-time) or saved test crawl. Test crawls do not de-duplicate against data captured in any unsaved test crawls.
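The rule above can be sketched in Python as follows. This is a hedged illustration rather than the crawler's real logic: the `StoredCapture` record, the function names, and the choice to compare on URL plus checksum are all assumptions. What it shows is that only permanent data (production crawls and saved test crawls) forms the baseline that a new test crawl de-duplicates against.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredCapture:
    """Hypothetical record of a document already captured in the account."""
    url: str
    digest: str       # checksum of the document's content
    crawl_type: str   # "production" or "test"
    saved: bool       # only meaningful for test crawls


def is_permanent(capture: StoredCapture) -> bool:
    """Production data and saved test crawl data count as permanent."""
    return capture.crawl_type == "production" or capture.saved


def dedup_baseline(existing: list[StoredCapture]) -> set[tuple[str, str]]:
    """Collect (url, digest) pairs from permanent data only."""
    return {(c.url, c.digest) for c in existing if is_permanent(c)}


def is_duplicate(url: str, digest: str, baseline: set[tuple[str, str]]) -> bool:
    """A new test crawl document is a duplicate only if permanent data has it."""
    return (url, digest) in baseline


existing = [
    StoredCapture("https://example.org/a.pdf", "sha1:111", "production", False),
    StoredCapture("https://example.org/b.pdf", "sha1:222", "test", True),   # saved
    StoredCapture("https://example.org/c.pdf", "sha1:333", "test", False),  # unsaved
]
baseline = dedup_baseline(existing)

print(is_duplicate("https://example.org/a.pdf", "sha1:111", baseline))  # True
print(is_duplicate("https://example.org/b.pdf", "sha1:222", baseline))  # True
print(is_duplicate("https://example.org/c.pdf", "sha1:333", baseline))  # False
```

Under these assumptions, running a second test crawl without saving the first would capture `c.pdf` again in full, because the unsaved copy is not part of the baseline.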
Video tutorial on running test crawls