On this page:
- Why use test crawls
- How to do it
- Video tutorial on test crawls
Why use test crawls
Whenever you create a new collection or add new seeds to an existing collection, we highly recommend running and evaluating a "test crawl" before you permanently add any new data to your Archive-It account. Test crawls are an easy way to get an accurate indication of how your seeds will crawl, archive, and replay before you expend any of your account's data budget. They allow you to review and plan for the crawl durations, scoping rules, and other factors that can make your future "full production" crawls as successful and efficient as possible.
How to do it
How to run a test crawl
Once you have added a new seed or new seeds to a collection, you can run a test crawl in order to observe how they archive and replay, and, if necessary, take steps to improve that process. To do so, simply follow the directions provided in our full guidance on launching one-time and test crawls from a seeds list, remembering to click the radio button in the dialog box that indicates the crawl is indeed a "Test Crawl."
As with other one-time crawls, you can specify any data, document, or time limits that you wish to apply.
How to monitor a test crawl in progress
Once your new test crawl has been launched, you can monitor it from the "Crawls" section of the web application, or from the "Current Crawls" or "Test Crawls" lists under the given collection's Crawls tab.
Click the View > link in any of these lists to see a full crawl report for your test crawl as it runs. To understand and act on any aspect of this crawl, as you would for any full production crawl, see our complete guidance on how to monitor your crawls.
*Note: The one difference in monitoring capabilities between test and production crawls is that you cannot resume a stopped test crawl. That functionality exists only for full production crawls.
How to review the results of a test crawl
How to read test crawl reports
Once your test crawl has completed, you can find its reports listed either in the "Crawls" section of the web application, as you would for any full production crawl, or in the "Crawl Reports" or "Test Crawls" lists in the given collection's Crawls tab.
Click the View > link in any of these lists in order to see your test crawl's full and completed reports. To read and interpret the information in these reports, consult our complete guidance on how to read your crawl's report.
How to browse test crawl results in Wayback
As with your full production crawls, you can browse your test crawl results in Wayback, starting approximately 24 hours after your crawl completes. To do so, simply click the Wayback > link corresponding to each seed in your test crawl's Seeds report. The Seeds tab of your test crawl report is the only place from which you can view unsaved test crawl results in Wayback.
As the blue banner on each page indicates, these pages are the results of a test crawl and must either be saved permanently to an existing collection in your account or deleted before they automatically expire. Directions for both options are provided below.
How to save or discard test crawl data
Please remember that test crawl data is temporary by default, but can be saved permanently at your discretion. After your test crawl completes and you have reviewed it, you can either save the crawl, which will permanently archive it in your collection, or delete the crawl, if you want to refine your scoping and run a new test crawl. (Note that, as with regular "one-time" or recurring crawls, saved data will require approximately 24 hours to index and appear at its permanent URL in Wayback.) If you take neither action, your data will automatically "expire" and be deleted 60 days after the completion of your crawl.
As soon as your test crawl completes, its report provides both options in the form of buttons at the top of your view, along with a banner reminding you how long you have left to make your decision before the data automatically expires.
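If you track review deadlines outside the web application, the 60-day expiration window can be worked out with simple date arithmetic. The following is only a minimal sketch: the function names are hypothetical and not part of any Archive-It tool or API, and it assumes the window is counted as 60 calendar days from the crawl's completion time. The banner on your test crawl report remains the authoritative countdown.

```python
from datetime import datetime, timedelta, timezone

# 60-day retention window for unsaved test crawl data, as stated above.
TEST_CRAWL_RETENTION_DAYS = 60


def crawl_expiry(completed_at: datetime) -> datetime:
    """Return the assumed moment at which unsaved test crawl data expires."""
    return completed_at + timedelta(days=TEST_CRAWL_RETENTION_DAYS)


def days_remaining(completed_at: datetime, now: datetime) -> int:
    """Return whole days left to save or delete the crawl (never negative)."""
    remaining = crawl_expiry(completed_at) - now
    return max(remaining.days, 0)


# Example: a test crawl that completed at noon UTC on 1 March 2024.
completed = datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc)
checked = datetime(2024, 4, 1, 12, 0, tzinfo=timezone.utc)

print(crawl_expiry(completed).date())      # 2024-04-30
print(days_remaining(completed, checked))  # 29
```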
*Note: While the WARC data for any expired/deleted test crawl will no longer be available to browse in Wayback, the test crawl report data will remain available for your future reference.
Video tutorial on running test crawls