Zero Queued docs yet crawl continues to collect data

March 01, 2017 14:31

I have been running test crawls recently and thought I could use queued documents as a metric for judging whether a crawl is near completion or whether a longer crawl might be necessary. I started a 24-hour test crawl yesterday afternoon, and when I checked it last night there were no queued documents for any of the hosts in the crawl report. However, when I looked this morning, the crawler had somehow found 5 more gigs of data. Can someone explain how a crawl can continue if there are no queued documents?

Also, I have yet to see anything show up in the recently crawled box. How is it determined what shows up in this box?

Comments

1 comment

Mary Haberle March 07, 2017 16:16

Thanks for raising this question. Queued documents will not appear as part of real-time reporting because they are calculated after the crawl completes. In general, time, data, and document counts are the best metrics for monitoring your running crawls. You can view and increase any one of these limits while a crawl is in progress, but please note that you cannot resume a stopped test crawl. There is more guidance on how to monitor running crawls in our help center: https://support.archive-it.org/hc/en-us/articles/208332973-How-to-monitor-your-crawls

The blank “Recently Crawled” box you saw when viewing the reports for a running crawl is a bug that we see from time to time in our reporting system when a large number of crawls are running simultaneously. In response to this scalability issue, we are working behind the scenes to improve our reports data system so that we can ensure uninterrupted access to current crawl reports, as well as more reliable crawl report data that is available immediately upon crawl completion. We plan to bring the improved system online in the coming months.

In the meantime, running test crawls is the most effective way to safeguard your annual data budget. Test crawls let you review your results in the crawl report and via Wayback 24 hours after a crawl completes.
