Large number of queued documents, finished crawl, but no trap using Brozzler
We just used Brozzler to crawl a site with a lot of dynamic content. It worked well and ultimately finished (not due to a time limit), but we have over 250,000 queued documents. I spot-checked them in proxy mode: the URLs appear to be valid links and also seem to be properly archived (although it is possible that those in the archive were captured in earlier non-Brozzler crawls). I'm somewhat new to Archive-It and am wondering why such a large number of queued documents remains with this finished crawl.
Official comment
The best way to confirm is to check the crawl's host report: do the exact same URLs appear as both queued and captured? The most likely explanation is that some sites serve multiple versions of a URL (such as different sizes of an image). In that case one version of a URL is captured and replays in the archived page, while the other versions remain queued.
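One way to run that check yourself is to compare the queued and captured URL lists exported from the host report. This is a hedged sketch, not an Archive-It tool: the CSV excerpts, the `url` column name, and the `load_urls` helper are all hypothetical stand-ins for whatever your actual report export contains.

```python
import csv
import io

def load_urls(csv_text, column="url"):
    """Collect one column of a host-report-style CSV export into a set."""
    return {row[column] for row in csv.DictReader(io.StringIO(csv_text))}

# Hypothetical excerpts from the "queued" and "captured" lists of a host report
queued_csv = (
    "url\n"
    "http://example.com/img.jpg?size=small\n"
    "http://example.com/page2\n"
)
captured_csv = (
    "url\n"
    "http://example.com/img.jpg?size=small\n"
    "http://example.com/page1\n"
)

queued = load_urls(queued_csv)
captured = load_urls(captured_csv)

# URLs present in both lists suggest duplicate variants of a resource,
# not a crawler trap
overlap = queued & captured
print(f"{len(overlap)} of {len(queued)} queued URLs were also captured")
```

A large overlap between the two lists would support the duplicate-variant explanation above.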