Scope rules and speed of crawl - is there an inverse relationship?
One of the state agencies we regularly crawl has set up distinct websites, with distinct domains, for individual projects - nearly 30 of them. We have the option of either establishing new seeds for each, or expanding the scope of the agency's main seed to include these project sites. I chose the latter just because these are all temporary project sites, and establishing new seeds for each would be a lot more work (creating metadata for each, etc.). But the test crawl I'm running is proceeding super slowly, and I'm curious whether so many scope modifications dramatically slow down a crawl, or if I should be looking for another reason why it's running so slowly. Thanks!
-
Katie,
I recently touched on this topic with the help desk, and they indicated that a lot of scoping could possibly slow down a crawl, but they didn't know for sure. Another thing that could be slowing it down is if any of the project sites have rules, through robots.txt or the like, that restrict the speed of crawling on those sites. You could check their robots.txt files to see if anything is set up there.
Skip Kendall
Harvard University Archives
-
Thanks for asking, Katie. Would you mind posting the seed and an example project site here? Or send me the crawl report directly, over at the help desk? I suspect that something other than scoping rules might be slowing your crawl artificially, but I'd be happy to confirm it.
Skip is right; site owners will sometimes limit the speed of crawling in order to manage the load on their servers. If the site has a robots.txt file, a "Crawl-delay" directive can specify the number of seconds to wait between requests. Here's an example from Archive-It's own robots file: https://wayback.archive-it.org/org-884/20220329200736/https://archive-it.org/robots.txt
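If you'd like to check for this yourself, here is a minimal sketch using only Python's standard-library urllib.robotparser; the user agent string and site URL are just illustrative examples, not the settings our crawlers actually use:

    # Minimal sketch: check a site's robots.txt for a Crawl-delay directive.
    # The user agent and site URL below are illustrative examples only.
    from urllib.robotparser import RobotFileParser

    def get_crawl_delay(site, user_agent="*"):
        rp = RobotFileParser()
        rp.set_url(site.rstrip("/") + "/robots.txt")
        rp.read()  # fetch and parse the robots.txt file
        # Returns the number of seconds to wait between requests, or None
        # if no Crawl-delay is declared for that user agent.
        return rp.crawl_delay(user_agent)

    if __name__ == "__main__":
        print("archive-it.org crawl-delay:", get_crawl_delay("https://archive-it.org"))

If this prints a number of seconds rather than None, the site is asking crawlers to pause that long between requests, which will slow a crawl regardless of your scoping rules.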
We have strategies to mitigate the effects of these and other kinds of delays when needed, though. Please let us know if we can help from here on our side!
-
Thanks to you both! I just checked and the crawl is actually running much faster now, and I do not see a crawl delay in the robots.txt file, so I think all is well. Because this is the first time I have added this many modifications to a single seed, I think I just jumped to an incorrect conclusion too quickly. The seed is dot.alaska.gov, and one of the project sites is vineandhollywood.com. We've got two test crawls going, Brozzler and standard.