Scope rules and speed of crawl - is there an inverse relationship?



    Skip Kendall


    I recently touched on this topic with the help desk and they indicated that it's possible that a lot of scoping could slow down a crawl but they didn't really know for sure. Another thing that could be slowing it down is if any of the project sites have rules, through robots.txt or the like, that are restricting the speed of the crawling on those sites. You could check through their robots.txt files to see if anything's set up there.

    Skip Kendall

    Harvard University Archives

    Karl Blumenthal

    Thanks for asking, Katie. Would you mind posting the seed and an example project site here? Or you could send me the crawl report directly, over at the help desk. I suspect that something other than scoping rules might be slowing your crawl artificially, but I'd be happy to confirm it.

    Skip is right; site owners will sometimes limit the speed of crawling in order to manage the load on their servers. If the site has a robots.txt file, it can set a "crawl-delay": the number of seconds that a crawler should wait between requests. Here's an example from Archive-It's own robots file:

    We have strategies to mitigate the effects of these and other kinds of delays when need be though. Please let us know if we can help from here on our side!
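
    For anyone who wants to check a project site's robots.txt for a crawl delay without reading the file by hand, here is a minimal sketch using Python's standard urllib.robotparser module. The sample robots.txt text and the helper names are hypothetical and only for illustration; the sample is not Archive-It's actual file, and a given crawler may interpret the directive differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def crawl_delay_from_text(robots_txt, user_agent="*"):
    """Return the Crawl-delay (in seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


def crawl_delay_from_url(robots_url, user_agent="*"):
    """Fetch a live robots.txt file and return its Crawl-delay, or None."""
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # performs the HTTP request
    return parser.crawl_delay(user_agent)


if __name__ == "__main__":
    delay = crawl_delay_from_text(SAMPLE_ROBOTS_TXT)
    if delay is None:
        print("No crawl delay set for this user agent.")
    else:
        print(f"Crawl delay: {delay} seconds between requests.")
```

    To check a real site, pass the full address of its robots.txt file to crawl_delay_from_url. A result of None means the file sets no delay for that user agent, in which case the slowdown probably has another cause.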

    Katie Fearer

    Thanks to you both! I just checked and the crawl is actually running much faster now, and I do not see a crawl delay in the robots.txt file, so I think all is well. Because this is the first time I have added this many modifications to a single seed, I think I just jumped to an incorrect conclusion too quickly.  The seed is, and one of the project sites is  We've got two test crawls going, Brozzler and standard.
