Scheduled crawls with a large number of seeds

Comments (3)

  • Stefana Breitwieser

    We have a similar issue on a smaller scale: large numbers of seeds that we wanted to crawl at a regular frequency, but the crawl results were sometimes inconsistent between seeds in a large scheduled crawl. Some of these seeds worked better with a Brozzler crawl and others with a Standard crawl, which also made it difficult to run all of the seeds at the same frequency at the same time.

    We ended up un-scheduling the crawls and now use the frequency setting just to keep them organized. We crawl them manually, one at a time or in smaller groups, which has been a successful strategy for us and has made QA more manageable. We only crawl 50-60 seeds using this workflow -- as our collection continues to grow, I'd definitely be interested to hear from others whether this workflow scales well to much larger collections!

  • Alex Thurman (Edited)

    Hi Silvia,

    In our Human Rights collection we have over 700 seeds that for years have been part of a quarterly scheduled crawl, but like you and Stefana I have noticed increasingly unsatisfactory results with this approach. Last fall I finally used the Groups functionality to divide these seeds into 8 separate groups (with the group names set to private) just for crawling. From now on I will be putting the quarterly crawl dates on my work calendar and manually starting a crawl of each of the 8 seed groups on those dates. Each crawl has a data limit and is set to run for 14 days, to maximize the chances of all seeds being captured adequately. Since crawling even 100 seeds at once can still lead to incomplete captures, I followed up these crawls with PDF-only crawls, again to maximize the amount of relevant material captured without too much of a data cost. (A quick way to split a seed list into groups outside the web interface is sketched at the end of this thread.)

  • Silvia Sevilla

    Now that we can assign different groups to the same seed and decide whether or not each group is visible, I am also using groups to divide the seeds. This way, even though I can't schedule the crawls, I can launch them manually according to my calendar.
    Thank you for sharing your good practices!

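The comments above converge on the same workaround: split a large seed list into smaller groups and crawl each group manually on a calendar date. Below is a minimal sketch of doing that split locally with nothing but the Python standard library, assuming a plain-text export with one seed URL per line; the file names and the group count of 8 are illustrative placeholders, not part of the web application's own tooling.

    from pathlib import Path

    SEED_FILE = Path("seeds.txt")   # hypothetical export: one seed URL per line
    NUM_GROUPS = 8                  # e.g. 8 groups for roughly 700 seeds

    def split_into_groups(seeds, num_groups):
        """Distribute seeds round-robin so group sizes differ by at most one."""
        groups = [[] for _ in range(num_groups)]
        for i, seed in enumerate(seeds):
            groups[i % num_groups].append(seed)
        return groups

    def main():
        # Read the seed list, skipping blank lines.
        seeds = [line.strip() for line in SEED_FILE.read_text().splitlines() if line.strip()]
        # Write each group to its own file for pasting into a separate seed group.
        for n, group in enumerate(split_into_groups(seeds, NUM_GROUPS), start=1):
            out = Path(f"seed_group_{n:02d}.txt")
            out.write_text("\n".join(group) + "\n")
            print(f"{out}: {len(group)} seeds")

    if __name__ == "__main__":
        main()

Each output file can then be pasted into its own (private) seed group in the web application and crawled on its own date, as described in the comments.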
