Scheduled crawls with a large number of seeds
I have more than 250 seeds in the same collection that need to be captured with a similar frequency (quarterly, in my case). However, scheduling a crawl with such a large number of seeds has several drawbacks:
- It is difficult to carry out quality control.
- Too many seeds can lead to the crawl never being completed.
- Capture is uneven across seeds: some finish with the seed status "Not crawled (queued)" while others collect a huge amount of data.
I have seen some posts mentioning this issue, but since there is currently no way to schedule more than one crawl for a given frequency, I wonder if anyone can share best practices, workarounds, or tips to address this problem.
The only workarounds I can think of are splitting the seeds into different collections and scheduling periodic crawls for each at the same time, or assigning different frequencies to different seeds so that some are captured quarterly and the rest at another frequency, or even manually (see the sketch below).
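To make the staggering idea concrete, here is a minimal sketch of how roughly 250 seeds could be spread across the three months of a quarter, so that each monthly crawl handles about a third of them. This is illustrative only: the seed URLs are placeholders, and the actual scheduling would still be done in the Archive-It web application.

```python
# Illustrative only: Archive-It scheduling itself happens in the web UI.
# This just shows how ~250 seeds could be spread over a quarter so that
# each monthly crawl handles roughly a third of them.
seeds = [f"https://example.org/site{i}" for i in range(250)]  # placeholder URLs

start_months = {0: "January", 1: "February", 2: "March"}  # offsets within a quarter
schedule = {month: [] for month in start_months.values()}

for i, seed in enumerate(seeds):
    schedule[start_months[i % 3]].append(seed)

for month, batch in schedule.items():
    print(f"Quarterly crawls starting in {month}: {len(batch)} seeds")
```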
Thanks!
-
We have a similar issue on a smaller scale: we have large numbers of seeds that we wanted to crawl at a regular frequency, but crawl results were sometimes inconsistent between seeds in a large scheduled crawl. Some of these seeds worked better with a Brozzler crawl and others with a Standard crawl, which also made it difficult to run all of the seeds for the same frequency at the same time.
We ended up un-scheduling the crawls and now use the frequency setting just to keep them organized. We crawl them manually, one at a time or in smaller groups, which has been a successful strategy for us and has made QA more manageable. We only crawl 50-60 seeds with this workflow; as our collection continues to grow, I'd definitely be interested to hear from others whether it scales well to much larger collections!
-
Hi Silvia
In our Human Rights collection we have over 700 seeds that for years have been part of a quarterly scheduled crawl, but like you and Stefana, I have noticed increasingly unsatisfactory results with this approach. Last fall I finally used the Groups functionality to divide these seeds into 8 separate groups (with group names set to private) just for crawling. From now on I will be putting the quarterly crawl dates on my work calendar and manually starting crawls of each of the 8 seed groups on those dates. Each crawl has a data limit and is set for 14 days, to maximize the chances of all seeds being captured adequately. Since crawling even 100 seeds at once can still lead to incomplete captures, I followed up these crawls with PDF-only crawls, again to maximize the amount of relevant material captured without too great a data cost.
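For anyone who wants to script the bookkeeping side of this, here is a minimal sketch of partitioning a seed list into 8 roughly equal crawl groups and pairing them with quarterly start dates to copy onto a calendar. The file name, group count, and dates are assumptions for illustration; the group assignment and crawl launches themselves still happen in the Archive-It web application.

```python
from datetime import date
from itertools import islice

def chunk(seeds, n_groups):
    """Split a seed list into n_groups groups of roughly equal size."""
    size = -(-len(seeds) // n_groups)  # ceiling division
    it = iter(seeds)
    return [list(islice(it, size)) for _ in range(n_groups)]

# Hypothetical input file: one seed URL per line, exported from the collection.
with open("human_rights_seeds.txt") as f:
    seeds = [line.strip() for line in f if line.strip()]

groups = chunk(seeds, 8)
quarters = [date(2025, m, 1) for m in (1, 4, 7, 10)]  # assumed quarterly dates

for i, group in enumerate(groups, start=1):
    print(f"Group {i}: {len(group)} seeds")
for q in quarters:
    print(f"Launch all 8 group crawls on {q.isoformat()}, 14-day duration")
```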
-
Now that we can assign the same seed to different groups and decide whether each group is visible, I am also using groups to divide my seeds. This way, even though I can't schedule the crawls, I will launch them manually according to my calendar.
Thank you for sharing your good practices!