Seed scoping: preventing a seed from being discovered and crawled via another seed
The documentation pages say: "Scope specifics - Standard crawls
The Standard crawling technology uses all seed URLs in a given crawl to determine that crawl's scope. This means, if seeds include links to one another, it's possible for content from one seed to be discovered and crawled via another seed."
Is there a way to avoid this behavior other than to launch an individual crawl for each seed?
-
The only way I can think of that might work would be to create a scoping rule in each seed that excludes the others. That would be a fair amount of work, though.
It may not be a problem, though. De-duplication should prevent the same pages from being captured via different paths, meaning you should only get one copy of each file, even if it can be reached from different places.
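To give a sense of how much work the per-seed exclusion approach involves, here is a minimal sketch that lists, for each seed, which other seeds it would need a rule to block. It is plain Python and does not call any crawler API; the function name `exclusion_rules` and the example URLs are hypothetical, and the resulting rules would still have to be entered by hand as seed-level scope rules.

```python
def exclusion_rules(seeds):
    """Map each seed to the other seed URLs it would need to block.

    A seed that is a prefix of the current seed is skipped, because
    blocking it would also block the current seed itself.
    """
    return {
        seed: [
            other
            for other in seeds
            if other != seed and not seed.startswith(other)
        ]
        for seed in seeds
    }


# Hypothetical seed list, for illustration only.
seeds = [
    "https://example.org/a/",
    "https://example.org/b/",
    "https://example.net/",
]

rules = exclusion_rules(seeds)
for seed, blocked in rules.items():
    print(seed)
    for url in blocked:
        print("  block URLs containing:", url)
```

With N seeds that do not nest inside one another, this comes to N × (N − 1) rules in total, which is why it gets tedious quickly for larger seed lists.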
-
From what I understand, deduplication only works within the same seed, so it would not help much in this case. Also, the seed scope rules that are applied are those of the original seed, not those of the seed the pages belong to, which is problematic as well.
But indeed, your solution would solve those problems. We would then need to create, in each seed, a rule excluding everything except the seed we want to capture, right? It is worth testing if no easier option exists.
Thank you