Seed scoping: preventing a seed from being discovered and crawled via another seed

Comments

  • Skip Kendall

    The only approach I can think of that might work would be to create a scoping rule in each seed that excludes the others. That would be a fair amount of work, though.

    It may not be a problem, though. De-duplication should prevent the same pages from being captured via different paths, meaning you should only get one copy of each file, even if it can be reached from different places.

  • Evelyne

    From what I understand, deduplication only works within the same seed, so it does not help in this case. Also, the scope rules that are applied are those of the original seed, not those of the seed the pages belong to, which is problematic as well.

    But indeed, with your solution those problems would be solved. We would then need to create, in each seed, a rule excluding everything except the seed we want to capture, right? It is worth testing if no easier option exists.

    Thank you  

  • Skip Kendall

    Ah, yes, I forgot about that.

    Yeah, that's what I was thinking about with the rule. If you were crawling 4 seeds together, each would need rules to exclude the other 3.

  • Evelyne

    Yes, but we have over 300 seeds to crawl together, so it scales in a more complex way ;)
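
To put a number on the scaling concern, here is a rough sketch of how the per-seed exclusions discussed in this thread could be generated rather than written by hand. The seed URLs and function names are hypothetical, and the result is only a plain mapping of each seed to the seeds it should exclude; how such rules would actually be entered into the crawler is not covered here.

```python
def exclusion_rules(seeds):
    """Map each seed to the list of every other seed it should exclude."""
    return {seed: [other for other in seeds if other != seed] for seed in seeds}


def rule_count(n_seeds):
    """Total exclusions needed when each of n seeds excludes the other n - 1."""
    return n_seeds * (n_seeds - 1)


# Hypothetical seed list, for illustration only.
seeds = [
    "https://a.example.org/",
    "https://b.example.org/",
    "https://c.example.org/",
    "https://d.example.org/",
]

rules = exclusion_rules(seeds)
for seed, excluded in rules.items():
    print(seed, "excludes", excluded)

print(rule_count(4))    # 4 seeds, each excluding the other 3
print(rule_count(300))  # the 300-seed case mentioned above
```

With 4 seeds this gives 12 exclusions in total; with 300 seeds it gives 89,700, which is why maintaining the rules manually becomes impractical.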