Scoping Question for realscientists.org

  • Official comment
    Mary Haberle

    This is a great question that strikes at the heart of scoping rules!

    The placement of the last forward slash ( / ) appearing in a seed URL is the primary way that the crawler determines what is in scope for a crawl. “Plus” seed types and expand scope rules can be applied to extend capture further.

    In your example, the format of your seed URL http://realscientists.org/author/jens-foell tells the crawler to put any URL beginning with the string http://realscientists.org/author/ in scope, which explains why http://realscientists.org/author/jens-foell/page/2/ was captured.

    However, the articles linked from your seed page do not begin with the string http://realscientists.org/author/. Instead, they begin with http://realscientists.org/2016/. Because you also did not use a Plus seed type or add any expand scope rules, the crawler considered them out of scope.
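
    To make the rule concrete, here is a minimal sketch of the "last forward slash" check described above. It is only a simplified model for illustration, not the crawler's actual code; the function names scope_prefix and in_scope are made up for this example.

```python
from urllib.parse import urlsplit


def scope_prefix(seed_url: str) -> str:
    """Return the seed URL up to and including the last slash in its path.

    Simplified model of the default scoping rule; names are hypothetical.
    """
    parts = urlsplit(seed_url)
    path = parts.path or "/"
    return f"{parts.scheme}://{parts.netloc}{path[: path.rfind('/') + 1]}"


def in_scope(seed_url: str, candidate_url: str) -> bool:
    """A candidate is in scope when it begins with the seed's scope prefix."""
    return candidate_url.startswith(scope_prefix(seed_url))


seed = "http://realscientists.org/author/jens-foell"

print(scope_prefix(seed))
# http://realscientists.org/author/

print(in_scope(seed, "http://realscientists.org/author/jens-foell/page/2/"))
# True

print(in_scope(seed, "http://realscientists.org/2016/"))
# False
```

    The second result is why page 2 was captured, and the third is why the articles under /2016/ were not.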

    Since only two pages list all the articles authored by Jens Foell, I recommend:

    1) Edit your seed to include the ending slash: http://realscientists.org/author/jens-foell/

    2) Add a second seed: http://realscientists.org/author/jens-foell/page/2/ (you can make it a ‘private’ seed if you do not want it to appear alongside your other seeds on Archive-It’s public-facing website).

    3) Set both seeds to the “One Page Plus” seed type and crawl them together.

    Adding an ending slash to your seeds will focus the crawler on pages related to your target author and reduce the capture of unwanted data. The One Page Plus seed type will expand the scope of your crawl so that you can capture the links listed on your seed page that do not share the beginning URL path of your seed. This is in fact the origin of the One Page Plus seed type: it was designed specifically to capture articles linked from a common feed even when the articles themselves sit outside the original seed URL’s natural scope.
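
    The effect of both changes can be sketched the same way. This is a rough model under stated assumptions, not Archive-It's implementation: it assumes One Page Plus captures the seed page plus the URLs linked directly from it, and the function names, the "another-author" URL, and the "example-article" URL are all hypothetical.

```python
from urllib.parse import urlsplit


def scope_prefix(seed_url: str) -> str:
    """Seed URL up to and including the last slash in its path (simplified)."""
    parts = urlsplit(seed_url)
    path = parts.path or "/"
    return f"{parts.scheme}://{parts.netloc}{path[: path.rfind('/') + 1]}"


def one_page_plus_captures(seed_url: str, links_on_seed: set, url: str) -> bool:
    """Assumed model: capture the seed page and anything it links to directly."""
    return url == seed_url or url in links_on_seed


# 1) The ending slash narrows the prefix from every author to one author.
other_author = "http://realscientists.org/author/another-author/"  # hypothetical
loose = scope_prefix("http://realscientists.org/author/jens-foell")
tight = scope_prefix("http://realscientists.org/author/jens-foell/")
print(other_author.startswith(loose))  # True: other authors would be in scope
print(other_author.startswith(tight))  # False: only this author's pages match

# 2) One Page Plus reaches articles that do not share the seed's path.
seed = "http://realscientists.org/author/jens-foell/"
article = "http://realscientists.org/2016/example-article/"  # hypothetical link
print(article.startswith(tight))                           # False
print(one_page_plus_captures(seed, {article}, article))    # True
```

    In this model the article is captured only because it is linked from the seed page, which is why page 2 needs to be added as its own seed for its links to be picked up too.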

    You can read more about scoping in our help center: https://support.archive-it.org/hc/en-us/sections/201864583-Scoping-Crawls

    Our new video curriculum also includes a Pre-crawl Scoping video that may be of interest: https://support.archive-it.org/hc/en-us/articles/216489103#gettingstartedPreCrawl
