Scoping Question for realscientists.org

January 18, 2017 22:25

Greetings,

I am trying to archive all content that Jens Foell has written for the site realscientists.org.

I used the seed URI http://realscientists.org/author/jens-foell which contains links to all of the author's contributions to this site.

I am using the Standard seed type and have not modified the scoping rules from the default.

Archive-It archives the seed URI and the next page URI of http://realscientists.org/author/jens-foell/page/2/, but according to the crawl reports, it considers all of the article URIs to be out of scope.

Does anyone have suggestions as to why Archive-It considers URIs like these to be out of scope:

* http://realscientists.org/2016/11/20/modeling-intestinal-cells-takes-a-lot-of-guts-helen-dockrell-joins-real-scientists/

* http://realscientists.org/2016/09/12/turn-up-the-speed-its-particle-accelerator-week-at-real-scientists-particle-2/

This is not the only site I am archiving where large parts of the crawl are considered to be out of scope, so I would really like to understand what is going on.

Thanks in advance,

Shawn

Comments

Official comment

Mary Haberle January 27, 2017 17:09

This is a great question that strikes at the heart of scoping rules!

The placement of the last forward slash ( / ) appearing in a seed URL is the primary way that the crawler determines what is in scope for a crawl. “Plus” seed types and expand scope rules can be applied to extend capture further.

In your example, the format of your seed URL http://realscientists.org/author/jens-foell tells the crawler to put any URL beginning with the string http://realscientists.org/author/ in scope, which explains why http://realscientists.org/author/jens-foell/page/2/ was captured.

However, the articles linked to from your seed page do not begin with the string http://realscientists.org/author/. Instead, they begin with http://realscientists.org/2016/. This, in combination with the fact that you did not use a Plus seed type or add any expand scope rules, means that the crawler considered them out of scope.
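To make the last-slash rule concrete, here is a minimal Python sketch of the prefix check described above. It is only an illustration of the rule as stated in this reply, not Archive-It's actual crawler code; the function names scope_prefix and in_default_scope are hypothetical, and the sketch ignores Plus seed types and expand scope rules.

```python
from urllib.parse import urlsplit


def scope_prefix(seed_url: str) -> str:
    """Return the seed URL truncated after the last forward slash of its path."""
    parts = urlsplit(seed_url)
    path = parts.path or "/"  # a bare host is treated as having the path "/"
    directory = path[: path.rfind("/") + 1]
    return f"{parts.scheme}://{parts.netloc}{directory}"


def in_default_scope(url: str, seed_url: str) -> bool:
    """Simplified model: a URL is in scope if it begins with the seed's prefix."""
    return url.startswith(scope_prefix(seed_url))


seed = "http://realscientists.org/author/jens-foell"
next_page = "http://realscientists.org/author/jens-foell/page/2/"
article = (
    "http://realscientists.org/2016/11/20/"
    "modeling-intestinal-cells-takes-a-lot-of-guts-helen-dockrell-joins-real-scientists/"
)

print(scope_prefix(seed))                  # http://realscientists.org/author/
print(in_default_scope(next_page, seed))   # True
print(in_default_scope(article, seed))     # False
```

Under this model, the seed without a trailing slash scopes in everything under /author/ (other authors included), while the article URLs under /2016/ never match the prefix.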

Since there are only two pages that list all the articles authored by Jens Foell, I recommend:

1) Edit your seed to include the ending slash: http://realscientists.org/author/jens-foell/

2) Add the additional seed: http://realscientists.org/author/jens-foell/page/2/ (you may make it a ‘private’ seed if you do not want it to appear among the others on Archive-It’s public-facing website.)

3) Set both seeds to the “One Page Plus” seed type and crawl them together.

Adding an ending slash to your seeds will focus the crawler on pages related to your target author and reduce the capture of unwanted data. The One Page Plus seed type will expand the scope of your crawl to make it possible for you to capture the links listed on your seed page that do not share the beginning URL path of your seed. This is in fact the origin of the One Page Plus seed type; it was designed specifically to capture articles linked to a common feed even when the articles themselves come from places outside of the original seed URL’s natural scope.
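As a rough illustration of what "One Page Plus" adds, the sketch below models it as "the seed page plus every link found on that page", regardless of URL prefix. This is a simplified model based only on the description in this reply, not the crawler's real implementation; the names LinkCollector and one_page_plus_targets and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the <a href="..."> tags of one page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the seed page's URL.
                self.links.append(urljoin(self.base_url, href))


def one_page_plus_targets(seed_url: str, seed_html: str) -> list:
    """Simplified model: the seed itself plus every link one hop away."""
    collector = LinkCollector(seed_url)
    collector.feed(seed_html)
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys([seed_url, *collector.links]))


# Hypothetical stand-in for the author listing page's HTML.
sample_html = """
<a href="/2016/09/12/turn-up-the-speed-its-particle-accelerator-week-at-real-scientists-particle-2/">Article</a>
<a href="http://realscientists.org/author/jens-foell/page/2/">Older posts</a>
"""

seed = "http://realscientists.org/author/jens-foell/"
for url in one_page_plus_targets(seed, sample_html):
    print(url)
```

In this model the article under /2016/ becomes a capture target because it is linked from the seed page, even though it does not share the seed's URL prefix.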

You can read more about scoping in our help center: https://support.archive-it.org/hc/en-us/sections/201864583-Scoping-Crawls

Our new video curriculum also includes a Pre-crawl Scoping video that may be of interest: https://support.archive-it.org/hc/en-us/articles/216489103#gettingstartedPreCrawl
