Unknown HTTP Code -63

March 23, 2020 15:14

I cannot figure out what this means. It is now happening on all of our crawls. Has anyone else run into this, or does anyone know the cause? The sites are live and load when I click on the links. They are public and not password protected.

Comments

3 comments

Kenneth Keller March 23, 2020 16:27

I did track down the list of status codes: https://github.com/internetarchive/heritrix3/wiki/Status-Codes

Though I am no closer to knowing which prerequisite failed. More as I find out.
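
For anyone reading crawl reports later, here is a minimal sketch of looking up the prerequisite-related codes on that page. The descriptions are paraphrased rather than quoted, and the helper name `describe_status` is made up for this example, so verify the wording against the linked list.

```python
# Paraphrased descriptions of the prerequisite-related codes; treat the
# wording as an assumption and check it against the linked status code page.
PREREQUISITE_CODES = {
    -61: "prerequisite robots.txt fetch failed",
    -62: "some other prerequisite failed",
    -63: "a prerequisite could not be scheduled",
}


def describe_status(code):
    """Return a short description for a status code seen in a crawl report."""
    if code in PREREQUISITE_CODES:
        return PREREQUISITE_CODES[code]
    if code >= 100:
        # Ordinary HTTP status codes fall in the 100-599 range.
        return "regular HTTP status code"
    return "other crawler-assigned code (see the status code page)"


print(describe_status(-63))
```

Read this way, -63 points at something the crawler needed before it could fetch the page, not at the page itself.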

Kenneth Keller March 23, 2020 16:52 (Edited March 23, 2020 16:52)

It looks like, now, if a file meets a restriction in the Seed Scope, the entire crawl for that site stops or fails. This is new behavior; it used to skip the file and move on. So I believe this is more or less solved. I'll need to find a new way to block file types.

Karl Blumenthal March 23, 2020 19:50

Hi Kenneth,

Sounds like you might have already determined this, but the issue in this case is specifically with scoping rules that block "whois" and "robots.txt" requests. Our crawling technology has to make these requests before any site can be archived. Other scoping rules should not halt a crawl entirely. We have other methods to avoid robots exclusions when you need them. If you need to add the "Ignore robots.txt" feature for your future crawls, for instance, please contact us directly and we'll take care of it for you.
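
As a rough illustration of how a file-type rule can end up blocking one of these requests, here is a small sketch. It assumes "block if the URL contains this text" rules, and the request URLs, the `whois://` form, and the function names are hypothetical stand-ins, not the crawler's actual behavior.

```python
def prerequisite_requests(host):
    """Hypothetical stand-ins for the requests made before crawling a host."""
    return [f"http://{host}/robots.txt", f"whois://{host}"]


def rules_blocking_prerequisites(block_rules, host):
    """Return (rule, url) pairs where a 'contains text' block rule
    would also match one of the prerequisite requests."""
    hits = []
    for rule in block_rules:
        for url in prerequisite_requests(host):
            # Case-insensitive substring match, an assumed rule semantics.
            if rule.lower() in url.lower():
                hits.append((rule, url))
    return hits


# A rule meant to skip .txt files also catches robots.txt.
rules = [".pdf", ".txt", ".zip"]
for rule, url in rules_blocking_prerequisites(rules, "example.org"):
    print(f"rule {rule!r} would block prerequisite {url}")
```

A check like this before saving scope rules would flag a ".txt" block while leaving rules such as ".pdf" or ".zip" alone.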
