What is an HTTP 999 error?
I tried to crawl a LinkedIn page with both Brozzler and standard Archive-It, and both returned an HTTP 999 error. I tried a second time, and the Brozzler crawl sort of worked. A. What is an HTTP 999 error? B. What is the best way to crawl a LinkedIn page? C. Are the crawl parameters different for Brozzler than for regular Archive-It?
Gabrielle Barr-National Library of Medicine
Official comment
Hi Gabrielle,
A 999 error is a form of user-agent block that we’ve noticed linkedin.com seeds periodically return. Sometimes the seed status in the crawl report will read “Crawled (HTTP error 999)” and still return a valid capture; other times the seed will have to be re-crawled. Because this issue is only occasional, we haven’t developed a site-specific linkedin.com page for our help center.
Our best advice when crawling linkedin.com pages is to do the following:
1. Add a seed-level ignore robots.txt rule
2. Add login credentials to your seed’s settings, if the content you wish to capture isn’t publicly available on the live web
3. Run a test crawl, since you may not get a successful capture on the first try
4. Review your crawl results in Wayback and run a new test crawl if you did not get a successful capture
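As a rough illustration of steps 3 and 4, here is a minimal Python sketch of the "test crawl, review, re-crawl" loop. It is not an Archive-It or Brozzler API: `run_test_crawl` is a hypothetical stand-in for whatever you do to launch a test crawl and review it in Wayback, and `describe_status` and `crawl_until_captured` are names made up for this example. The one firm fact it relies on is that 999 falls outside the standard HTTP status ranges (1xx to 5xx), so it is treated as a site-specific block signal. Because a seed reporting 999 can still yield a valid capture, the loop checks the capture itself rather than the status code.

```python
from typing import Callable, Optional, Tuple


def describe_status(code: int) -> str:
    """Return a rough label for an HTTP status code."""
    if code == 999:
        # Non-standard code; described in this thread as a user-agent block.
        return "blocked (non-standard 999)"
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 600:
        return "error"
    return "unknown"


def crawl_until_captured(
    run_test_crawl: Callable[[], Tuple[int, bool]],
    max_attempts: int = 3,
) -> Optional[int]:
    """Repeat a test crawl until one attempt yields a valid capture.

    run_test_crawl is a hypothetical callable returning
    (seed_status_code, has_valid_capture). Returns the number of the
    successful attempt, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        status, captured = run_test_crawl()
        print(f"attempt {attempt}: {describe_status(status)}, captured={captured}")
        if captured:
            # A 999 seed status can still come with a usable capture.
            return attempt
    return None
```

For example, passing a callable that fails once and then succeeds makes `crawl_until_captured` return 2, even if both attempts report status 999.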
Thanks for the great question!
Mary
Hi, Julia. Mary's advice above is still current. I'd recommend starting with her steps 1-4, if you have not already. If that fails to archive a specific LinkedIn seed, please feel free to share it with us directly and we can take a closer look with you.