Many websites use third-party document hosting platforms (such as Issuu, Scribd, and similar services) to embed publications, letters, and other types of documents into their pages. These complex embedded applications can be more challenging for our crawler to archive than other embedded media, such as images. However, you can improve the crawler's ability to archive these materials by making the targeted scope modifications described below.
We will add specific directions for archiving content from other such document hosting platforms as they arise in your crawls. If you experience any difficulty archiving content from these or other such services, please contact us.
How to archive Issuu publications
Select your seed URLs
At present, our crawling technologies can only fully crawl and archive publications hosted by Issuu when crawled at the seed level, meaning that each individual issue of a publication (for example: http://issuu.com/nyu.news/docs/wsn021915) must be added as an individual seed in your collection. (If you wish, you can mark each seed as "Private" in its collection's management interface so that it will not display on the public seed-listing page on archive-it.org, but users will still be able to navigate to the issues from within your collections.)
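If you are assembling a long seed list, it can help to check that each entry points at an individual issue rather than at a publisher's profile page. The sketch below is only an illustration: the function name `is_issuu_issue_url` is hypothetical, and the `/<publisher>/docs/<issue>` path pattern and the acceptance of `www.issuu.com` are assumptions inferred from the single example URL above, not a documented Issuu URL scheme.

```python
from urllib.parse import urlparse


def is_issuu_issue_url(url: str) -> bool:
    """Return True if url looks like an individual Issuu issue.

    Assumption: issue URLs have the shape
    http(s)://issuu.com/<publisher>/docs/<issue>, as in the example above.
    """
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    # Accepting the www. variant is an assumption, not confirmed by the source.
    if parts.hostname not in ("issuu.com", "www.issuu.com"):
        return False
    # Drop empty segments produced by leading or trailing slashes.
    segments = [s for s in parts.path.split("/") if s]
    return len(segments) == 3 and segments[1] == "docs"


candidates = [
    "http://issuu.com/nyu.news/docs/wsn021915",  # individual issue
    "http://issuu.com/nyu.news",                 # publisher page only
]
seeds = [u for u in candidates if is_issuu_issue_url(u)]
```

Here `seeds` keeps only the first URL, which is the form that should be added as a seed.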
- Expand the scope of your crawl to include the following SURT: http://(com,issuu,
- Expand the scope of your crawl to include the following SURT: http://(pub,isu,
- Add a rule to Ignore Robots.txt blocks on the host issuu.com
- Block the host blog.issuu.com
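The two SURT rules above follow a regular pattern: the host's labels are reversed, joined with commas, and left open with a trailing comma so that subdomains also match. As a minimal sketch of that pattern (the helper name `host_to_surt_prefix` is hypothetical and is not part of any Archive-It tool), the following reproduces both prefixes from their hostnames:

```python
def host_to_surt_prefix(host: str) -> str:
    """Build an http SURT prefix for a host and its subdomains.

    Example: "issuu.com" becomes "http://(com,issuu,".
    """
    # Normalize case and drop any trailing dot before splitting into labels.
    labels = host.strip().lower().rstrip(".").split(".")
    # Reverse the labels (com, issuu) and leave the prefix open with a comma.
    return "http://(" + ",".join(reversed(labels)) + ","


for host in ("issuu.com", "isu.pub"):
    print(host, "->", host_to_surt_prefix(host))
```

The second rule corresponds to the host isu.pub, which is why its SURT reads `http://(pub,isu,`.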
At present, successfully archived Issuu publications can only reliably be replayed through Wayback in Proxy Mode. We are working on improving our ability to both capture and replay Issuu publications and will update this page as those improvements are made. Issuu changes its source code regularly and radically, so if you encounter any new problems capturing or replaying its content, submit a support ticket and we will investigate.
How to archive Scribd documents
Our partners have had qualified success archiving documents embedded with the hosting service Scribd. Because this service changes its highly dynamic source code frequently, we regularly update our advice on crawl scoping modifications. Please use the rules below to give our crawler the best possible chance to archive Scribd-hosted documents. If you experience any further issues capturing or replaying them, contact us directly for assistance.
- Expand the scope of your crawl to include the hosts: scribd.com, scribdassets.com, and www.scribd.com
- Add a rule to Ignore Robots.txt blocks and set a limit of 1,000 documents on the host: scribd.com