Overview
Many websites use third-party document hosting platforms (such as Issuu, Scribd, and other similar services) to embed publications, letters, and other types of documents into their pages. These complex embedded applications can be more challenging for our crawler to archive than other embedded media, such as images. However, you can improve the crawler's ability to archive these materials by performing some targeted scope modifications. This guide provides an overview of how to properly format, scope, and crawl Issuu and Scribd seeds.
Known issues
Currently, there are no known issues with archiving Scribd.
At present, successfully archived Issuu publications will not fully replay. We are working on improving our ability to both capture and replay Issuu publications and will update this page as those improvements are made. Issuu changes its source code regularly and radically, so if you encounter any new problems capturing or replaying its content, submit a support ticket and we will investigate.
You can find a full list of known issues for archiving various platforms on our Status of monitored platforms page.
On this page:
- How to select and format Issuu publications seeds
- How to scope Issuu seeds
- How to scope Scribd seeds
- Running your crawl
How to select and format Issuu publications seeds
At present, our crawling technologies can only fully crawl and archive publications hosted by Issuu when they are crawled as seeds, meaning that each individual issue of a publication (for example: http://issuu.com/nyu.news/docs/wsn021915) must be added as an individual seed in your collection. (If you wish, you can use the "Private" setting for these seeds in the collection's management interface. This will prevent them from displaying on the public seed-listing page on archive-it.org. Users will still be able to navigate to the issues from within your collections.)
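If you are preparing a long list of Issuu seeds, it can help to check that each URL points at an individual issue before adding it. The sketch below is one possible way to do that in Python. It assumes, based only on the example URL above, that an individual issue has the form issuu.com/<account>/docs/<document>; the function name is_issuu_issue_url is hypothetical and is not part of any Archive-It tool.

```python
from urllib.parse import urlparse


def is_issuu_issue_url(url: str) -> bool:
    """Return True if url looks like a single Issuu issue.

    Assumption: an issue URL has exactly three path segments,
    <account>/docs/<document>, on the host issuu.com (or www.issuu.com).
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host not in ("issuu.com", "www.issuu.com"):
        return False
    # Drop empty segments produced by leading or trailing slashes.
    segments = [s for s in parsed.path.split("/") if s]
    return len(segments) == 3 and segments[1] == "docs"


candidates = [
    "http://issuu.com/nyu.news/docs/wsn021915",  # individual issue
    "http://issuu.com/nyu.news",                 # account page, not an issue
]
seeds = [u for u in candidates if is_issuu_issue_url(u)]
```

Here seeds keeps only the first URL; the account page would need to be replaced by the URLs of its individual issues.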
How to scope Issuu seeds
Default scoping for Issuu seeds
New Issuu seeds added to collections will have the following default scoping rules applied automatically at the seed level; older Issuu seeds can be updated by adding the scoping rules below manually or following these instructions.
To learn more, please visit Sites with automated scoping rules.
To scope your Issuu seeds:
- Expand the scope of your crawl to include the following SURT: http://(com,issuu,
- Expand the scope of your crawl to include the following SURT: http://(pub,isu,
- Add a rule to Ignore Robots.txt blocks on the host issuu.com
- Block the host blog.issuu.com
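To see what the rules above cover, it may help to model them in a few lines of Python. A SURT lists the host's labels in reverse order, so a prefix such as http://(com,issuu, matches issuu.com and any of its subdomains. The sketch below is a simplified illustration only: it is not how the crawler itself evaluates scope, it ignores the robots.txt rule, and the function names host_to_surt_prefix and in_scope are hypothetical.

```python
from urllib.parse import urlparse

# Rules taken from the list above.
SURT_PREFIXES = ["http://(com,issuu,", "http://(pub,isu,"]
BLOCKED_HOSTS = {"blog.issuu.com"}


def host_to_surt_prefix(host: str) -> str:
    """Reverse the host's labels into a SURT-style prefix (simplified)."""
    labels = host.lower().split(".")
    return "http://(" + ",".join(reversed(labels)) + ","


def in_scope(url: str) -> bool:
    """Return True if url matches an expanded SURT and is not on a blocked host."""
    host = (urlparse(url).hostname or "").lower()
    if host in BLOCKED_HOSTS:
        return False
    surt = host_to_surt_prefix(host)
    # The trailing comma in each prefix keeps look-alike hosts
    # (for example notissuu.com) from matching.
    return any(surt.startswith(prefix) for prefix in SURT_PREFIXES)
```

Under this model, a URL on issuu.com or on any subdomain of isu.pub is in scope, a URL on blog.issuu.com is blocked, and a URL on an unrelated host is out of scope.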
How to scope Scribd seeds
To scope your Scribd seeds:
- Expand the scope of your crawl to include the hosts: scribd.com, scribdassets.com, and www.scribd.com
- Add a rule to Ignore Robots.txt blocks
- Place a document limit of 1,000 documents on the host: scribd.com
Running your crawl
We recommend crawling Issuu publications using Brozzler.
If you experience any issues capturing or replaying Issuu or Scribd seeds, contact us directly for further assistance.