Overview
Many websites use third-party document hosting platforms (such as Issuu, Scribd, and other similar services) to embed publications, letters, and other types of documents into their pages. These complex embedded applications can be more challenging for our crawler to archive than other embedded media, such as images. However, you can improve the crawler's ability to archive these materials by performing some targeted scope modifications. This guide provides an overview of how to properly format, scope, and crawl Issuu and Scribd seeds.
Known issues
Issuu
⚠️ Because Issuu changes its source code regularly and radically, successfully collected Issuu publications may not fully replay.
Scribd
✅ Currently, there are no known issues with archiving Scribd.
If you experience any issues collecting or replaying Issuu or Scribd seeds, submit a support ticket.
On this page:
- How to select and format Issuu publication seeds
- How to scope Issuu seeds
- How to scope Scribd seeds
- Running your crawl
How to select and format Issuu publication seeds
Issuu aggregates individual publications on publisher pages and curated stack pages. For a publication to replay as a flipbook, you currently need to add its iframe URL (the rd4 viewer) as a seed. Do this for each publication you want to archive.
How to scope Issuu seeds
Default scoping for Issuu seeds
New Issuu seeds added to a collection automatically receive the default scoping rules listed below, applied at the seed level. Older Issuu seeds do not receive them automatically, but you can add the same rules manually.
To scope older Issuu seeds:
- Expand the scope of your crawl to include the following SURT: http://(com,issuu,
- Expand the scope of your crawl to include the following SURT: http://(pub,isu,
- Add a rule to Ignore Robots.txt blocks on the host issuu.com
- Exclude the host blog.issuu.com
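The SURT prefixes in the rules above are the host name with its labels reversed and comma-separated. The following is a minimal Python sketch of that relationship, not the crawler's own implementation; the function names `host_to_surt_prefix` and `url_matches_prefix` are hypothetical, and the matching logic is a simplification that looks only at the host.

```python
from urllib.parse import urlsplit


def host_to_surt_prefix(host: str) -> str:
    """Build a host-level SURT prefix, e.g. issuu.com -> http://(com,issuu,"""
    # Reverse the dot-separated labels and join them with commas.
    labels = host.lower().strip(".").split(".")
    return "http://(" + ",".join(reversed(labels)) + ","


def url_matches_prefix(url: str, surt_prefix: str) -> bool:
    """Simplified check (an assumption): compare only the URL's host to the prefix."""
    host = urlsplit(url).hostname or ""
    return host_to_surt_prefix(host).startswith(surt_prefix)


print(host_to_surt_prefix("issuu.com"))  # http://(com,issuu,
print(host_to_surt_prefix("isu.pub"))    # http://(pub,isu,
```

Under this simplified model, a URL on blog.issuu.com also matches the `http://(com,issuu,` prefix, which illustrates why the rules list that host as a separate exclusion.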
How to scope Scribd seeds
To scope your Scribd seeds:
- Expand the scope of your crawl to include the hosts: scribd.com, scribdassets.com, and www.scribd.com
- Add a rule to Ignore Robots.txt blocks
- Place a document limit of 1,000 documents on the host: scribd.com
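To illustrate how a host scope and a document limit can interact, here is a small Python sketch. It is a toy model under stated assumptions, not how the crawler is implemented: the names `IN_SCOPE_HOSTS`, `DOCUMENT_LIMITS`, and `should_capture` are hypothetical, and it assumes the limit is counted against the exact host only. The Ignore Robots.txt rule is not modeled.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hosts added to scope in the steps above.
IN_SCOPE_HOSTS = {"scribd.com", "scribdassets.com", "www.scribd.com"}
# Document limit from the steps above (assumed here to apply per exact host).
DOCUMENT_LIMITS = {"scribd.com": 1000}


def should_capture(url: str, captured: Counter) -> bool:
    """Return True and count the URL if it is in scope and under its host limit."""
    host = (urlsplit(url).hostname or "").lower()
    if host not in IN_SCOPE_HOSTS:
        return False
    limit = DOCUMENT_LIMITS.get(host)
    if limit is not None and captured[host] >= limit:
        return False
    captured[host] += 1
    return True


captured = Counter()
should_capture("https://scribd.com/document/1", captured)  # in scope, counted
should_capture("https://example.com/", captured)           # out of scope, skipped
```

Once the counter for a limited host reaches its limit, further URLs on that host are skipped while other in-scope hosts are unaffected.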
Running your crawl
Crawl Issuu publications using Brozzler.
What to expect from archived Issuu seeds
To advance pages, click the horizontal bar. If pages are missing, use the Wayback QA tool to collect them and run a patch crawl.