Archiving Issuu and Scribd

Overview

Many websites use third-party document hosting platforms (such as Issuu, Scribd, and other similar services) to embed publications, letters, and other types of documents into their pages. These complex embedded applications can be more challenging for our crawler to archive than other embedded media, such as images. However, you can improve the crawler's ability to archive these materials by performing some targeted scope modifications. This guide provides an overview of how to properly format, scope, and crawl Issuu and Scribd seeds.

Known issues

Issuu

⚠️ Issuu changes its source code regularly and radically, so even publications that were collected successfully may not replay fully.

Scribd

✅ Currently, there are no known issues with archiving Scribd.

If you experience any issues collecting or replaying Issuu or Scribd seeds, submit a support ticket.

On this page:

How to select and format Issuu publication seeds
How to scope Issuu seeds
How to scope Scribd seeds
Running your crawl

How to select and format Issuu publication seeds

Issuu aggregates individual publications on publisher pages and curated stack pages. For a publication to replay as a flipbook, you currently need to add its iframe URL (the rd4 viewer) as a seed. Add one such seed for each publication.
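One way to find iframe URLs is to read them out of the HTML of the page that embeds the publication. The sketch below uses only the Python standard library. The function name, the sample HTML, and the example.test URLs are hypothetical and do not reflect Issuu's actual viewer URL format; review each result by hand before adding it as a seed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeCollector(HTMLParser):
    """Collects the src attribute of every <iframe> tag in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative and scheme-relative URLs against the page URL.
            absolute = urljoin(self.base_url, src)
            if absolute not in self.sources:
                self.sources.append(absolute)


def iframe_seed_candidates(page_html, base_url):
    """Return the de-duplicated, absolute iframe URLs found in page_html."""
    collector = IframeCollector(base_url)
    collector.feed(page_html)
    return collector.sources


# Hypothetical page markup, for illustration only.
sample_html = """
<html><body>
  <iframe src="//viewer.example.test/embed?id=1"></iframe>
  <iframe src="/embed/2"></iframe>
  <iframe src="//viewer.example.test/embed?id=1"></iframe>
</body></html>
"""

for url in iframe_seed_candidates(sample_html, "https://www.example.test/news"):
    print(url)
```

The script only lists candidates. Deciding which iframe belongs to the publication viewer, and adding it to the collection as a seed, remains a manual step.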

How to scope Issuu seeds

Default scoping for Issuu seeds

New Issuu seeds added to collections have the following default scoping rules applied automatically at the seed level. Older Issuu seeds can be updated by manually adding the scoping rules listed below.

To scope older Issuu seeds:

Expand the scope of your crawl to include the following SURT: http://(com,issuu,
Expand the scope of your crawl to include the following SURT: http://(pub,isu,
Add a rule to Ignore Robots.txt blocks on the host issuu.com
Exclude the host blog.issuu.com
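To illustrate what the two SURT prefixes above cover, the following sketch reverses a host name into SURT prefix form and tests whether a URL would fall under the expanded scope. It is a simplified model for reasoning about the rules, not the crawler's own implementation: the function names are hypothetical, every URL is normalized to the http scheme, and ports, query strings, and the robots.txt rule are ignored.

```python
from urllib.parse import urlsplit


def host_to_surt_prefix(host):
    """Turn a host such as 'issuu.com' into the SURT prefix 'http://(com,issuu,'."""
    labels = host.lower().strip(".").split(".")
    return "http://(" + ",".join(reversed(labels)) + ","


def url_to_surt(url):
    """Simplified SURT form of a URL: reversed host labels, then the path."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    return "http://(" + ",".join(reversed(host.split("."))) + ",)" + path


# The two scope expansions and the one exclusion listed above.
EXPAND_PREFIXES = [host_to_surt_prefix("issuu.com"), host_to_surt_prefix("isu.pub")]
EXCLUDED_HOSTS = {"blog.issuu.com"}


def in_issuu_scope(url):
    """True if the URL matches an expansion prefix and is not on an excluded host."""
    host = (urlsplit(url).hostname or "").lower()
    if host in EXCLUDED_HOSTS:
        return False
    surt = url_to_surt(url)
    return any(surt.startswith(prefix) for prefix in EXPAND_PREFIXES)


print(host_to_surt_prefix("issuu.com"))  # http://(com,issuu,
print(host_to_surt_prefix("isu.pub"))    # http://(pub,isu,
```

Because each prefix ends with a comma rather than a closing parenthesis, it matches the named host and all of its subdomains, which is why blog.issuu.com needs its own exclusion rule.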

How to scope Scribd seeds

To scope your Scribd seeds:

Expand the scope of your crawl to include the hosts: scribd.com, scribdassets.com, and www.scribd.com
Add a rule to Ignore Robots.txt blocks
Place a document limit of 1,000 documents on the host: scribd.com
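The combined effect of these three rules can be modeled as a host allow-list plus a per-host counter. The sketch below is only an illustration under stated assumptions: the class and constant names are hypothetical, hosts are matched exactly (subdomains are not included), and the limit is counted against the exact host string scribd.com. The crawler itself may count documents differently.

```python
from urllib.parse import urlsplit

# Hosts added to the crawl scope, and the document limit, as listed above.
SCRIBD_HOSTS = {"scribd.com", "www.scribd.com", "scribdassets.com"}
DOCUMENT_LIMITS = {"scribd.com": 1000}


class HostLimiter:
    """Accepts URLs on in-scope hosts until a host reaches its document limit."""

    def __init__(self, hosts, limits):
        self.hosts = set(hosts)
        self.limits = dict(limits)
        self.counts = {}

    def allow(self, url):
        host = (urlsplit(url).hostname or "").lower()
        if host not in self.hosts:
            return False  # outside the expanded scope (assumes exact host match)
        limit = self.limits.get(host)
        count = self.counts.get(host, 0)
        if limit is not None and count >= limit:
            return False  # document limit for this host already reached
        self.counts[host] = count + 1
        return True


limiter = HostLimiter(SCRIBD_HOSTS, DOCUMENT_LIMITS)
accepted = sum(limiter.allow("https://scribd.com/doc/%d" % i) for i in range(1500))
print(accepted)  # 1000
```

In this model the cap stops the crawl from following an unbounded number of Scribd pages while leaving the asset host uncapped.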

Running your crawl

Crawl Issuu publications using Brozzler.

What to expect from archived Issuu seeds

To advance pages in an archived publication, click the horizontal bar. If pages are missing, use the Wayback QA tool to identify the missing documents, then run a patch crawl to collect them.
