Archiving Tumblr sites

Overview

Tumblr is a microblogging and social networking website that allows users to publish short blog posts and share multimedia. This guide provides an overview of how to properly format, scope, and crawl Tumblr seeds.

Known issues

There are currently no known issues for archiving Tumblr. For a full list of known issues across platforms, please visit our Status of monitored platforms page.

On this page:

How to select and format your Tumblr seeds
Scoping Tumblr seeds
Running your crawl
What to expect from archived Tumblr seeds

How to select and format your Tumblr seeds

To archive a Tumblr site and all of its contents, format the seed URL with a slash ( / ) at the end, as our default crawling scope requires. An effective seed URL for a Tumblr site will typically look like: https://chicagoarchitecturebiennial.tumblr.com/
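If you prepare seed lists in bulk, a small script can check the trailing slash before you add the seeds. The following is a minimal sketch, not part of our software: the function name format_tumblr_seed is hypothetical, and defaulting to https:// when no scheme is given is an assumption made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def format_tumblr_seed(url: str) -> str:
    """Return the seed URL with a trailing slash on its path."""
    url = url.strip()
    # Assumption: a seed typed without a scheme is meant to be https.
    if "://" not in url:
        url = "https://" + url
    parts = urlsplit(url)
    # Add the slash only when the path does not already end with one.
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(format_tumblr_seed("https://chicagoarchitecturebiennial.tumblr.com"))
```

A seed that already ends in a slash is returned unchanged, so the function is safe to run over an existing seed list.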

Scoping Tumblr seeds

Our software can crawl, archive, and replay Tumblr sites successfully without scope modifications.

On occasion, Tumblr sites have been known to lead to crawler traps, which can unnecessarily deplete your account's document or data budget. Review your Hosts report to determine whether our crawler is archiving an unnecessarily large number of URLs from hosts that support Tumblr and, if so, apply the following modifications:

Limit your crawl to exclude documents from the host domain of your seed that contain the following text: /llsid
Limit your crawl to exclude documents from the host domain of your seed that contain the following text: ?route=
Limit your crawl to exclude documents from the host domain of your seed that contain the following text: ?before_time=
Limit your crawl to exclude documents from the following host: px.srvcs.tumblr.com
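To preview which URLs from a Hosts report these four rules would exclude, the rules can be expressed as a simple check. This is an illustrative sketch only and does not reproduce how our crawler applies scope rules; the function name is_excluded is hypothetical, and treating "host domain of your seed" as an exact host match is an assumption.

```python
from urllib.parse import urlsplit

# Text patterns and host taken from the scoping rules listed above.
EXCLUDED_TEXT = ("/llsid", "?route=", "?before_time=")
EXCLUDED_HOSTS = ("px.srvcs.tumblr.com",)


def is_excluded(url: str, seed_host: str) -> bool:
    """Return True if the URL matches one of the exclusion rules."""
    host = (urlsplit(url).hostname or "").lower()
    # Rule 4: exclude everything from the listed host.
    if host in EXCLUDED_HOSTS:
        return True
    # Rules 1-3: apply only to documents on the seed's own host
    # (assumption: exact host match rather than subdomain match).
    if host == seed_host.lower():
        return any(text in url for text in EXCLUDED_TEXT)
    return False


if __name__ == "__main__":
    seed_host = "chicagoarchitecturebiennial.tumblr.com"
    sample = "https://chicagoarchitecturebiennial.tumblr.com/page/2?before_time=1500000000"
    print(is_excluded(sample, seed_host))
```

Running the check over the URLs listed for a host can help confirm that the rules remove the trap URLs without catching ordinary post pages.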

Running your crawl

We recommend that you crawl your seeds using Brozzler for best results.

What to expect from archived Tumblr seeds

The directions above should prepare our crawlers to find and archive Tumblr-hosted material, but it is important to review the results of these crawls for completeness and accuracy. Note that the "Archive" tab on Tumblr feeds will not replay in Wayback (e.g. https://chicagoarchitecturebiennial.tumblr.com/archive).
