Overview
Tumblr is a microblogging and social networking website that allows users to publish short blog posts and share multimedia. This guide provides an overview of how to properly format, scope, and crawl Tumblr seeds.
Known issues
There are currently no known issues for archiving Tumblr. For a full list of known issues for archiving various platforms, please visit our Status of monitored platforms page.
On this page:
- How to select and format your Tumblr seeds
- Scoping Tumblr seeds
- Running your crawl
- What to expect from archived Tumblr seeds
How to select and format your Tumblr seeds
To archive a Tumblr site and all of its contents, format the seed URL with a slash ( / ) at the end, as our default crawling scope requires. An effective seed URL for a Tumblr site will typically look like: https://chicagoarchitecturebiennial.tumblr.com/
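If you are preparing many seeds at once, the trailing-slash rule above can be checked before you add them. The following is a minimal sketch in Python using only the standard library; the function name normalize_tumblr_seed is hypothetical and is not part of our software. It assumes each seed points at the root of a Tumblr site.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_tumblr_seed(url: str) -> str:
    """Return the seed URL with a trailing slash on its path.

    Hypothetical helper for illustration only; it assumes the seed
    is the root of a Tumblr site rather than an individual post.
    """
    parts = urlsplit(url.strip())
    # Append a slash only when the path does not already end with one.
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(normalize_tumblr_seed("https://chicagoarchitecturebiennial.tumblr.com"))
```

A seed that already ends with a slash is returned unchanged.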
Scoping Tumblr seeds
Our software can crawl, archive, and replay Tumblr sites without scope modifications.
On occasion, Tumblr sites have been known to lead to crawler traps, which can unnecessarily deplete your account's document or data budget. Review your Hosts report to determine whether our crawler is archiving an excessive number of URLs from hosts that support Tumblr, and if so, apply the following modifications:
- Limit your crawl to block any URLs from the host domain of your seed that contain the following text: /llsid
- Limit your crawl to block any URLs from the host domain of your seed that contain the following text: ?route=
- Limit your crawl to block any URLs from the host domain of your seed that contain the following text: ?before_time=
- Limit your crawl to block all URLs from the following host: px.srvcs.tumblr.com
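To preview which URLs from a Hosts report these rules would exclude, the four modifications above can be approximated locally. This is a rough sketch, not how our crawler evaluates scope rules: the function name would_be_blocked is hypothetical, and it assumes that "the host domain of your seed" means an exact hostname match and that the text rules are plain substring matches on the full URL.

```python
from urllib.parse import urlsplit

# Text and host values taken from the scoping rules listed above.
BLOCKED_SUBSTRINGS = ("/llsid", "?route=", "?before_time=")
BLOCKED_HOSTS = ("px.srvcs.tumblr.com",)


def would_be_blocked(url: str, seed_host: str) -> bool:
    """Approximate the scoping rules for a single URL.

    Assumption: the seed's host domain is compared as an exact,
    case-insensitive hostname match.
    """
    host = (urlsplit(url).hostname or "").lower()
    # Host rule: block everything from the listed host.
    if host in BLOCKED_HOSTS:
        return True
    # Text rules: apply only to URLs on the seed's own host.
    if host == seed_host.lower():
        return any(text in url for text in BLOCKED_SUBSTRINGS)
    return False


if __name__ == "__main__":
    seed_host = "chicagoarchitecturebiennial.tumblr.com"
    sample = "https://chicagoarchitecturebiennial.tumblr.com/page/2?route=/page/:page"
    print(would_be_blocked(sample, seed_host))
```

Running sample URLs from your Hosts report through a check like this can help confirm that the rules target the trap URLs and leave ordinary post and page URLs in scope.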
Running your crawl
We recommend that you crawl your seeds using Brozzler for best results.
What to expect from archived Tumblr seeds
The above directions should prepare our crawlers to find and archive Tumblr-hosted material, but it is important to review the results of these crawls for completeness and accuracy. Note that the "Archive" tab on Tumblr feeds will not replay in Wayback (e.g. https://chicagoarchitecturebiennial.tumblr.com/archive).