Our software can crawl, archive, and replay Tumblr sites without scope modifications. When adding a Tumblr site as a seed, format its URL with a slash ( / ) at the end; our default crawling scope requires the trailing slash in order to archive the site and all of its component contents. An effective seed URL for a Tumblr site will typically look like: http://chicagoarchitecturebiennial.tumblr.com/
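If you prepare seed lists in bulk, a small script can check the trailing slash before you add the seeds. The following is a minimal sketch, not part of our software; the function name `normalize_seed` is hypothetical and the script only uses Python's standard library.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_seed(url: str) -> str:
    """Return the seed URL with a trailing slash on its path.

    Hypothetical helper for illustration only; it does not contact
    the site or the crawler.
    """
    parts = urlsplit(url.strip())
    # Append a slash only when the path does not already end with one.
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(normalize_seed("http://chicagoarchitecturebiennial.tumblr.com"))
```

A seed that already ends with a slash is returned unchanged, so the helper is safe to run over a whole list.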
On occasion, Tumblr sites have been known to lead to crawler traps, which can unnecessarily deplete your account's document or data budget. Review your Hosts report to determine whether our crawler is archiving an unnecessarily large number of URLs from hosts that support Tumblr, and if so, apply the following modifications:
- Limit your crawl to block any URLs from the host domain of your seed that contain the following text: /llsid
- Limit your crawl to block any URLs from the host domain of your seed that contain the following text: ?route=
- Limit your crawl to block any URLs from the host domain of your seed that contain the following text: ?before_time=
- Limit your crawl to block all URLs from the following host: px.srvcs.tumblr.com
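To preview which URLs these rules would exclude, you can test a sample of URLs yourself before changing your scope. The sketch below is an illustration only and is not how our crawler applies the rules; the function name `is_blocked` is hypothetical, and it assumes that "host domain of your seed" means an exact match on the seed's hostname.

```python
from urllib.parse import urlsplit

# Text patterns and host taken from the rules listed above.
BLOCKED_SUBSTRINGS = ("/llsid", "?route=", "?before_time=")
BLOCKED_HOSTS = ("px.srvcs.tumblr.com",)


def is_blocked(url: str, seed_host: str) -> bool:
    """Return True if the URL matches one of the block rules.

    Assumption: the text rules apply only when the URL's hostname
    equals the seed's hostname exactly.
    """
    host = (urlsplit(url).hostname or "").lower()
    # Rule: block every URL from the listed host.
    if host in BLOCKED_HOSTS:
        return True
    # Rules: block seed-host URLs that contain any of the listed text.
    if host == seed_host.lower():
        return any(text in url for text in BLOCKED_SUBSTRINGS)
    return False


if __name__ == "__main__":
    seed_host = "chicagoarchitecturebiennial.tumblr.com"
    samples = [
        "http://chicagoarchitecturebiennial.tumblr.com/post/123",
        "http://chicagoarchitecturebiennial.tumblr.com/archive?before_time=1443000000",
        "http://px.srvcs.tumblr.com/impixu?x=1",
    ]
    for url in samples:
        print(is_blocked(url, seed_host), url)
```

The sample URLs are made up for the example; substitute URLs copied from your own Hosts report.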
The above directions should best prepare our crawlers to find and archive Tumblr-hosted material, but it is important to review the results of these crawls for completeness and accuracy.