You can add Twitter feeds, including hashtag feeds, to your collection and crawl, archive, and replay them as you would any other seed site, as long as you format and scope them according to a few simple rules.
How to select and format your Twitter seeds
It's important to be specific when selecting your Twitter seeds. They can take the form of a specific user's feed like http://twitter.com/internetarchive/, a hashtag feed, or a specific search. Follow our standard guidance for adding seeds, but remember the following principles:
- Add an ending '/' to the URL, for example: http://twitter.com/internetarchive/. This allows you to archive only the feed that you specify, rather than all of Twitter!
- Do not add www to your Twitter seed. Twitter URLs do not have a www by default, and www.twitter.com is blocked by a robots exclusion.
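The two formatting rules above can be checked before you add a seed. The following is a minimal sketch, not part of the Archive-It web application: the helper name `normalize_twitter_seed` is hypothetical, and it is intended for user-feed URLs such as the example above rather than search URLs with query strings.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_twitter_seed(url: str) -> str:
    """Return the seed URL without a leading 'www.' and with an ending '/'.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url.strip())

    # Rule: do not add www to the Twitter seed.
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]

    # Rule: add an ending '/' so only the specified feed is in scope.
    path = parts.path if parts.path.endswith("/") else parts.path + "/"

    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(normalize_twitter_seed("http://www.twitter.com/internetarchive"))
```

A seed that is already formatted correctly passes through unchanged, so the helper is safe to run on a whole seed list.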
How to modify the scope of Twitter seeds
The proper formatting above enables our crawler to access Twitter feeds. To ensure that it also archives all of the relevant content it finds there, and to keep it from archiving too much material from remote areas of Twitter, you may apply the following optional scope modifications:
Exclude additional languages
For any given "tweet," the page is captured in all languages that the Twitter interface supports. For example, for each original tweet's URL archived in the following format...
http://twitter.com/[user name]/[tweet ID]/
...the following URLs are also archived:
http://twitter.com/[user name]/[tweet ID]/?lang=ko (Korean)
http://twitter.com/[user name]/[tweet ID]?lang=es (Spanish)
http://twitter.com/[user name]/[tweet ID]?lang=fr (French)
If you prefer to prevent multiple languages from archiving, and subsequently from replaying in Wayback, limit the scope of your collection or your specific Twitter seed to block URLs that match the following regular expression: ^.*lang=(?!en).*$
When this rule is added at the collection level, twitter.com should be listed as the host.
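To see which URLs the block rule would catch before you run a test crawl, you can try the expression locally. This is a rough sketch using Python's `re` module for illustration only; the crawler's own regex engine may differ in edge cases, and the sample URLs (`exampleuser`, `12345`) are made up.

```python
import re

# The block rule from the text: matches any URL whose lang= value
# does not start with "en".
BLOCK_NON_ENGLISH = re.compile(r"^.*lang=(?!en).*$")


def is_blocked(url: str) -> bool:
    """Return True if the URL matches the block rule."""
    return BLOCK_NON_ENGLISH.match(url) is not None


if __name__ == "__main__":
    # Hypothetical sample URLs, not real tweets.
    samples = [
        "http://twitter.com/exampleuser/12345/",
        "http://twitter.com/exampleuser/12345?lang=en",
        "http://twitter.com/exampleuser/12345?lang=ko",
        "http://twitter.com/exampleuser/12345?lang=es",
    ]
    for url in samples:
        status = "blocked" if is_blocked(url) else "in scope"
        print(f"{status:9} {url}")
```

URLs without a lang parameter never match the rule, so the original tweet pages stay in scope.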
You can adjust this regular expression to allow archiving in other languages by changing the language abbreviation in the parentheses. To archive only Spanish content, for instance, you can use ^.*lang=(?!es).*$. You will need to know the desired language abbreviation to use this rule. Please be sure to run a test crawl after adding this regular expression.
Alternatively, if you'd like to capture more than one language, you can adjust the regular expression by following the format of this regex, which will archive in English and French: ^.*lang=(?!en|fr).*$
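If you adjust the rule often, a small script can build the expression from a list of language abbreviations. This is only a sketch: `build_lang_block_regex` is a hypothetical helper name, and the output should still be pasted into your scope rules and verified with a test crawl.

```python
import re


def build_lang_block_regex(allowed):
    """Build a block rule that spares the given language abbreviations.

    Hypothetical helper: ["en", "fr"] yields ^.*lang=(?!en|fr).*$
    """
    if not allowed:
        raise ValueError("at least one language abbreviation is required")
    codes = "|".join(re.escape(code) for code in allowed)
    return f"^.*lang=(?!{codes}).*$"


if __name__ == "__main__":
    pattern = build_lang_block_regex(["en", "fr"])
    print(pattern)

    # Quick local check against made-up URLs.
    rule = re.compile(pattern)
    for lang in ("en", "fr", "ko"):
        url = f"http://twitter.com/exampleuser/12345?lang={lang}"
        print(lang, "blocked" if rule.match(url) else "in scope")
```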
Links in Tweets
All links in tweets currently redirect through the Twitter URL shortener, https://t.co/. These links are out of scope by default, but can be scoped-in using the following rules.
- Expand the scope of your crawl, preferably at the seed level, to include URLs that contain the following text:
- Ignore robots.txt blocks, preferably at the seed level, or at the collection level on the host: t.co
- Document limits allow you to specify how many t.co links off your target seed(s) you want to archive each time. This rule type must be added at the collection level. Try starting with a document limit of 200-500 documents on the host t.co.
- Data limits can also be added at the seed level to specify how much data you will allow t.co to add to your crawl. Try starting with a data limit of 1GB-2GB on the seed.
- Each seed and collection will be different; run a test crawl with the new rule(s) in place to make sure they are entered properly and you're getting the content you wanted. You may need to adjust the limit(s) depending on the results.
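To get a feel for what a document limit on t.co does to a crawl, the idea can be modeled in a few lines. This is a conceptual sketch only: it does not reflect how the crawler actually enforces limits, and the function name and sample URLs are hypothetical.

```python
from urllib.parse import urlsplit


def apply_document_limit(urls, host, limit):
    """Keep every URL, except those on `host` beyond the first `limit`.

    Conceptual model of a per-host document limit, for illustration only.
    """
    kept = []
    count = 0
    for url in urls:
        if urlsplit(url).hostname == host:
            if count >= limit:
                continue  # limit reached: skip further documents on this host
            count += 1
        kept.append(url)
    return kept


if __name__ == "__main__":
    # Made-up discovery order for one seed.
    discovered = [
        "https://t.co/a",
        "http://twitter.com/exampleuser/",
        "https://t.co/b",
        "https://t.co/c",
    ]
    for url in apply_document_limit(discovered, "t.co", 2):
        print(url)
```

With a limit of 2, the third t.co link is dropped while URLs on other hosts are unaffected, which is why a limit that is too low can leave some linked pages out of the archive.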
To capture all video content, ignore robots.txt blocks for the following host: video.twimg.com. This isn't necessary if you ignored robots.txt at the seed level, which means that you will capture video content by default.
What to expect from your archived Twitter feeds
Twitter feeds and searches should play back normally, taking into account the information above, with the following exceptions:
- Images in tweets will expand within the tweet, but the full size version will not play back.
- Dynamically scrolling content on search pages might not fully archive or replay.