You can add Twitter feeds, including hashtag feeds, to your collection and crawl, archive, and replay them as you would any other seed site, as long as you format and scope them according to a few simple rules.
How to select and format your Twitter seeds
It's important to be specific when selecting your Twitter seeds. They can take the form of a specific user's feed like http://twitter.com/internetarchive/, a hashtag feed, or a specific search. Follow our standard guidance for adding seeds, but remember the following principles:
- Add an ending '/' to the URL, for example: http://twitter.com/internetarchive/. This allows you to archive only the feed that you specify, rather than all of Twitter.
- Do not add www to your Twitter seed. Twitter URLs do not have a www by default, and www.twitter.com is blocked by a robots exclusion.
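The two formatting rules above can be checked before you enter a seed. The following is a minimal Python sketch, not an Archive-It tool: the function name normalize_twitter_seed is hypothetical, and it assumes a feed-style URL without a query string (it only adjusts the host and path).

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_twitter_seed(url: str) -> str:
    """Strip a leading 'www.' from the host and ensure the path ends in '/'.

    Hypothetical helper for illustration; query strings and fragments
    are passed through unchanged.
    """
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    # Append the ending '/' only when it is missing.
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(normalize_twitter_seed("http://www.twitter.com/internetarchive"))
```

A seed that already follows both rules, such as http://twitter.com/internetarchive/, is returned unchanged.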
How to modify the scope of Twitter seeds
The proper formatting above enables our crawler to access Twitter feeds. To ensure that it also archives all of the proper content it finds there, and furthermore to limit it from archiving too much material from remote areas of Twitter, you may apply the following optional scope modifications:
Exclude additional languages
For any given "tweet," the page is captured in all languages that the Twitter interface supports. For example, for each original tweet's URL archived in the following format...
http://twitter.com/[user name]/[tweet ID]/
...the following URLs are also archived:
http://twitter.com/[user name]/[tweet ID]/?lang=ko (Korean)
http://twitter.com/[user name]/[tweet ID]?lang=es (Spanish)
http://twitter.com/[user name]/[tweet ID]?lang=fr (French)
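If you prefer to archive, and subsequently replay in Wayback, only a single language, limit the scope of your collection or your specific Twitter seed to block URLs that match the following regular expression: ^.*lang=(?!en).*$
This expression matches any URL in which lang= is followed by something other than en, so every language version except English is blocked. URLs with no lang parameter are not affected.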
When this rule is added at the collection level, twitter.com should be listed as the host.
You can adjust this regular expression to allow archiving in other languages by changing the language abbreviation in the parentheses. To archive only Spanish content, for instance, you can use ^.*lang=(?!es).*$. You will need to know the desired language abbreviation to use this rule. Please be sure to run a test crawl after adding this regular expression.
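Before running a test crawl, you can try the expression against a few sample URLs. The sketch below uses Python's re module, on the assumption that it handles this negative lookahead the same way the crawler's regular expression engine does. The tweet ID 123 and the helper name is_blocked are placeholders for illustration.

```python
import re

# The block rule from this article: match URLs whose lang value is not "en".
BLOCK_NON_ENGLISH = re.compile(r"^.*lang=(?!en).*$")
# The Spanish-only variant described above.
BLOCK_NON_SPANISH = re.compile(r"^.*lang=(?!es).*$")


def is_blocked(url: str, pattern: re.Pattern = BLOCK_NON_ENGLISH) -> bool:
    """Return True when the URL matches the block rule (hypothetical helper)."""
    return pattern.match(url) is not None


if __name__ == "__main__":
    samples = [
        "http://twitter.com/internetarchive/123/",          # no lang parameter
        "http://twitter.com/internetarchive/123/?lang=en",  # English
        "http://twitter.com/internetarchive/123/?lang=es",  # Spanish
        "http://twitter.com/internetarchive/123/?lang=ko",  # Korean
    ]
    for url in samples:
        print("blocked" if is_blocked(url) else "allowed", url)
```

With the English rule, only the ?lang=es and ?lang=ko samples are reported as blocked. Passing BLOCK_NON_SPANISH as the second argument blocks the ?lang=en and ?lang=ko samples instead.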
Links in Tweets
All links in tweets currently redirect through the Twitter URL shortener, https://t.co/. These links are out of scope by default, but can be scoped-in using the following rules.
- Expand the scope of your crawl to include URLs that contain the following text:
- Ignore robots.txt blocks for the following host: t.co
To prevent our crawler from archiving too much superfluous content beyond these newly archived pages, limit the scope of your crawl by imposing a document limit on the host t.co at the collection level. Each seed and collection will be different, but a document limit of 200 to 500 on t.co is a good place to start. As always, run a test crawl with the new rule in place to make sure it has been entered properly and that you're getting the content you wanted. You may need to adjust the limit depending on the results.
To capture all video content, ignore robots.txt blocks for the following host: video.twimg.com
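To get a feel for what a document limit on t.co does, the toy model below may help. It is only an illustration in Python: the crawler's actual accounting is not described here, and the function apply_host_limit and the sample t.co paths are invented for the example. It keeps the first N URLs seen for the limited host and drops the rest, leaving other hosts untouched.

```python
from urllib.parse import urlsplit


def apply_host_limit(urls, host, limit):
    """Return the URLs in order, keeping at most `limit` documents for `host`.

    Toy model for illustration; URLs on other hosts are never dropped.
    """
    kept = []
    count = 0
    for url in urls:
        if urlsplit(url).hostname == host:
            if count >= limit:
                continue  # limit reached: this document is skipped
            count += 1
        kept.append(url)
    return kept


if __name__ == "__main__":
    queue = [
        "http://twitter.com/internetarchive/",
        "https://t.co/aaa",  # placeholder short links
        "https://t.co/bbb",
        "https://t.co/ccc",
    ]
    for url in apply_host_limit(queue, "t.co", limit=2):
        print(url)
```

In this example the third t.co link is dropped. A limit that is too low cuts off links you wanted in the same way, which is why a test crawl is the best guide for choosing a value.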
Archiving Twitter searches
Search pages should archive similarly to other Twitter feeds. However, dynamically scrolling content on search pages might not fully archive or replay. As with other Twitter feeds, be sure to include an ending / on any search seed URLs in order to avoid putting all of Twitter in scope.
What to expect from your archived Twitter feeds
Twitter feeds and searches should play back normally, taking into account the information above, with the following exception:
- Images in tweets will expand within the tweet, but the full-size version will not play back.