Archive-It's software enables partners to capture and display content hosted by popular social media services and many other commonly used platforms. These services frequently update and change the ways that they serve content, so use the guidance on this page to avoid running into any problems.
- Before you crawl
- After you crawl
- Sites with automated scoping rules
- Sites with recommended scoping guidelines
Before you crawl
Be specific with your seed URLs.
Be as specific as possible when choosing your seed URL; in other words, add only the page that you want to archive as the seed.
- DO NOT use the whole site as a seed: www.facebook.com or www.twitter.com
- DO use: http://twitter.com/internetarchive/
Double-check your seeds!
Do you need an ending / (slash)? Please be sure to read the site-specific seed instructions below. Not doing so could result in archiving millions of documents unintentionally.
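As a rough illustration of the seed guidance above, the following Python sketch flags seeds that point at a whole platform rather than a specific page. It is not part of Archive-It: the function name `is_too_broad` and the `PLATFORM_HOSTS` set are hypothetical, and the host list is only an assumption based on the examples in this section.

```python
from urllib.parse import urlsplit

# Hypothetical list of platform hosts; extend it for your own collection.
PLATFORM_HOSTS = {"facebook.com", "twitter.com"}


def is_too_broad(seed):
    """Return True when the seed is a bare platform domain with no specific path."""
    # urlsplit only fills in the host when a scheme is present, so add one if missing.
    if "://" not in seed:
        seed = "http://" + seed
    parts = urlsplit(seed)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    has_path = parts.path.strip("/") != ""
    return host in PLATFORM_HOSTS and not has_path


for candidate in ["www.facebook.com", "http://twitter.com/internetarchive/"]:
    verdict = "too broad" if is_too_broad(candidate) else "looks specific"
    print(candidate, "->", verdict)
```

A check like this only catches the broadest mistakes; it does not replace the site-specific seed instructions or a test crawl.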
Run a test crawl first
We strongly recommend running a test crawl on all new seeds before performing a production (non-test) crawl. This will ensure that your seeds are configured correctly and that you won't unintentionally crawl much more content than desired at the expense of your account's data budget.
Limit your crawls
You may want to set up data and/or document limits for these sites if the test crawl shows an unusually large volume of content and you have confirmed that your seed URL is correct.
After you crawl
It is especially important to review your first captures before regularly crawling your new seeds. Please look through your reports and the archived content after you run your first crawls to ensure that your archived content looks accurate and that you didn't crawl more than you intended.
Sites with automated scoping rules
Recommended scoping rules exist for many popular platforms, including social media sites. Automated default scoping rules will be applied when new seed URLs from supported platforms (for example, Facebook and YouTube) are added to a collection or when you manually apply the rules to existing seeds.
What to expect:
Default scoping rules as outlined in “Scoping crawls for specific types of sites” will be added to new seeds.
Optional scoping rules will not be added.
Seeds with embedded content from one of these platforms will need the necessary rules added manually.
Facebook: A 3GB data limit will be added by default to Facebook seeds. To facilitate testing at varying data levels, this rule can be edited without affecting the other default Facebook rules.
Automated scoping rules for new seeds
- When you add a new seed from one of the platforms that apply automatic scoping rules, you’ll see a + icon beside the seed. On hover, this will alert you to the fact that scoping rules will be automatically applied and that these rules may include ignoring robots.txt files.
- After clicking “Add Seeds”, a banner will appear if any of your seeds had scoping rules automatically applied.
- To view the rules that were applied to your seed, click on the hyperlinked seed URL listed in the “Seeds” tab of the collection management interface. Once you are in the seed settings, navigate to the “Seed Scope” tab. A link icon in the “Controls” column indicates automatically applied group rules.
- To toggle off or delete any of the automatically applied rules, click the link icon in the “Controls” column. A dialog box will warn that, if unlinked, these rules will not be automatically updated when changes are made to our recommended scoping.
- Once you click “Confirm” to edit grouped scoping rules, you will be able to toggle individual rules on and off. Once a rule is toggled to the “off” position, the option to delete it will appear.
Note: To facilitate testing Facebook seeds at varying data levels, the 3GB default data limit can be deleted without affecting the other default Facebook rules, which are linked together.
Automated scoping rules for existing seeds
It is possible to bulk add default scoping rules to selected existing seeds from your seed list. Be aware that adding the default rules to existing seeds will delete all seed-level scoping rules from those seeds and replace them with the default rules. You may want to review your collection-level scoping rules and delete any that are no longer necessary.
- Select seed(s) from a collection’s seed tab by clicking the checkboxes to the left of each seed. You can select any seeds; only those with default scoping rules will have seed-level rules applied.
- Click the “Apply Rules” button to add default scoping rules, where relevant, to the selected seeds. Only selected seeds from platforms with default scoping rules (for example, Facebook and YouTube) are affected. A dialog box will appear listing the templates that will be applied and to how many seeds.
- Click the “Apply” button to complete the process.
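The replace-not-merge behavior described above can be sketched as a small Python model. This is only an illustration under stated assumptions, not Archive-It's implementation or API: `DEFAULT_TEMPLATES`, `platform_of`, `apply_default_rules`, and the rule names are all hypothetical placeholders.

```python
from urllib.parse import urlsplit

# Hypothetical templates keyed by platform host; the rule names are placeholders.
DEFAULT_TEMPLATES = {
    "facebook.com": ["facebook default rules", "3GB data limit"],
    "youtube.com": ["youtube default rules"],
}


def platform_of(seed_url):
    """Return the seed's host without a leading 'www.'."""
    host = (urlsplit(seed_url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def apply_default_rules(seed_rules, selected):
    """Return a copy of seed_rules with defaults applied to the selected seeds.

    Seeds whose platform has a template lose their existing seed-level rules;
    seeds without a template are left unchanged.
    """
    updated = {url: list(rules) for url, rules in seed_rules.items()}
    for url in selected:
        template = DEFAULT_TEMPLATES.get(platform_of(url))
        if template is not None:
            updated[url] = list(template)  # replaces, does not merge
    return updated


seeds = {
    "https://www.facebook.com/internetarchive/": ["my custom rule"],
    "https://example.org/": ["my custom rule"],
}
result = apply_default_rules(seeds, list(seeds))
for url, rules in result.items():
    print(url, rules)
```

In this model the Facebook seed's custom rule is discarded while the other seed keeps its rule, which is why it is worth noting any seed-level rules you want to keep before applying the defaults.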
Sites with recommended scoping guidelines
Refer to the following recommended guidelines if you plan to archive content from these sites:
- Blogspot sites
- Flickr streams
- Instagram feeds
- Internet Archive audio and video
- Issuu hosted content
- Omeka sites
- Password protected sites
- Scribd hosted content
- SoundCloud audio
- Squarespace sites
- Tumblr sites
- Twitter feeds
- Vimeo videos
- Wikipedia pages
- Wix sites
- WordPress sites
- YouTube videos