Estimated reading and video time: 90 minutes
Welcome to Archive-It! Archive-It is a web archiving service where you can curate your own content and run your own crawls, then describe and share that content as you wish. This guide is a curated collection of the videos and articles from our Help Center that matter most when you begin using Archive-It; further information can be found elsewhere, but it is not essential for starting out. We highly recommend working through the information on this page before jumping into your Archive-It account.
On this page:
- Getting Started
- Reviewing & Quality Assurance
- Description & Access
- What is web archiving?
- Navigating Archive-It - This ~9 minute video will help you get acquainted with your Archive-It account, giving you a tour that points out where features are located and why you might want to use them.
- Each partner is responsible for monitoring their account's data budget.
- Learn how to set up and administer your account.
Please keep in mind that web archiving is a process; you can repeat steps if necessary.
- Terms to know:
- Collection - A group of archived web documents curated around a common theme, topic, or domain.
- Seed - An item in Archive-It with a unique ID number. The Seed URL is the starting point for crawlers, as well as an access point to archived content.
- Document - Any file with a unique URL - html, image, PDF, video, etc. These are designated separately from seeds, although seeds can be documents as well.
- The archived content in each collection is independent from other collections, including other Archive-It collections, and the Wayback Machine. Learn how seeds, documents, and collections work together.
- Create and manage a collection.
- Select Seed URLs.
- Known web archiving challenges - While Archive-It continuously works towards improvements, some websites are not created in a way that is "archive-friendly" and can be difficult to capture or replay in their entirety. Knowing these limitations up front can help you set your expectations.
- Terms to know:
After completing the readings in this section, you'll understand the multiple ways to adjust scope to capture or exclude content, and which approach is best to use when.
- Each Archive-It crawl has a default set of bounds that prevents it from capturing the entire web: anything embedded on a page, as well as any related links out. You can adjust the default scope at the collection or seed level.
- Some sites have automatic scoping rules applied for your convenience.
- Pre-crawl Scoping - Make sure your seeds are set up correctly before you start your crawls. In this ~7 minute video you'll learn tips for selecting, formatting, and administering your seed URLs before you run a crawl to help capture the data you're looking for.
- Scoping guidance for specific types of sites - Some platforms need specific scoping rules for full capture and replay. Learn about our most up to date guidance for archiving popular social media services and other commonly used platforms.
After completing the readings in this section, you'll understand the different ways to crawl content, and how these decisions can impact your data budget.
- It’s important to manage your data budget as you crawl. Only unsaved test crawls can be deleted; no other kind of crawl can be deleted.
- Types of crawls - There are different types of crawls to best support the Archive-It workflow. For all of these crawls, it's important to note that only New Data will be added to your account, due to our data deduplication feature.
- Test Crawls - Test crawls hold their data temporarily for 60 days, so you can review it before saving or deleting. Once saved, test crawls cannot be deleted. Because the data in a test crawl is held temporarily until it is saved, the only place to see the archived pages from a test crawl is the seed report in its crawl report.
- One time production crawls - The alternative to a test crawl is a production crawl. These permanent crawls automatically add data to your data budget and cannot be deleted.
- Scheduled crawls - Once you have run a test crawl, you’re ready to set up crawls that run at a frequency you choose. Please note that scheduled crawls will automatically add the data to your budget and cannot be deleted.
- Crawling technologies - Archive-It has different crawling technologies that can (and should!) be used to capture different kinds of content.
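To make the "only New Data will be added" point concrete, here is a minimal Python sketch of how deduplication can keep repeat crawls from growing a data budget. It is an illustration only: this guide does not describe how Archive-It identifies duplicates, so keying on the URL plus a SHA-1 digest of the content is an assumption, and `new_data_bytes` and the example URLs are hypothetical names.

```python
import hashlib

def new_data_bytes(seen, captured):
    """Return how many bytes of `captured` would count as new data.

    `seen` is a set of (url, digest) pairs from earlier crawls and is
    updated in place. Keying on URL plus content digest is an assumption
    made for this sketch, not Archive-It's documented behavior.
    """
    total = 0
    for url, body in captured:
        key = (url, hashlib.sha1(body).hexdigest())
        if key not in seen:
            seen.add(key)
            total += len(body)
    return total

seen = set()

# First crawl: nothing has been seen before, so every byte is new.
first = new_data_bytes(seen, [
    ("https://example.org/", b"<html>v1</html>"),
    ("https://example.org/logo.png", b"PNGDATA"),
])

# Second crawl: the home page changed, the logo did not.
second = new_data_bytes(seen, [
    ("https://example.org/", b"<html>v2</html>"),
    ("https://example.org/logo.png", b"PNGDATA"),
])

print(first, second)
```

In this sketch the second crawl only counts the changed home page, which is why recrawling a mostly unchanged site tends to add far less data than the first crawl did.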
Best Practices for first time crawls include:
- Select 10 or fewer seeds at a time (if possible)
- Use the Standard seed type, unless you know you only want One-Page (avoid + seed types for now)
- Crawl Type - Test
- Time Limit - 5 days
- Crawling Technology - Brozzler
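The checklist above can also be written down as a small pre-flight check that you run against your own notes before starting a first crawl. This is a hedged sketch, not an Archive-It feature or API: `CrawlPlan`, `first_crawl_warnings`, and the string values used for crawl types and seed types (including "Standard+" as an example of a + seed type) are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List

# Seed types the checklist suggests for a first crawl.
RECOMMENDED_SEED_TYPES = {"Standard", "One-Page"}

@dataclass
class CrawlPlan:
    """Hypothetical description of a planned crawl (not an Archive-It object)."""
    seed_types: List[str]   # one entry per seed
    crawl_type: str
    time_limit_days: int
    technology: str

def first_crawl_warnings(plan: CrawlPlan) -> List[str]:
    """Return one warning per first-crawl recommendation the plan departs from."""
    warnings = []
    if len(plan.seed_types) > 10:
        warnings.append("more than 10 seeds selected")
    if any(t not in RECOMMENDED_SEED_TYPES for t in plan.seed_types):
        warnings.append("a seed type other than Standard or One-Page is used")
    if plan.crawl_type != "Test":
        warnings.append("crawl type is not Test")
    if plan.time_limit_days != 5:
        warnings.append("time limit is not 5 days")
    if plan.technology != "Brozzler":
        warnings.append("crawling technology is not Brozzler")
    return warnings

good = CrawlPlan(["Standard"] * 3, "Test", 5, "Brozzler")
risky = CrawlPlan(["Standard"] * 11 + ["Standard+"], "Production", 5, "Brozzler")

for warning in first_crawl_warnings(risky):
    print("warning:", warning)
```

An empty list means the plan matches every recommendation; each warning is a prompt to double-check a setting, not a hard error.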
After completing the readings and videos in this section, you'll understand how to review your completed crawls and use the Quality Assurance tools.
Getting the most from your post crawl reports - So you've run a crawl... now what? This ~7 minute video walks through each part of the crawl report with insight into each of the three sub-reports, and the information you can glean from them.
What should I check first in a crawl report?
Understanding your Hosts Report - This ~8 minute video digs into how to use the Hosts Report to learn what was and was not captured in a specific crawl.
Quality Assurance - What can you do if your archived websites don't look quite right in Wayback? Following this checklist may help you improve the capture and replay of your Wayback pages.
After completing the readings in this section, you'll understand the options for describing and sharing your content.
Metadata - There are multiple ways and places to describe your crawled content. Please note that adding the same metadata in multiple places will result in duplicate descriptions.
You can control access to your web archive so that patrons, the general public, and/or your Archive-It users can, or cannot, view specific content that you designate.
Use the suite of Archive-It APIs to provide access to your web archives from your own domain, and integrate with other tools and services.
Now that you have an overview of Archive-It, we encourage you to jump in! After you’ve run some test crawls, you can sign up for Office Hours and a Web Archivist would be happy to chat with you.