Overview

Estimated reading and video time: 90 minutes

Welcome to Archive-It! Archive-It is a web archiving service where you can curate your own content and run your own crawls, then describe and share that content as you wish. This guide collects the videos and articles from our Help Center that matter most when you are learning to use Archive-It; further information can be found elsewhere, but it is not essential to starting out. We highly recommend working through the information on this page before jumping into your Archive-It account.

On this page:

Getting Started
Collections
Scoping
Crawling
Reviewing & Quality Assurance
Description & Access

Getting Started

After completing the readings and video in this section, you'll understand the Archive-It approach to web archiving and have a high-level view of Archive-It:

What is web archiving?
Navigating Archive-It - This ~9-minute video will help you get acquainted with your Archive-It account, giving you a tour that points out where features are located and why you might want to use them.
Each partner is responsible for monitoring their account's data budget.
Learn how to set up and administer your account.

Please keep in mind that web archiving is a process; you can repeat steps if necessary.

Collections

Archive-It is primarily organized around collections. After completing the readings in this section, you'll understand how to build collections:

Terms to know:
- Collection - A group of archived web documents curated around a common theme, topic, or domain.
- Seed - An item in Archive-It with a unique ID number. The Seed URL is the starting point for crawlers and also an access point to archived pages in Wayback.
- Document - Any file with a unique URL - HTML, image, PDF, video, etc. These are designated separately from seeds, although seeds can be documents as well.
The archived pages in each collection are independent from other collections, including other Archive-It collections, and the Wayback Machine. Learn how seeds, documents, and collections work together.
Create and manage a collection.
Select Seed URLs.
Known web archiving challenges - While Archive-It continuously works towards improvements, some websites are not created in a way that is "archive-friendly" and can be difficult to capture or replay in their entirety. Knowing these limitations up front can help you set your expectations.

Scoping

After completing the readings in this section, you'll understand the multiple ways to adjust scope to include or exclude content, and which approach works best in which situation:

Each Archive-It crawl has a default scope, a set of bounds that prevents the crawl from capturing the entire web. The default scope includes anything embedded on a page, as well as any related links out. You can adjust the default scope at the collection or seed level.
Some sites have automatic scoping rules applied for your convenience.
Pre-crawl Scoping - Make sure your seeds are set up correctly before you start your crawls. In this ~7-minute video, you'll learn tips for selecting, formatting, and administering your seed URLs before you run a crawl to help capture the data you're looking for.
Scoping guidance for specific types of sites - Some platforms need specific scoping rules for full capture and replay. Learn about our most up-to-date guidance for archiving popular social media services and other commonly used platforms.
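
The idea of a crawl scope can be illustrated with a rough sketch. The function below is a simplified, hypothetical model and not Archive-It's actual scoping logic: it assumes a URL is in scope when it is embedded content, or when it sits on the seed's host and under the seed's path. The name in_default_scope and the is_embedded flag are invented for this illustration, and the sketch does not model links out or any scoping rules you add yourself.

```python
from urllib.parse import urlsplit


def in_default_scope(seed_url: str, candidate_url: str, is_embedded: bool = False) -> bool:
    """Simplified, hypothetical scope check (an assumption, not Archive-It's real rules)."""
    if is_embedded:
        # The guide states that embedded content is part of the default scope.
        return True
    seed = urlsplit(seed_url)
    candidate = urlsplit(candidate_url)
    # Assumption: in scope means same host and a path that starts with the seed's path.
    same_host = (seed.hostname or "").lower() == (candidate.hostname or "").lower()
    seed_path = seed.path or "/"
    candidate_path = candidate.path or "/"
    return same_host and candidate_path.startswith(seed_path)


if __name__ == "__main__":
    seed = "https://example.org/news/"
    for url in ("https://example.org/news/2024/story.html",
                "https://example.org/about/",
                "https://other.example.com/page"):
        print(url, in_default_scope(seed, url))
```

Under this simplified model, formatting a seed with or without a trailing path changes what a crawl would pick up, which is one reason seed formatting deserves attention before the first crawl.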

Crawling

After completing the readings in this section, you'll understand the different ways to crawl content, and how these decisions can impact your data budget:

It’s important to manage your data budget as you crawl. Only test crawls that have not been saved can be deleted; every other kind of crawl is permanent.
Types of crawls - There are different types of crawls to best support the Archive-It workflow. For all of these crawls, it's important to note that only New Data will be added to your account, due to our data deduplication feature.
- Test Crawls - Test crawls hold the data temporarily for 60 days, so you can review them before saving or deleting. Once saved, test crawls cannot be deleted. Because the data in a test crawl is held temporarily until it is saved, the only place to see its archived pages is the test crawl report's seed report.
- One time production crawls - The alternative to a test crawl is a production crawl. These permanent crawls automatically add data to your data budget and cannot be deleted.
- Scheduled crawls - Once you have run a test crawl, you’re ready to set up crawls that run on a schedule you choose. Please note that scheduled crawls will automatically add the data to your budget and cannot be deleted.
Crawling technologies - Archive-It has different crawling technologies that can (and should!) be used to capture different kinds of content.
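
To see why deduplication matters for your data budget, here is a minimal sketch of the general idea. It assumes content is compared by a hash of its bytes, which is an assumption for illustration only; the guide does not describe how Archive-It identifies duplicate data. The function name new_data_bytes and its parameters are hypothetical.

```python
import hashlib
from typing import Dict, Set


def new_data_bytes(previous_digests: Set[str], captured: Dict[str, bytes]) -> int:
    """Count only the bytes whose content has not been seen before.

    Hypothetical illustration of deduplication; not Archive-It's implementation.
    """
    seen = set(previous_digests)  # copy so the caller's set is not modified
    total = 0
    for url, body in captured.items():
        digest = hashlib.sha1(body).hexdigest()
        if digest not in seen:
            # Unseen content: this would count as New Data.
            total += len(body)
            seen.add(digest)
    return total


if __name__ == "__main__":
    earlier = {hashlib.sha1(b"unchanged page").hexdigest()}
    crawl = {
        "https://example.org/": b"unchanged page",
        "https://example.org/news": b"fresh article text",
    }
    print(new_data_bytes(earlier, crawl))
```

In this model, recrawling a site that has barely changed adds far less to the budget than the total size of the pages visited.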

Best Practices for first-time crawls include:

Select 10 or fewer seeds at a time (if possible)
Use the Standard seed type, unless you know you only want One-Page (avoid the "+" seed types for now)
Crawl Type - Test
Time Limit - 5 days
Crawling Technology - Brozzler
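
The recommendations above can be kept as a small pre-flight checklist. The sketch below is purely illustrative: CrawlPlan and first_crawl_warnings are hypothetical names, not part of any Archive-It tool or API, and the values simply mirror the list above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CrawlPlan:
    """Hypothetical record of the settings you intend to use for a crawl."""
    seed_count: int
    seed_type: str
    crawl_type: str
    time_limit_days: int
    technology: str


def first_crawl_warnings(plan: CrawlPlan) -> List[str]:
    """Compare a plan against the first-time crawl recommendations in this guide."""
    warnings = []
    if plan.seed_count > 10:
        warnings.append("Select 10 or fewer seeds at a time, if possible.")
    if plan.seed_type not in ("Standard", "One-Page"):
        warnings.append("Use the Standard or One-Page seed type for now.")
    if plan.crawl_type != "Test":
        warnings.append("Run the first crawl as a Test crawl.")
    if plan.time_limit_days != 5:
        warnings.append("Set the time limit to 5 days.")
    if plan.technology != "Brozzler":
        warnings.append("Use Brozzler as the crawling technology.")
    return warnings


if __name__ == "__main__":
    plan = CrawlPlan(seed_count=12, seed_type="Standard",
                     crawl_type="Test", time_limit_days=5, technology="Brozzler")
    for message in first_crawl_warnings(plan):
        print(message)
```

An empty list means the plan matches every recommendation; each warning points back to one line of the list.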

Reviewing & Quality Assurance

After completing the readings and videos in this section, you'll understand how to review your completed crawls and use the Quality Assurance tools:
Getting the most from your post-crawl reports - So you've run a crawl... now what? This ~7-minute video walks through each part of the crawl report, with insight into each of the three sub-reports and the information you can glean from them.
What should I check first in a crawl report?
Understanding your Hosts Report - This ~8-minute video digs into how to use the Hosts Report to learn what was and was not captured in a specific crawl.
Quality Assurance - What can you do if your archived websites don't look quite right in Wayback? Following this checklist may help you improve the capture and replay of your Wayback pages.

Description & Access

After completing the readings in this section, you'll understand the options for describing and sharing your content:

Metadata - There are multiple ways and places to describe your crawled content. Please note that adding the same metadata in multiple places will result in duplicate descriptions.
You can control access to your web archive so that patrons, the general public, and/or your Archive-It users can, or cannot, view specific content that you designate.
Access your archived pages from inside your account or via archive-it.org.
Use the suite of Archive-It APIs to provide access to your web archives from your own domain, and integrate with other tools and services.

Conclusion

Now that you have an overview of Archive-It, we encourage you to jump in! After you’ve run some test crawls, you can sign up for Office Hours, where a Web Archivist will be happy to chat with you.

And remember: web archiving is a learning experience! As you learn more about the sites that you collect, you can find and share tips here in the User Guide or Community Forum.
