Archive-It Workshop Series Part 1: Selection, Scoping, & Crawling

This is the first in a three-part Archive-It Workshop series for new Archive-It users. It was initially developed and presented to new members of the Community Webs cohort. Other workshops in this series include Reviewing & Quality Assurance, and How to find and use web archives.

Introduction

Setting up collections and running crawls is the functional part of web archiving using Archive-It. Understanding the practical steps and the technical considerations is key to an efficient and successful web archiving workflow. Please keep in mind that different websites may require different approaches to capture fully. Web archiving is an iterative process rather than a linear one; you might have to make adjustments and repeat the steps in this section more than once to get your desired results.

Objective

This training walks you through the steps of setting up a new collection, adding seeds, adjusting the scope, and running crawls.

After completing this training, you will have a better understanding of the technical considerations of how to best capture content, including formatting seeds, scoping crawls, and differences between crawl types and technologies.

Recorded Training

This workshop was presented on April 26th, 2021 as part of a training series for new Community Webs partners.

Materials

For this training you will need:

A collection topic
One or more websites you want to include in that collection
Your Archive-It account
The Archive-It Help Center

Learn

We're going to walk through the process of setting up, scoping, and running crawls. If you are brand new to your Archive-It account, you may benefit from the following video tour of the features you'll be using.

Watch Navigating Archive-It ▼

Setting up your Account

The Administration section is where you can set up and manage your overall account. This is also where you can add and edit users.

Setting Up a Collection

If you haven’t already, you will need to set up a collection (or collections) to begin archiving web content. Archive-It makes this step possible in just a few clicks. Please use the following links for instructions on Creating a New Collection and Managing Collection Settings. The collection name can be edited later.

Let’s Talk About Seeds

The websites and webpages you’ve already identified are going to be the “seeds” in your collection. Seeds point Archive-It’s crawlers to content on the live web to capture. They also point users of your public collections to content that has already been archived. To set up collections so that they archive the content you want and provide access to that content in a functional way, it’s worthwhile to understand the relationship between seeds and archived content. We recommend reading the article How Seeds, Documents, and Collections Work Together for more information.

Formatting Your Seed URLs

The way you format your seed URLs will help determine how much content Archive-It crawlers are able to capture. You can choose to archive an entire domain or website, a specific directory or subset of a website, or even just a specific page. Before adding seeds to your collection we recommend reading the article on how to format your seeds and how Archive-It crawlers determine scope to help you get the best results.
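As a rough illustration of why seed formatting matters, the sketch below models scope as "same host, same directory prefix as the seed." This is a simplified assumption for teaching purposes, not Archive-It's actual scoping logic: the function name `in_default_scope` is hypothetical, other hosts (including subdomains) are treated as out of scope here even though the real crawler may behave differently, and Seed Type settings are not modeled at all.

```python
from urllib.parse import urlsplit


def in_default_scope(seed: str, url: str) -> bool:
    """Simplified model (an assumption, not the real crawler's rules):
    a URL is in scope when it is on the seed's host and its path
    starts with the seed's directory path."""
    s, u = urlsplit(seed), urlsplit(url)
    if s.hostname != u.hostname:
        return False
    # Directory of the seed: everything up to and including the last "/".
    seed_dir = s.path[: s.path.rfind("/") + 1] or "/"
    return (u.path or "/").startswith(seed_dir)


# A whole-site seed versus a directory seed, checked against the same page.
page = "https://example.org/news/2021/story.html"
print(in_default_scope("https://example.org/", page))        # whole site
print(in_default_scope("https://example.org/news/", page))   # one directory
print(in_default_scope("https://example.org/about/", page))  # different directory
```

Under this model, the first two seeds would cover the page and the third would not, which is the kind of difference the seed formatting article explains in full.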

Adding and Managing Seeds

Adding new seeds to your collection is almost as easy as setting up a new collection. You can use the instructions on the Add Seeds page in the Archive-It Help Center.

Let’s talk about Seed Settings

There are a number of different settings you will be able to select for your seeds. When you add your seeds you will be given the option to select the following settings:

Access (Public or Private) - This dictates whether a seed appears as an access point in a public collection.

Frequency - This indicates how often you want this seed to be crawled. Don’t worry, selecting a frequency won’t start a crawl automatically; actually kicking off crawls requires a second step, which we’ll discuss later.

Seed Type - This is a scoping option that tells the crawler how much or how little of a web resource to crawl. We will go over this in more detail shortly.

A seed’s settings are easily editable. If you aren’t sure what to choose, using the default settings is fine for now! If you would like to make changes at any point, you can do so to an individual seed or in bulk by following the instructions on the Managing seeds page.

Tips for seed organization

It's not currently possible to move content between collections (although that is on the Archive-It development roadmap), so thinking through seed organization beforehand can save time later. Generally, it is helpful to group seeds from the same platform together in a collection, so you can add and update one set of scoping rules at the collection level rather than repeating the same rules across multiple collections. Examples might be a "Social Media" collection or an "Institutional Websites" collection. In addition, scheduled crawls can currently run on only one crawling technology, so it can be helpful to group seeds that require Brozzler together. Please also keep in mind that deleting seeds only removes them from the Archive-It interface (which is their access point); it does not remove the archived content.

There is a function to group seeds, which is helpful for internal seed management. However, for any public content, these groups also display on archive-it.org.

Setting expectations

There are some Known Web Archiving Challenges, and understanding these can help you set expectations and approach your crawling. We have two types of crawling technology: standard and Brozzler. Brozzler better handles dynamic content and there are certain situations where Brozzler is the explicitly recommended crawling technology.

Let's talk about scoping

Every crawl has a default set of bounds that determine scope. Adjusting this scope is how you can control what is, and is not, captured. Some sites need specific scoping rules in order to be fully archived, and it can be helpful to pre-scope seeds before they are crawled. When possible, automatic scoping rules are applied for your convenience. However, since it's not possible to add them in every case, there is scoping guidance for specific types of sites. Not adding the required scoping rules can result in a seed being incompletely captured, or in capturing too much irrelevant data.

There are two ways to add specific scoping rules: at the seed level and at the collection level. It's also possible to add scope rules in bulk. Changing the seed type is another way to adjust the scope.

Types of crawls

There are different types of crawls to best support the Archive-It workflow. For all of these crawls, it's important to note that only New Data will be added to your account, due to our data deduplication feature.

Test Crawls - Adding new seeds or scoping rules? You'll want to make sure you run a test crawl first! Once saved, test crawls cannot be deleted.
One time production crawls - The alternative to a test crawl is a production crawl. These permanent crawls automatically add data to your data budget and cannot be deleted.
Scheduled crawls - Once you have run a test crawl, you’re ready to set up crawls that run at a frequency you choose. Please note that scheduled crawls automatically add data to your budget and cannot be deleted.
PDF Only Crawls - A PDF-only crawl captures content, but only makes the PDFs available via the crawl report. You can run these as either test or production crawls. This video clarifies how to access the PDFs.
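To make the deduplication note above more concrete, here is a minimal sketch of the general idea: content that was already captured unchanged does not count again. The details are assumptions for illustration only. The function name `new_data_bytes`, the choice of SHA-1, and keying on the URL plus a content digest are hypothetical and are not a description of how Archive-It implements deduplication.

```python
import hashlib


def new_data_bytes(captures, seen):
    """Count only bytes from captures not seen before.

    captures: iterable of (url, payload_bytes) pairs from one crawl.
    seen: set of (url, digest) pairs from earlier crawls; updated in place.
    Keying on URL + SHA-1 digest is an illustrative assumption.
    """
    total = 0
    for url, payload in captures:
        key = (url, hashlib.sha1(payload).hexdigest())
        if key in seen:
            continue  # unchanged since an earlier crawl: not new data
        seen.add(key)
        total += len(payload)
    return total


seen = set()
first = [("https://example.org/", b"hello"), ("https://example.org/a", b"abc")]
second = [("https://example.org/", b"hello"), ("https://example.org/a", b"abcd")]
print(new_data_bytes(first, seen))   # both documents are new
print(new_data_bytes(second, seen))  # only the changed document counts
```

In this toy run, the second crawl adds only the bytes of the page that changed, which is why repeated crawls of mostly static sites tend to add less data than the first crawl.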

When and how to Integrate external WARCs

You can also upload web archive files that have originated elsewhere, to live alongside your Archive-It crawled content. Do keep in mind that because different technologies are at play, there is no guarantee that the indexed contents of these files, especially videos, will replay in Archive-It’s Wayback mode precisely as they do in other separate and/or legacy replay mechanisms.
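Before uploading files that originated elsewhere, a quick local sanity check can catch files that are not WARCs at all. The sketch below relies only on two facts about the format: WARC records begin with a version line such as `WARC/1.0`, and WARC files are commonly gzip-compressed. The helper name `looks_like_warc` is hypothetical, and the check says nothing about whether the contents will replay correctly in Archive-It's Wayback mode.

```python
import gzip


def looks_like_warc(path: str) -> bool:
    """Return True if the file's first line starts with "WARC/".

    Handles both plain and gzip-compressed files. This is only a
    first-line check, not a full validation of the file.
    """
    with open(path, "rb") as f:
        magic = f.read(2)
    # Gzip files start with the two magic bytes 0x1f 0x8b.
    opener = gzip.open if magic == b"\x1f\x8b" else open
    with opener(path, "rb") as f:
        return f.readline().startswith(b"WARC/")
```

A call such as `looks_like_warc("my-crawl.warc.gz")` (the file name is a placeholder) returns `True` or `False`, which can be used to filter a folder of candidate files before upload.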

To Do

Using the information you've learned in this section, complete the following steps:

If you haven't already, set up a new collection.
Add a seed (or seeds), using the Help Center search to determine whether any additional scoping rules are needed. Add any recommended rules.
Determine whether any automatic scoping rules were added.
Run a few test crawls on your seeds, experimenting with the number of seeds in each crawl, crawling technology, and crawl settings to understand the ways that different settings can have different outcomes.

The next training will cover reviewing completed crawls for Quality Assurance. Before moving on, please have at least one saved test or production crawl ready to review.
