Welcome to your Archive-It account! Setting up collections and crawling content is the functional part of web archiving. Websites can vary in their infrastructure and content, requiring different approaches to capture. Understanding the practical steps and the technical considerations is key to an efficient and successful web archiving workflow. Please keep in mind that web archiving is an iterative process rather than a linear one; you might have to run a crawl more than once to adjust scoping.
On this page:
- To Do
- Recorded Training
This training walks you through the steps of setting up a new collection, adding seeds, adjusting the scope, and running crawls.
After completing this training, you will have a better understanding of the technical considerations of how to best capture content, including formatting seeds, scoping crawls, and differences between crawl types and technologies.
- An idea for a collection topic and a list of websites you want to include in that collection
- The Archive-It Help Center
- Your Archive-It account
For this training you will want to have a set of websites or URLs in mind that you would like to use to set up your first collection. These are the URLs that you will use as Seeds in your first collection. As a first step, please watch this overview video to take a tour of your Archive-It account: https://archive.org/details/NavigatingArchiveIt1.
Setting up your Account
The Administration section is where you can set up and manage your overall account. This is also where you can add and edit users.
Setting Up a Collection
If you haven’t already, you will need to set up a collection (or collections) to begin archiving web content. Archive-It makes this step possible in just a few clicks. Please use the following links for instructions on Creating a New Collection and Managing Collection Settings. The collection name can be edited later.
Let’s Talk About Seeds
The websites and webpages you’ve already identified are going to be the “seeds” in your collection. Seeds point Archive-It’s crawlers to content on the live web to capture; they also point users of your public collections to content that has already been archived. To set up collections so that they archive the content you want and provide functional access to it, it’s worthwhile to understand the relationship between seeds and archived content. We recommend reading the article How Seeds, Documents, and Collections Work Together for more information.
Formatting Your Seed URLs
The way you format your seed URLs will help determine how much content Archive-It crawlers are able to capture. You can choose to archive an entire domain or website, a specific directory or subset of a website, or even just a specific page. Before adding seeds to your collection we recommend reading the article on how to format your seeds and how Archive-It crawlers determine scope to help you get the best results.
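As a rough illustration of why seed formatting matters, the sketch below models the three choices described above (whole site, directory, single page) as a simple host-and-path-prefix check. This is a simplified assumption for teaching purposes, not Archive-It's actual scoping logic; the function names `scope_prefix` and `in_scope` and the `example.org` URLs are hypothetical, and the Help Center article remains the authoritative description.

```python
from urllib.parse import urlsplit


def scope_prefix(url):
    """Return (host, path) for a URL, ignoring scheme and a leading 'www.'."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # An empty path is treated the same as the site root.
    path = parts.path or "/"
    return host, path


def in_scope(seed, url):
    """Simplified model (an assumption, not Archive-It's real rules):
    - seed path ending in '/' -> everything under that directory is in scope
      (a seed of the site root therefore covers the whole site)
    - seed path without a trailing '/' -> only that exact page is in scope
    """
    seed_host, seed_path = scope_prefix(seed)
    host, path = scope_prefix(url)
    if host != seed_host:
        return False
    if seed_path.endswith("/"):
        return path.startswith(seed_path)
    return path == seed_path


if __name__ == "__main__":
    candidates = [
        "https://example.org/about/team.html",
        "https://example.org/news/2021/item.html",
        "https://other.org/",
    ]
    for seed in ("https://example.org/", "https://example.org/news/"):
        for url in candidates:
            print(seed, url, in_scope(seed, url))
```

In this model, changing only the trailing portion of the seed changes which discovered links count as in scope, which is the reason to decide on the seed format before adding it.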
Adding and Managing Seeds
Adding new seeds to your collection is almost as easy as setting up a new collection. You can use the instructions on the Add Seeds page in the Archive-It Help Center.
Let’s talk about Seed Settings
There are a number of different settings you will be able to select for your seeds. When you add your seeds you will be given the option to select the following settings:
- Access (Public or Private) - This dictates whether a seed appears as an access point in a public collection.
- Frequency - This indicates how often you want this seed to be crawled. Don’t worry, selecting a frequency does not start a crawl automatically; actually kicking off the crawls requires a second step, which we’ll discuss later.
- Seed Type - This is a scoping option that helps tell the crawler how much or how little of a web resource to crawl (we will go over this in more detail shortly).
A seed’s settings are easily editable. If you aren’t sure what to choose, using the default settings is fine for now! If you would like to make changes at any point, you can do so to an individual seed or in bulk by following the instructions on the Managing seeds page.
Tips for seed organization
It's not currently possible to move content between collections, although that is on the Archive-It development roadmap, so thinking through seed organization beforehand can be a timesaver later. Generally, it is helpful to group platforms together in a collection, so you can add and update one set of scoping rules at the collection level, rather than the same rules across multiple collections. An example of this might be a "Social Media" collection, or an "Institutional Websites" collection. In addition, it is currently possible for scheduled crawls to run on only one crawling technology, so it can be helpful to group seeds that require Brozzler together. Please also keep in mind that deleting seeds only removes them from the Archive-It interface (which is their access point), but does not remove the archived content.
There is a function to group seeds, which is helpful for internal seed management; however, for any public content, these groups also display on archive-it.org.
There are some Known Web Archiving Challenges, and understanding these can help you set expectations and approach your crawling. We have two types of crawling technology: standard and Brozzler. Brozzler better handles dynamic content and there are certain situations where Brozzler is the explicitly recommended crawling technology.
Let's talk about scoping
Every crawl has a default set of bounds that determine scope. Adjusting this scope is how you can control what is, and is not, captured. Some sites need specific scoping rules in order to be fully archived, and it can be helpful to pre-scope seeds before they are crawled. When possible, automatic scoping rules are applied for your convenience. However, since it's not possible to add them in every case, there is scoping guidance for specific types of sites; not adding the required scoping rules can result in a seed not being fully captured, or in the crawl capturing too much irrelevant data.
Types of crawls
There are different types of crawls to best support the Archive-It workflow. For all of these crawls, it's important to note that only New Data will be added to your account, due to our data deduplication feature.
- Test Crawls - Adding new seeds or scoping rules? You'll want to make sure you run a test crawl first! Once saved, test crawls cannot be deleted.
- One-time production crawls - The alternative to a test crawl is a production crawl. These permanent crawls automatically add data to your data budget and cannot be deleted.
- Scheduled crawls - Once you have run a test crawl, you’re ready to set crawls to run on a schedule of your choosing. Please note that scheduled crawls will automatically add the data to your budget and cannot be deleted.
- PDF Only Crawls - A PDF-only crawl captures content, but only makes the PDFs available via the crawl report. You can run these as either test or production crawls. This video clarifies how to access the PDFs.
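To make the deduplication note above more concrete, here is a minimal sketch of how counting only new data could work if each captured document were compared by URL and content hash against earlier captures. The comparison method, the function name `new_data_bytes`, and the sample URLs are assumptions for illustration; the source does not describe how Archive-It implements deduplication.

```python
import hashlib


def new_data_bytes(seen, captured):
    """Count the bytes of captures not already present in `seen`.

    seen     -- set of (url, sha1 hex digest) pairs from earlier crawls;
                it is updated in place with the new captures
    captured -- iterable of (url, body_bytes) pairs from the current crawl

    Assumption: a document is a duplicate when the same URL returns a body
    with the same SHA-1 digest as a previous capture.
    """
    total = 0
    for url, body in captured:
        key = (url, hashlib.sha1(body).hexdigest())
        if key in seen:
            continue  # unchanged since an earlier crawl, so not counted again
        seen.add(key)
        total += len(body)
    return total


if __name__ == "__main__":
    seen = set()
    first_crawl = [
        ("https://example.org/", b"<html>v1</html>"),
        ("https://example.org/a.pdf", b"%PDF-1.4 sample"),
    ]
    second_crawl = [
        ("https://example.org/", b"<html>v2</html>"),       # page changed
        ("https://example.org/a.pdf", b"%PDF-1.4 sample"),  # PDF unchanged
    ]
    print("first crawl new bytes:", new_data_bytes(seen, first_crawl))
    print("second crawl new bytes:", new_data_bytes(seen, second_crawl))
```

Under these assumptions, a repeat crawl of a mostly unchanged site adds only the changed documents, which is why repeated or scheduled crawls can add less data than the first one.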
When and how to Integrate external WARCs
You can also upload web archive files that have originated elsewhere, to live alongside your Archive-It crawled content. Do keep in mind that because different technologies are at play, there is no guarantee that the indexed contents of these files, especially videos, will replay in Archive-It’s Wayback mode precisely as they do in other separate and/or legacy replay mechanisms.
Using the information you've learned in this section, complete the following steps:
- If you haven't already, set up a new collection.
- Add a seed (or seeds), using the Help Center search to determine whether any additional scoping rules are needed. Add any recommended rules.
- Determine whether any automatic scoping rules were added.
- Run a few test crawls on your seeds, experimenting with the number of seeds in each crawl, crawling technology, and crawl settings to understand the ways that different settings can have different outcomes.
The next training will cover reviewing completed crawls for Quality Assurance. Before moving on, please have at least one saved test or production crawl ready to review.
This workshop was presented as part of a training series for new Community Webs partners. Recorded April 26, 2021.