Estimated reading and video time: 90 minutes
Welcome to Archive-It! Archive-It is a web archiving service where you can curate your own content and run your own crawls, then describe and share that content as you wish. This guide is a curated collection of the Help Center videos and articles that are most important for using Archive-It; further information is available elsewhere in the Help Center, but is not essential for starting out. We highly recommend working through the information on this page before jumping into your Archive-It account.
On this page:
- Getting Started
- Collections
- Scoping
- Crawling
- Reviewing & Quality Assurance
- Description & Access
Getting Started
After completing the readings and video in this section, you'll understand the Archive-It approach to web archiving and have a high-level view of Archive-It.
- What is web archiving?
- Navigating Archive-It - This ~9 minute video will help you get acquainted with the Archive-It web application, pointing out where features are located and why you might want to use them.
- Each partner is responsible for monitoring their account's data budget.
Please keep in mind that web archiving is a process; you can repeat steps if necessary.
Collections
Archive-It is primarily organized around collections. After completing the readings in this section, you'll understand how to build collections.
- Terms to know:
- Collection - A group of archived web documents curated around a common theme, topic, or domain.
- Seed - An item in Archive-It with a unique ID number. The seed URL is both the starting point for crawlers and an access point to archived content.
- Document - Any file with a unique URL: HTML, image, PDF, video, etc. These are designated separately from seeds, although seeds can be documents as well.
- The archived content in each collection is independent of other collections, including other Archive-It collections and the Wayback Machine. Learn how seeds, documents, and collections work together.
- Create and manage a collection.
- Select Seed URLs.
- Known web archiving challenges - While Archive-It continuously works towards improvements, some websites are not created in a way that is "archive-friendly" and can be difficult to capture or replay in their entirety. Knowing these limitations up front can help you set your expectations.
Scoping
After completing the readings in this section, you'll understand the multiple ways to adjust scope to capture or exclude content, and which approach is best to use when.
- Each Archive-It crawl has a default scope to keep it from capturing the entire web: by default, content embedded on a page (images, stylesheets, etc.) is captured, but links out to other sites are not followed. You can adjust the default scope at the collection or seed level (see the sketch after this list).
- Some sites have automatic scoping rules applied for your convenience.
- Pre-crawl Scoping - Make sure your seeds are set up correctly before you start your crawls. In this ~7 minute video you'll learn tips for selecting, formatting, and administering your seed URLs before you run a crawl to help capture the data you're looking for.
- Scoping guidance for specific types of sites - Some platforms need specific scoping rules for full capture and replay. Learn about our most up to date guidance for archiving popular social media services and other commonly used platforms.
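To make the directory-based default scope more concrete, here is a simplified sketch of how a crawler might decide whether to follow a link from a seed. This is illustrative only, not Archive-It's actual crawler logic, and the example.org URLs are hypothetical; embedded resources such as images and stylesheets are captured regardless of host, so this check applies only to followed links.

```python
from urllib.parse import urlparse

def in_default_scope(seed_url: str, candidate_url: str) -> bool:
    """Illustrative only: a followed link is in scope when it is on the same
    host as the seed and sits under the seed's directory path."""
    seed, cand = urlparse(seed_url), urlparse(candidate_url)
    seed_dir = seed.path.rsplit("/", 1)[0] + "/"  # directory containing the seed
    return cand.netloc == seed.netloc and cand.path.startswith(seed_dir)

# With the hypothetical seed https://example.org/blog/ :
print(in_default_scope("https://example.org/blog/", "https://example.org/blog/post-1"))    # True
print(in_default_scope("https://example.org/blog/", "https://example.org/news/latest"))    # False (outside the seed directory)
print(in_default_scope("https://example.org/blog/", "https://another-site.example/page"))  # False (different host)
```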
Crawling
After completing the readings in this section, you'll understand the different ways to crawl content, and how these decisions can impact your data budget.
- It’s important to manage your data budget as you crawl: only unsaved test crawls can be deleted; production and scheduled crawls cannot.
- Types of crawls - There are different types of crawls to best support the Archive-It workflow. For all of these crawls, it's important to note that only new data will be added to your account, thanks to our data deduplication feature (see the sketch after this list).
- Test Crawls - Test crawls hold their data temporarily for 60 days, so you can review them before saving or deleting. Once saved, a test crawl cannot be deleted. Because the data in a test crawl is held temporarily until it is saved, the only place to see its archived pages is the test crawl report's seed report.
- One time production crawls - The alternative to a test crawl is a production crawl. These permanent crawls automatically add data to your data budget and cannot be deleted.
- Scheduled crawls - Once you have run a test crawl, you're ready to set crawls to run on a recurring schedule. Please note that scheduled crawls automatically add data to your budget and cannot be deleted.
- Crawling technologies - Archive-It has different crawling technologies that can (and should!) be used to capture different kinds of content.
- What is Brozzler, and when to use it - Brozzler is our newest crawling technology, built at the Internet Archive.
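As a rough illustration of why only new data counts against your budget, the sketch below mimics deduplication by content digest. It is a conceptual toy, not Archive-It's implementation, and the URLs and payloads are hypothetical.

```python
import hashlib

previously_stored: dict[str, str] = {}  # digest -> URL where that content was first archived
new_bytes = 0

def archive(url: str, payload: bytes) -> None:
    """Toy model: content already archived adds nothing to the data budget;
    only previously unseen content counts as new data."""
    global new_bytes
    digest = hashlib.sha1(payload).hexdigest()
    if digest in previously_stored:
        print(f"{url}: unchanged since an earlier crawl, 0 new bytes")
    else:
        previously_stored[digest] = url
        new_bytes += len(payload)
        print(f"{url}: new content, {len(payload)} bytes added to the budget")

archive("https://example.org/about/", b"<html>About us</html>")  # first crawl: counts
archive("https://example.org/about/", b"<html>About us</html>")  # re-crawled, unchanged: free
print(f"Total new data this crawl: {new_bytes} bytes")
```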
Reviewing & Quality Assurance
After completing the readings and videos in this section, you'll understand how to review your completed crawls and use the Quality Assurance tools.
- Getting the most from your post-crawl reports - So you've run a crawl... now what? This ~7 minute video walks through each part of the crawl report, with insight into each of the three sub-reports and the information you can glean from them.
- What should I check first in a crawl report?
- Understanding your Hosts Report - This ~8 minute video digs into how to use the Host Report to learn what was and was not captured in a specific crawl.
- Quality Assurance - What can you do if your archived websites don't look quite right in Wayback? Following this checklist may help you improve the capture and replay of your Wayback pages.
Description & Access
After completing the readings in this section, you'll understand the options for describing and sharing your content.
- Metadata - There are multiple ways and places to describe your crawled content. Please note that adding the same metadata in multiple places will result in duplicate descriptions.
- You can control access to your web archive so that patrons, the general public, and/or your Archive-It users can, or cannot, view specific content that you designate.
- Access your archived pages from inside your account or via archive-it.org.
- Use the suite of Archive-It APIs to provide access to your web archives from your own domain and integrate with other tools and services (a minimal example follows).
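For example, you can list the WARC files your crawls have generated through the WASAPI data transfer endpoint. This is a minimal sketch assuming the endpoint and field names described in the Help Center's WASAPI documentation; the collection ID and credentials are placeholders for your own.

```python
import requests

WASAPI_URL = "https://partner.archive-it.org/wasapi/v1/webdata"

# Placeholders: substitute your own Archive-It login and a real collection ID.
response = requests.get(
    WASAPI_URL,
    params={"collection": 12345},
    auth=("your-username", "your-password"),
)
response.raise_for_status()

# Each entry describes one WARC file produced by your crawls.
for warc in response.json().get("files", []):
    print(warc["filename"], warc["size"])
```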
Now that you have an overview of Archive-It, we encourage you to jump in! After you’ve run some test crawls, you can sign up for Office Hours, where a Web Archivist will be happy to chat with you.
And remember: web archiving is a learning experience! As you learn more about the sites that you collect, you can find and share tips here in the User Guide or in the Community Forum.