Welcome to Archive-It! Archive-It is a platform that lets you control your data by curating your own content, running your own crawls, and then describing and sharing that content as you wish. This page gives you an overview of how web archiving works with Archive-It through a curated collection of videos and articles from our Help Center. We highly recommend working through the trainings on this page before jumping into your Archive-It account. Further information elsewhere in the Help Center can be helpful on your web archiving journey, but it is not integral to starting out.
- Getting Started
- What is web archiving?
- Navigating Archive-It - This video will help you get acquainted with the Archive-It web application, pointing out where features are located and why you might want to use them.
- What to know about monitoring your data budget.
Keep in mind that web archiving is an iterative process rather than a linear one, so you can move through these stages as you need, even repeating some if necessary.
- Collections
- Terms to know:
- Collection - A group of archived web documents curated around a common theme, topic, or domain.
- Seed - An item in Archive-It with a unique ID number. The Seed URL is the starting point for crawlers, as well as an access point to archived content.
- Document - Any file with a unique URL - HTML, image, PDF, video, etc. These are designated separately from seeds, although seeds can be documents as well.
- How seeds, documents, and collections work together
- Create and manage a collection
- Select Seed URLs
- Known web archiving challenges - While Archive-It continuously works towards improvements, some websites are not created in a way that is "archive-friendly" and can be difficult to capture or replay in their entirety. Knowing these limitations up front can help you set your expectations.
- Metadata - There are multiple ways and places to add metadata. Explore these options here.
- Scoping - Each Archive-It crawl has a default set of bounds to prevent the crawls from capturing the entire web. There are multiple ways to adjust scope to include or exclude content.
- Some sites have automatic scoping rules applied for your convenience.
- Pre-crawl Scoping - Make sure your seeds are set up correctly before you start your crawls. In this video you'll learn tips for selecting, formatting, and administering your seed URLs before you run a crawl to help capture the data you're looking for.
- Scoping guidance for specific types of sites - Some platforms need specific scoping rules for full capture and replay. Learn about our most up to date guidance for archiving popular social media services and other commonly used platforms.
- Crawling - Learn about the different ways to crawl content. Please note that this section includes important information on how data is added to your account.
- Types of crawls - There are different types of crawls to best support the Archive-It workflow. For all of these crawls, it's important to note that only New Data will be added to your account, due to our data deduplication feature.
- Test Crawls - Adding new seeds or scoping rules? You'll want to make sure you run a test crawl first! This video will give you tips on how and why to run test crawls in your collections. Once saved, test crawls cannot be deleted.
- One time production crawls - The alternative to a test crawl is a production crawl. These permanent crawls automatically add data to your data budget and cannot be deleted.
- Scheduled crawls - Once you have run a test crawl, you’re ready to set up crawls that run on a schedule you choose. Please note that scheduled crawls will automatically add the data to your budget and cannot be deleted.
- PDF Only Crawls - Learn how to run a PDF Only crawl and how to access the archived PDFs.
- Crawling technologies - Archive-It has different crawling technologies that can (and should!) be used to capture different kinds of content.
- What is Brozzler, and when to use it - Brozzler is our newest crawling technology, built at the Internet Archive.
- It’s important to manage your data budget as you crawl. Only unsaved test crawls can be deleted; no other kind of crawl can be deleted.
- Reviewing & Quality Assurance
- Getting the most from your post-crawl reports - So you've run a crawl... now what? This video walks through each post-crawl report, explaining why it is necessary and what information you can glean from it.
- Understanding your Hosts Report - The Hosts Report is a wealth of information! Find out all of the different ways you can use it to identify crawler traps, block hosts, add data limits, run patch crawls, and more.
- Quality Assurance - What can you do if your archived websites don't look quite right in Wayback? Following these steps may help you improve the capture and replay of your Wayback pages.
- Access
- Control access to your web archive - You can control access to your web archive so that patrons, the general public, and/or your Archive-It users can or cannot view specific content that you designate.
- Public content / archive-it.org
- Inside your account
- Other options
Now that you have an overview of Archive-It, we encourage you to jump in! After you’ve run some test crawls, you can sign up for Office Hours and a Web Archivist would be happy to chat with you.
And remember: web archiving is a learning experience! As you learn more about the sites that you collect, you can find and share tips here in the User Guide or Community Forum.