Archive-It 1.0 Release Notes

Administration
Documentation
Creating Collections
Collection Management
Collection Monitoring
Access
Features in Development

Administration

-Archive-It will assign one system administrator user account (name and password) to each institution. The system administrator can then create up to 10 user accounts within the institution. All user accounts (including the system administrator) have full access to the application (they can add, modify, and disable seeds, etc.). The ability to create "read-only" user accounts will be available later this year.

-Internet Archive has hired a partner specialist, Molly Bragg, to support Archive-It subscribers as needed. You can reach her by email at archive-itsupport at archive.org.

Documentation

-A limited help section is available when you click the "help" button at the top of the screen while you are logged in. For this release, the help documentation includes a glossary of commonly used web archiving and Archive-It terms, information about the Dublin Core metadata standard, advanced search options, a document on seed selection and scope, and these release notes. More information will be added as it becomes available. Feel free to send suggestions for more help topics to archive-itsupport at archive.org.

-The following reports are available to aid your institution in post-crawl analysis:

-Host Report: specifies each host crawled and how many URLs were collected from each host as well as the total byte size of the collected documents.

-MIME type report: lists all MIME types collected and how many URLs were captured for each type as well as total byte size of the archived URLs per MIME type.

-Seed status report: shows whether each seed was crawled and, if not, why (e.g., blocked by robots.txt).

-All reports can be downloaded as tab-delimited text files, which can be imported into Excel.

-A new seed report will be available before the April Archive-It release. For each seed, this report will list the unique hosts and number of URLs archived per host in relation to the seed.
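
Because the reports are downloadable as tab-delimited text, they can also be read with a short script instead of a spreadsheet. The following is a minimal sketch only: the column names (host, urls, bytes) and the sample rows are assumptions made for illustration, and the actual headers in a downloaded report may differ.

```python
import csv
import io


def read_report(text):
    """Parse tab-delimited report text into a list of dicts keyed by header."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


# Hypothetical sample data; real reports may use different column names.
sample = (
    "host\turls\tbytes\n"
    "example.org\t120\t3400000\n"
    "example.com\t45\t910000\n"
)

rows = read_report(sample)
total_urls = sum(int(row["urls"]) for row in rows)
print(total_urls)
```

For a downloaded file, the same function can be given the file's contents, for example the result of open(path, newline="").read().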

Creating Collections

-Each institution has a total budget of 300 seeds, which can be distributed across one to three collections.

-Crawl frequency options:

-seeds and/or collections can be crawled daily, weekly, monthly or quarterly

-you can set the crawl frequency per collection (all seeds in one collection are crawled at the same frequency)

-crawl frequency can also be set on a per-seed basis (in other words, every seed can have a different crawl frequency regardless of collection).

-If you have your seeds set to different frequencies, you will see separate windows (per frequency selected) per crawl instance under Monitor (see below).

-"One hop off" crawling is NOT an enabled feature of the crawler. The crawler will capture all embedded documents, but will not follow links to hosts that are not in scope.
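
The seed budget and crawl frequency rules above can be sketched as a small planning check. This is an illustration only and not part of the Archive-It application: the function name, the dictionary layout, and the example URLs are all hypothetical. It verifies the limits of 300 seeds and one to three collections, then groups seeds by their effective frequency, which roughly mirrors the separate per-frequency windows shown under Monitor.

```python
from collections import defaultdict

MAX_SEEDS = 300          # total seed budget stated in these notes
MAX_COLLECTIONS = 3      # maximum number of collections
FREQUENCIES = {"daily", "weekly", "monthly", "quarterly"}


def check_plan(collections):
    """Validate a crawl plan and group seed URLs by effective frequency.

    collections maps a collection name to a dict with:
      "frequency": the collection-level crawl frequency
      "seeds": a dict of seed URL -> per-seed frequency override, or None
    (This layout is a hypothetical one chosen for the sketch.)
    """
    if not 1 <= len(collections) <= MAX_COLLECTIONS:
        raise ValueError("plan must have between 1 and 3 collections")
    total = sum(len(c["seeds"]) for c in collections.values())
    if total > MAX_SEEDS:
        raise ValueError(f"{total} seeds exceeds the budget of {MAX_SEEDS}")

    groups = defaultdict(list)
    for coll in collections.values():
        for url, override in coll["seeds"].items():
            # A per-seed setting takes precedence over the collection setting.
            freq = override or coll["frequency"]
            if freq not in FREQUENCIES:
                raise ValueError(f"unknown crawl frequency: {freq}")
            groups[freq].append(url)
    return dict(groups)


plan = {
    "State agencies": {
        "frequency": "weekly",
        "seeds": {"http://a.example/": None, "http://b.example/": "daily"},
    },
}
print(check_plan(plan))
```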

Collection Management

-Currently, the following metadata fields are available for cataloging at the collection and seed level:

-Title
-Description
-Identifier
-Subject
-Coverage
-Language
-Publisher
-Collector - generally the institution doing the crawl
-Annotation - a place for general comments

-You can enter multiple subjects in the "subject" catalog field by separating them with semicolons (e.g., California; agriculture). In the coming months the Archive-It team will conduct a survey on preferred standard subject headings to be added as a default list (in addition to custom fields). The results will be incorporated into the next release of the application.
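
If you later process the subject field outside the application, the semicolon-separated value can be split back into individual subjects. A minimal sketch, assuming the field is available as a plain string; the function name is hypothetical:

```python
def split_subjects(value):
    """Split a semicolon-separated subject string into trimmed, non-empty subjects."""
    return [part.strip() for part in value.split(";") if part.strip()]


print(split_subjects("California; agriculture"))
```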

-You can sort seeds in a collection by metadata field, URL, and crawl frequency in the seeds management interface. Sorts cannot be saved yet, but if you have a sort in place when you log out, the same sort will be applied when you log back in. Sort settings are specific to each collection.

-You can disable and/or hide from view individual seeds for a collection.

-For the April release (if not sooner), we will incorporate the remaining elements from the Dublin Core Metadata Element Set, version 1.1. An XML feed of the metadata will be available so you can add it to your own institution's catalogs. The remaining metadata fields to add are:

-Date
-Type
-Format
-Source
-Relation
-Rights

-Additionally, for the April release, automatically harvested and duplicate information can be set to auto-populate metadata fields across your seeds.

-Currently we are investigating a way to make the metadata fields searchable as part of the general text search engine.

-The application will let you know when your next crawls are scheduled to begin under the "manage" interface.

Collection Monitoring

-When you log in, you will see the cumulative total of crawled documents as well as the percentage of the crawl budget used by your institution.

Access

Search:

-Each search result will display the last date of capture.

-Search results show only the latest relevant capture for a given query, with at most 2 results per host; all duplicates have been removed from the index.

-Advanced text search options (these work only from the public search UI):

-Boolean "and" is the default for multiple terms, and a minus sign excludes a term. For example, entering Pope Rome returns only results containing both Pope and Rome, while Pope -Rome returns results containing Pope but not Rome.

-You can limit search results to a specified date range, written as two 14-digit timestamps (year, month, day, time) joined by a hyphen. For example, 20051204000000-20051206000000 pope will return all documents in the collection captured from Dec 4, 2005 to Dec 6, 2005 GMT that contain the search term pope.

-You can reorder search results by date by adding &sort=date to the end of the search query.

-To sort results by descending date order add the following to your search query: &sort=date&reverse=true

-You can search for file type by adding type:xxx (xxx stands for file extension type) to your search query.

-You can search for a specific host by entering host:xxx (where xxx is the name of the host).
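
The advanced search options above can be combined when a query is assembled in a script. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and it does not build a full URL because the search endpoint and its query parameter are not given in these notes. It returns the query text and the sort suffix separately so they can be appended wherever the public search UI expects them.

```python
from datetime import datetime


def date_range(start, end):
    """Format two datetimes (assumed to be GMT) as yyyymmddhhmmss-yyyymmddhhmmss."""
    return f"{start:%Y%m%d%H%M%S}-{end:%Y%m%d%H%M%S}"


def build_query(terms, start=None, end=None, file_type=None, host=None,
                sort_by_date=False, reverse=False):
    """Return (query_text, url_suffix) using the operators described above."""
    parts = []
    if start is not None and end is not None:
        parts.append(date_range(start, end))   # date range comes first, as in the example
    parts.extend(terms)                        # multiple terms are ANDed by default
    if file_type:
        parts.append(f"type:{file_type}")      # restrict by file extension type
    if host:
        parts.append(f"host:{host}")           # restrict to one host
    suffix = ""
    if sort_by_date:
        suffix = "&sort=date"
        if reverse:
            suffix += "&reverse=true"
    return " ".join(parts), suffix


query, suffix = build_query(
    ["pope"],
    start=datetime(2005, 12, 4),
    end=datetime(2005, 12, 6),
    sort_by_date=True,
    reverse=True,
)
print(query)
print(suffix)
```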

Wayback:

-The collection name assigned by the institution is displayed in the Wayback interface.

-Current collections are linked to historical collections in the general Wayback by clickable links in the Wayback search results.

Features in Development

-The Archive-It team is currently building an online help section for subscribers, including helpful how-tos and a glossary. Some of this will be included in the February release and more content will be added as it becomes available.

-Template web pages for partners to show their collections on http://www.archive-it.org/ will be available before the April release.

-Potential tools for subscriber feedback, discussion and general user community development such as forums, blogs, and lists are being investigated. Pilot partners will be contacted for their ideas in February/March.

-Archive-It is developing a way to filter search results by metadata fields and other advanced search options.

-Metadata changes mentioned above in Collection Management will be live by April. Essentially metadata fields will comply with the Dublin Core Metadata Element Set, version 1.1.

-New links will be added to the public site so users can easily access the most viewed, most recently added, and other featured collections.

-User stats will be available per collection before April.

-A seed report will outline how many documents and what hosts were captured from your seed URLs.

-A bookmark-type interface for storing sites of interest from the live web before adding them to your seed list is in development.
