Archive-It 3.0
March 3, 2009
Administration
In order to better support our partners and streamline our processes, we will now hold regularly scheduled Archive-It application trainings twice a month. Anyone using the application is welcome to attend these sessions. A calendar of upcoming trainings and sign-up information is available on the Archive-It help wiki here: http://webteam.archive.org/confluence/x/gQAO
Please enter your email address into the Archive-It "settings" page for your user account. We have added a new "forgot password" feature to the login page and it relies on your email address.
Your usernames and passwords are now case sensitive. So if you have a capital letter in your username or password, please enter it as such when logging in. If you are having any trouble logging in, please contact the Archive-It partner specialist at archive-itsupport at archive.org.
New Features in 3.0
We are releasing our discovery and selection tool, Scope-It, as a Beta to our partners for a limited time. We encourage you to test Scope-It and use it to more effectively manage your collections. We will be continuing development and enhancement on this tool and welcome your feedback and input on any issues, bugs, or other concerns before we finalize the tool.
Scope-It is a discovery tool allowing Archive-It partners to more effectively scope new and existing collections. Additionally, Scope-It allows partners to analyze crawl data and remove select hosts from being captured in the future.
Scope-It is different from the existing Archive-It test crawl feature in two key ways. First, Scope-It includes an analysis component called the scope-it crawl explorer. In the existing test crawl feature, partners can see what would have been captured; however, they are not able to make scoping changes to the crawl. The crawl explorer in Scope-It allows partners to easily remove hosts from all future crawls for a collection.
The other key difference between Scope-It and a test crawl is that partners can use Scope-It on a set of seeds before they formally set up a collection. To use the existing test crawl feature a collection must already exist.
There are two ways to use Scope-It (instructions for both methods are included further below):
1) The tool can scope a new set of seeds before creating a new collection.
2) The tool can be used to analyze completed crawls in existing collections and revise scope for future crawls.
Using Scope-It for New Collections
Partners can use Scope-It before they create a collection. This will allow them to prevent designated hosts from ever being archived. During this new collection process, partners will submit seeds to run in a Scope-It crawl. This crawl will run just like an existing test crawl and gather no data, but produce all post crawl reports. When the crawl is complete, partners can select and remove individual hosts from future crawls using the scope-it crawl explorer. As a final step, partners will create a collection from the seeds and accompanying crawling rules that they just specified. This collection will then be scheduled for ongoing crawls per the partner's specifications.
Below, please find step-by-step instructions for using Scope-It in coordination with new collections.
1. Select Scope-It from the collections drop-down menu.
2. From the Scope-It main landing page, select crawl seeds and import to create a new collection using Scope-It.
3. Add the seeds you would like to crawl when prompted.
4. Specify how long the initial crawl should last. Please note that normal crawl duration does vary depending on the frequency selected. Therefore, if a partner would like to truly simulate a full crawl instance, they will need to select the Scope-It crawl duration that matches their frequency.
Partners can also specify a limit on the number of documents to be crawled in the Scope-It crawl instead of choosing a crawl duration.
5. Start the crawl by clicking start crawl. Partners can monitor the Scope-It crawl from the current or Scope-It crawls tabs.
6. Analyze and exclude hosts using the scope-it crawl explorer. To find the completed crawl, find your crawl on the scope-it crawls tab (crawls → scope-it crawls). Then click scope-it crawl explorer next to the crawl you would like to analyze. Once selected, the crawl data will be imported into the scope-it crawl explorer. The import process will take 30 minutes or less for most crawls. However, crawls over 2.5 million documents (URLs) can take up to 12-24 hours to import into the crawl explorer.
The scope-it crawl explorer will list all hosts discovered during the crawl. Next to each host you will see a number referring to the number of documents (URLs) discovered for that host. Click each host to view the types of files discovered (see figure 5).
To the left of each listed host you will see an exclude link. To completely remove the listed host from future crawls, click the exclude link.
Once you have selected all hosts you want excluded in future crawls, you can review your choices in the excluded hosts tab.
7. Create a collection from the seeds and exclusions you just designated by clicking the create collection using these exclusions link.
You will then be walked through the general collection creation steps. Complete the process as you normally would, entering the default crawl frequency and initial metadata for the collection. The application will include the seeds used for the initial Scope-It crawl in your new collection.
Running Scope-It on Existing Collections
Partners can use Scope-It to analyze previously completed crawls in existing collections, and make decisions about scoping for collections that already exist using the scope-it crawl explorer. Once partners have excluded hosts in the crawl explorer, the rules will then be applied to the existing collection and take effect during the next scheduled crawl. The new settings will only affect the collection you have been analyzing; the rules will not apply to other collections.
Below, please find step-by-step instructions for running scope-it on existing collections.
1. From the Scope-It landing page (collections → scope-it), click Import Completed Crawl.
2. Select the crawl you would like to analyze. You will see a list of completed crawls; choose the crawl you want to analyze by clicking the "select" link next to that crawl.
The import process will take 30 minutes or less for most crawls. However, crawls over 2.5 million documents (URLs) can take up to 12-24 hours to import into the crawl explorer. Return to the Scope-It crawls tab to access the crawl explorer (crawls → scope-it crawls).
3. Analyze and exclude hosts using the crawl explorer. The scope-it crawl explorer will list all hosts discovered during the crawl. Next to each host you will see a number referring to the number of documents (URLs) discovered for that host. Click each host to view the types of files discovered.
To the left of each listed host you will see a check box. To completely remove the listed host from future crawls, click the check box. Once you have selected all hosts you want excluded in future crawls, you can review your choices in the excluded hosts tab.
4. Click "apply exclusions to existing collection" to apply the new exclusion rules to your collection. These new settings will take effect during the next scheduled crawl.
Search by Institution
From the public Archive-It site, partners and their patrons can now search all collections created by an institution. On the public Archive-It partner pages, your patrons will be able to search across all the collections you have created. The public will still be able to search across individual collections from public collection pages as well.
This search feature is currently available only from the public interface to the collections at www.archive-it.org. We will add this feature inside the web application in the near future.
Partners can include this style search on their own collection interfaces as well. Just ask the Archive-It partner specialist to send you the html code snippet you will need for your website.
Application Interface Changes
For Archive-It 3.0, we have developed a cleaner user experience throughout the application. All the existing functionality remains within Archive-It 3.0 (as well as some new features), however you will find that there have been some minor changes to the appearance of the application. If you have questions, please contact the Archive-It partner specialist at archive-itsupport at archive.org.
In addition to the minor changes, we have made two larger changes to Archive-It. More information is described below.
Running Test Crawls
There is a new process in place for running test crawls in Archive-It. Previously, partners scheduled seeds they wanted to test by assigning them to the test frequency option and manually starting test crawls. In the new version of Archive-It, there is no longer a test frequency option. However, there are still two ways partners can run test crawls.
(1) When partners are setting up a new collection they will now have the opportunity to run a test crawl on all seeds in the new collection. During the set up process, partners will select to run a test crawl instead of assigning a crawl frequency to the seeds.
At the end of the set up process, partners will select to "run a test crawl now" and a test crawl of up to 3 days will begin immediately.
Once the test crawl has finished and partners have reviewed the information, they can re-assign the seeds to their preferred frequencies. In the interim (while the test crawl is running and before seeds have been re-assigned), seeds will be available under the one-time frequency option.
(2) Partners can also run test crawls of selected seeds within a collection that has already been created. The test crawl process begins from the seed management page. Partners should select seeds they want to test by checking the boxes next to the appropriate seed. Once all desired seeds have been selected, partners will click the run test crawl link in the upper right hand corner of the screen.
Partners will have the ability to set the duration for the test crawl. The crawl can last from 10 minutes to 7 days. Keep in mind that most Archive-It crawls last 3 or 5 days; there is more information on these frequencies in the Archive-It help wiki. Partners can also specify a limit on the number of documents to be crawled in the test crawl instead of choosing a crawl duration.
Scope Rules
If a partner would like to expand the scope of their crawl using scope rules (also known as SURTs), this feature is now found on the expand scope tab under the modify crawl scope link. This type of expanded scope is generally used to crawl sub domains.
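SURT stands for Sort-friendly URI Reordering Transform: a URL's host labels are reversed so that related hosts sort and match together, which is what lets one rule cover a domain and its sub domains. As a rough illustration (a sketch only, not Archive-It's actual implementation; the helper name here is made up), converting a URL to a SURT prefix might look like this:

```python
# Illustrative sketch of SURT (Sort-friendly URI Reordering Transform)
# form, the notation used for scope rules. The function name and exact
# output shape are assumptions for this example.
from urllib.parse import urlparse

def to_surt_prefix(url):
    """Return a SURT-style prefix for the URL's host."""
    host = urlparse(url).hostname
    # Reverse the host labels: www.example.com -> com,example,www
    reversed_labels = ",".join(reversed(host.split(".")))
    return "http://(" + reversed_labels + ","

print(to_surt_prefix("http://www.example.com/"))
# http://(com,example,www,
```

Because the labels are reversed, a shorter prefix such as http://(com,example, would match example.com and every sub domain under it, which is why SURT rules suit the sub-domain crawling described above.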
Infrastructure Improvements
We have made some upgrades and enhancements behind the scenes at Archive-It. These changes do not require partners to use the system differently; however, you will notice improvements as a result of these structural enhancements.
Nutchwax 0.12
We have recently completed a major software upgrade of the Nutchwax software Archive-It relies on for full text search. Partners will notice a difference in the overall performance of the search engine as well as better ranking and relevancy in the search results.
With this new upgrade, partners will now also have the ability to search across all of an institution's collections from the public Archive-It site. You can access this new search feature from our public site at www.archive-it.org.
Please send the Archive-It team feedback on your experience with the new search technology.
De-Duplication Crawling Features
The Archive-It system is now running all crawls with the Heritrix de-duplication features turned on. This means that for all future crawls only new and/or changed content will be archived from one crawl to the next.
This feature will make an impact on how quickly partners use their data budgets, which partners will be able to monitor from the Partner Overview page.
Please note that document budgets will not change, since in order to discover all the new and changed content on a website, we still have to use crawler resources to explore your seeds completely.
Your first crawl after this 3.0 release will be a full crawl and will consider all content new and changed. After this initial crawl, only new and changed content will be stored.
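Conceptually, de-duplication works by comparing a digest (hash) of each document's content against the digest recorded in the previous crawl, and storing the document only when the content is new or changed. The sketch below is an illustration of that idea, not Archive-It's or Heritrix's actual implementation:

```python
import hashlib

# Conceptual sketch of digest-based de-duplication: store a document
# only if its content hash differs from the previous crawl's hash.
def dedup_crawl(documents, previous_digests):
    """documents: url -> payload bytes; previous_digests: url -> sha1 hex."""
    stored = {}
    new_digests = {}
    for url, payload in documents.items():
        digest = hashlib.sha1(payload).hexdigest()
        new_digests[url] = digest
        if previous_digests.get(url) != digest:
            stored[url] = payload  # new or changed content: archive it
        # unchanged content is skipped, conserving the data budget
    return stored, new_digests

# First crawl stores everything; an identical second crawl stores nothing.
first, digests = dedup_crawl({"http://example.com/": b"hello"}, {})
second, _ = dedup_crawl({"http://example.com/": b"hello"}, digests)
print(len(first), len(second))  # 1 0
```

This also shows why the document budget is unaffected: every URL must still be fetched and hashed to decide whether it changed, even when nothing new is stored.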
File Format
Archive-It will now be preserving data in the WARC file format. WARC is the next generation preservation format for web archiving after the ARC file format and has been made an ISO standard. Internet Archive's open source access tools for web archives (Wayback Machine and Nutchwax) work with both ARC and WARC files, so there will be no change in how partners and their patrons view or interact with the collections. More information about the WARC standard is available here: http://bibnum.bnf.fr/WARC/index.html
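For the curious, a WARC file is simply a sequence of records, each with a version line, named header fields, and a content block. The sketch below shows the rough shape of one record; the field names follow the WARC specification, but the values (URI, date, record ID) are made up for illustration:

```python
# Illustrative only: the approximate shape of a single WARC record.
# Real records include a proper WARC-Record-ID (a urn:uuid value) and
# the block usually holds a full HTTP response, headers included.
payload = b"<html>hello</html>"

record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"WARC-Date: 2009-03-03T00:00:00Z\r\n"
    b"Content-Type: application/http; msgtype=response\r\n"
    b"Content-Length: " + str(len(payload)).encode() + b"\r\n"
    b"\r\n"
    + payload + b"\r\n\r\n"
)
print(record.decode())
```

Because each record is self-describing in this way, tools like the Wayback Machine can read ARC and WARC files side by side without partners noticing any difference.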