Overview
The Archive-It Partner API provides access to information about Archive-It partners, collections, and crawls from outside of your Archive-It account and the pages on archive-it.org. Credentialed Archive-It users can query this data from Archive-It's database in a web browser or from the command line.
Partners can retrieve this information to develop custom access layers, manage administrative or descriptive metadata externally, or run periodic account usage reports.
Prerequisites:
- How to set up and administer your Archive-It account
- How to create and manage a collection
- How to find your Archive-It collection's ID number
Instructions
Basic use
Access the Archive-It Partner API as a public user or credentialed Archive-It partner at https://partner.archive-it.org/api/
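As a minimal sketch of a request from a script rather than a browser, the Python below joins the API root above with a data table path and decodes the default JSON response. It assumes that each data table is addressed by appending its path to the API root and that the table is readable as a public user; the helper names `table_url` and `fetch_table` are illustrative and not part of the API.

```python
import json
from urllib.request import urlopen

BASE_URL = "https://partner.archive-it.org/api/"


def table_url(table: str) -> str:
    """Join the API root and a data table path such as "/user"."""
    return BASE_URL + table.strip("/")


def fetch_table(table: str):
    """Request one data table and decode the JSON response.

    Assumption: the table is readable without credentials. Tables that
    require a partner login will need authentication added here.
    """
    with urlopen(table_url(table)) as response:  # performs a network request
        return json.load(response)


# Building the URL does not contact the server.
print(table_url("/seed_group"))
```

Calling `fetch_table("/seed_group")` would perform the actual request; it is left uncalled here so that the sketch runs offline.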
Data is stored in the following tables.
Account
| Data table | Description |
| --- | --- |
| | Account-level partner metadata, such as organization name, data budget, and public access URL |
| | Information about an authorized Archive-It user's own account, such as username and account name, creation and update timestamps, and email address |
| /user | Authorized users of an Archive-It account, including usernames, access levels, and email addresses |
Collections
| Data table | Description |
| --- | --- |
| | Information about Archive-It collections, including names, ID numbers, active statuses, and descriptive metadata |
| | Descriptive metadata fields and values added to individual documents in Archive-It collections, such as titles, creators, and subjects |
| | Information about free text “notes” fields added by Archive-It partners, including timestamps, text, and creator usernames |
| | Log of W/ARC files manually uploaded to an Archive-It account, including filenames, collection numbers, timestamps, and checksum values |
Seeds
| Data table | Description |
| --- | --- |
| | Metadata about all seeds in an Archive-It account, including URLs, seed types, creation and update timestamps, groups, and descriptive metadata |
| | Descriptive metadata field names and values, and the unique ID numbers for Archive-It seeds to which they are applied |
| | Unique ID numbers for seeds, their groups, and their individual group memberships |
| | Unique ID numbers for seed groups and their individual collection memberships |
| /seed_group | [DEPRECATED - See tables above] Information about partner-defined seed groupings, including group names, unique ID numbers, and relevant collection numbers |
Crawls
| Data table | Description |
| --- | --- |
| | Running log of Archive-It crawl jobs as they are initiated, including timestamps and creator usernames |
| | Information created by partners about Archive-It crawls, including relevant seed URLs and types, frequencies, crawl limits, and scoping rules |
| | Information created by partners about Archive-It crawls, including relevant seed URLs and types, frequencies, and scoping rules |
| | Machine-generated information about Archive-It crawls, such as data and document counts and rates, start and end times, and host crawling machines |
| | Data logged for Archive-It crawls before and after any manual resumption, including data and document counts, time elapsed, and end status |
| | Status information about Archive-It crawling machines, such as host names, total and available memory, and crawl jobs running |
| | List of current automated crawl frequencies available to Archive-It partners |
| | Log of limit modifications made to running Archive-It crawls, such as changes to time, data, or document limits |
| | List of URLs available for patch crawling, including dates retrieved through Wayback QA and source archived pages |
| | Log of missing URL records and their source crawl jobs |
| | List of seeds crawled with the "Standard+" seed type and corresponding log of their crawl jobs |
| | Log of patch crawls run from an Archive-It crawl's Hosts report, including relevant host names, document counts, and originating crawl jobs |
| /reports/host/<id> | Summary information shown under the Archive-It post-crawl report's Hosts tab, including host names, document counts, and blocked, queued, and out-of-scope lists |
| /reports/mimetype/<id> | Summary information shown under the Archive-It post-crawl report's Filetypes tab, including MIME types and their respective data and document counts |
| /reports/seed/<id> | Summary information shown under the Archive-It post-crawl report's Seeds tab, including seed URLs and their respective crawling status and data/document counts |
| | Log of scheduled crawl jobs, including unique ID numbers, timestamps scheduled, and timestamps initiated |
| | List of all collection- and seed-level scoping rules, including rule types, creators, and creation and update timestamps |
| | Log of URLs, HTTP status codes, and unique crawl and seed ID numbers for seeds in Archive-It crawls |
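The three report tables take an ID in their path. The sketch below shows one way to build those URLs in Python. It assumes that `<id>` is the ID number of the crawl job whose post-crawl report you want, and that report paths are appended to the API root at https://partner.archive-it.org/api/; the helper name `report_url` and the ID `123456` are hypothetical.

```python
BASE_URL = "https://partner.archive-it.org/api/"

# The three report tables named in the Crawls table.
REPORT_TYPES = ("host", "mimetype", "seed")


def report_url(report_type: str, crawl_id: int) -> str:
    """Build the URL of one post-crawl report table.

    Assumption: <id> in /reports/<type>/<id> is a crawl job ID.
    """
    if report_type not in REPORT_TYPES:
        raise ValueError(f"unknown report type: {report_type!r}")
    return f"{BASE_URL}reports/{report_type}/{crawl_id}"


# 123456 is a placeholder, not a real crawl job ID.
for report_type in REPORT_TYPES:
    print(report_url(report_type, 123456))
```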
Archive-It Research Services (ARS) [DEPRECATED]
| Data table | Description |
| --- | --- |
| | This feature is deprecated. To derive datasets from web archive collections, see Archives Research Compute Hub (ARCH). |
Tables used by the Internet Archive
Data in the following tables is used by Internet Archive staff for maintenance and may not be visible to partners:
- /feature_item
- /limits_mod
- /press_item
- /research_services_request
- /user
- /warc_file
- /webinar
Response formats
Data is delivered in JSON format by default. You may alternatively specify another format in your query to receive the response as XML or CSV.
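The original example for this step is not reproduced here, so the following Python is only a sketch of how a format might be requested. It assumes the format is passed as a query parameter named `format` with the values `json`, `xml`, or `csv`; that parameter name and the helper `table_url` are assumptions, not confirmed by the text above.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://partner.archive-it.org/api/"
FORMATS = ("json", "xml", "csv")


def table_url(table: str, response_format: Optional[str] = None) -> str:
    """Build a data table URL, optionally asking for XML or CSV."""
    url = BASE_URL + table.strip("/")
    if response_format is None:
        return url  # no format given: the API answers in JSON by default
    if response_format not in FORMATS:
        raise ValueError(f"unsupported format: {response_format!r}")
    # Assumption: the query parameter is called "format".
    return url + "?" + urlencode({"format": response_format})


print(table_url("/user"))
print(table_url("/user", "csv"))
```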
Advanced use
Filtering and sorting responses
Each attribute and value in a data table can be used to filter API queries.
For instance, the home page of the Archive-It web application includes a current list of active collections within the partner’s account. The collection data table includes the state attribute, which indicates whether a collection’s status is ACTIVE or INACTIVE. A query for data about active collections therefore filters the collection data table on the state attribute with the value ACTIVE.
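A filter of this kind can be sketched in Python as an attribute=value pair in the query string. The sketch assumes the collection data table lives at the path `collection` under the API root https://partner.archive-it.org/api/ and that filters are ordinary query parameters; the helper name `filtered_url` is illustrative.

```python
from urllib.parse import urlencode

BASE_URL = "https://partner.archive-it.org/api/"


def filtered_url(table: str, **filters: str) -> str:
    """Build a query that keeps only rows matching every attribute=value pair."""
    url = BASE_URL + table.strip("/")
    return url + "?" + urlencode(filters) if filters else url


# Assumption: the collection data table is served at "collection".
print(filtered_url("collection", state="ACTIVE"))
```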
Use the pluck function to restrict any API response to the values of a single attribute. For instance, plucking the name attribute from the query for active collections above retrieves only the collections' name values.
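As a hedged sketch, pluck can be modeled as one more query parameter added to the filter on active collections. The parameter name `pluck`, the `collection` path, and the helper `plucked_url` are assumptions made for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://partner.archive-it.org/api/"


def plucked_url(table: str, attribute: str, **filters: str) -> str:
    """Build a filtered query that returns only one attribute's values."""
    params = dict(filters)
    params["pluck"] = attribute  # assumption: pluck is a query parameter
    return BASE_URL + table.strip("/") + "?" + urlencode(params)


# Only the name values of active collections.
print(plucked_url("collection", "name", state="ACTIVE"))
```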
Use the sort function to order any response alphabetically or numerically. For instance, the sort function can return the collection names above in the order that each collection was created.
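The sketch below adds a sort to the same query in Python. It assumes `sort` is a query parameter that takes an attribute name; the attribute that records creation order is not named in the text above, so `created_date` is a hypothetical placeholder to replace with the real attribute from the collection data table.

```python
from urllib.parse import urlencode

BASE_URL = "https://partner.archive-it.org/api/"


def sorted_names_url(sort_field: str) -> str:
    """Build a query for active collection names ordered by one attribute."""
    params = {"state": "ACTIVE", "pluck": "name", "sort": sort_field}
    # Assumption: the collection data table is served at "collection".
    return BASE_URL + "collection?" + urlencode(params)


# "created_date" is a hypothetical attribute name used for illustration.
print(sorted_names_url("created_date"))
```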
Response limits
Archive-It Partner API calls are limited to 100 results by default. You may specify other limits or remove limits entirely.
For instance, an Archive-It Partner API call for all seed URLs in an account, in the order that they were created, will return only the first 100 results when it specifies no limit.
It will return all of the matching results when the limit value is set to -1.
And it can be restricted to return only the first five seeds when the limit value is set to 5.
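The three cases can be sketched side by side in Python. This assumes that the seed data table is served at the path `seed` under https://partner.archive-it.org/api/, that seed URLs are held in an attribute named `url`, and that `limit`, `pluck`, and `sort` are query parameters; the sort attribute `created_date` and the helper `seed_urls_query` are hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://partner.archive-it.org/api/"


def seed_urls_query(limit: Optional[int] = None,
                    sort_field: str = "created_date") -> str:
    """Build a query for seed URLs, optionally overriding the default limit.

    limit=None leaves the default cap of 100 results in place,
    limit=-1 asks for every matching result, and any positive
    number caps the response at that many results.
    """
    # Assumptions: "url" and "created_date" are attribute names.
    params = {"pluck": "url", "sort": sort_field}
    if limit is not None:
        params["limit"] = limit
    return BASE_URL + "seed?" + urlencode(params)


print(seed_urls_query())          # default cap of 100 results
print(seed_urls_query(limit=-1))  # no cap
print(seed_urls_query(limit=5))   # first five seeds only
```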