Overview
The Archive-It Partner API provides access to information about Archive-It partners, collections, and crawls outside of your Archive-It account or pages on archive-it.org. Credentialed Archive-It users may query this data from Archive-It's database in a web browser or from the command line.
Partners retrieve this information in order to develop custom access layers, to manage administrative or descriptive metadata externally, or to run periodic account usage reports.
Prerequisites:
- How to set up and administer your Archive-It account
- How to create and manage a collection
- How to find your Archive-It collection's ID number
Instructions
Basic use
Access the Archive-It Partner API as a general public user or credentialed Archive-It partner at: https://partner.archive-it.org/api/
Data is stored in the tables described below.
Account
Data table |
Description |
Account-level partner metadata, such as organization name, data budget, and public access URL |
|
Information about an authorized Archive-It user's own account, such as username and account name, creation and update timestamps, and email address |
|
Authorized users of an Archive-It account, including user names, privilege levels, and email addresses |
Collections
Data table |
Description |
Information about Archive-It collections, including names, ID numbers, active statuses, and descriptive metadata |
|
Descriptive metadata values and fields added to individual documents in Archive-It collections, such as titles, creators, subjects, etc. |
|
Information about free text “notes” fields edited manually by Archive-It partners, including timestamps, text content, and creator usernames |
|
Log of W/ARC files manually uploaded to an Archive-It account, including filenames, collection numbers, timestamps, and checksum values
Seeds
Data table |
Description |
Metadata about all seeds in an Archive-It account, including URLs, seed types, creation and update timestamps, groups, and descriptive metadata |
|
Descriptive metadata field names and values, and the unique ID numbers for Archive-It seeds to which they are applied |
|
Unique ID numbers for seeds, their groups, and their individual group memberships |
|
Unique ID numbers for seed groups and their individual collection memberships |
|
/seed_group |
[DEPRECATED - See tables above] Information about partner-defined seed groupings, including group names, unique ID numbers, and relevant collection numbers |
Crawls
Data table |
Description |
Running log of Archive-It crawl jobs as they are initiated, including timestamps and creator usernames |
|
Information created by partners about Archive-It crawls, including relevant seed URLs and types, frequencies, crawl limits, and scoping rules |
|
Information created by partners about Archive-It crawls, including relevant seed URLs and types, frequencies, and scoping rules |
|
Machine-generated information about Archive-It crawls, such as data and document counts and rates, start and end times, and host crawling machines |
|
Data logged for Archive-It crawls before and after any manual resumption, including data and document counts, time elapsed, and end status |
|
Status information about Archive-It crawling machines, such as host names, total and available memory, and crawl jobs running |
|
List of current automated crawl frequencies available to Archive-It partners |
|
Log of limit modifications made to running Archive-It crawls, such as changes to time, data, or document limits |
|
List of URLs available for patch crawling, including dates retrieved through Wayback QA and source archived pages |
|
Log of missing URL records and their source crawl jobs |
|
List of seeds crawled with the "Standard+" seed type and corresponding log of their crawl jobs |
|
Log of patch crawls run from an Archive-It crawl's Hosts report, including relevant host names, document counts, and originating crawl jobs |
|
/reports/host/<id> |
Summary information shown under the Archive-It post-crawl report's Hosts tab, including host names; document counts; and blocked, queued, and out-of-scope lists
/reports/mimetype/<id> |
Summary information shown under the Archive-It post-crawl report's Filetypes tab, including MIME types and their respective data and document counts
/reports/seed/<id> |
Summary information shown under the Archive-It post-crawl report's Seeds tab, including seed URLs and their respective crawling status and data/document counts |
|
Log of scheduled crawl jobs, including unique ID numbers, timestamps scheduled, and timestamps initiated |
|
List of all collection- and seed-level scoping rules, including rule types, creators, and creation and update timestamps |
|
Log of URLs, HTTP status codes, and unique crawl and seed ID numbers for seeds in Archive-It crawls |
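The three /reports tables listed above take an identifier in the URL path rather than as a query parameter. The following is a minimal sketch of composing those URLs in Python. It assumes that <id> is the numeric ID of a crawl job, which this page does not state explicitly, and report_url is a hypothetical helper name rather than part of the API.

```python
API_ROOT = "https://partner.archive-it.org/api"

# Report names as they appear in the Crawls table above.
REPORT_TYPES = ("host", "mimetype", "seed")


def report_url(report_type: str, report_id: int) -> str:
    """Return the URL of one post-crawl report table.

    Assumption: report_id is the ID of a crawl job.
    """
    if report_type not in REPORT_TYPES:
        raise ValueError(f"unknown report type: {report_type!r}")
    return f"{API_ROOT}/reports/{report_type}/{int(report_id)}"
```

For example, report_url("seed", 12345) builds the address of the Seeds report for a placeholder ID of 12345; substitute an ID from your own account.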
Archive-It Research Services (ARS) [DEPRECATED]
Data table |
Description |
Status of any ARS data derivation requests made through the Archive-It web application. This is a deprecated feature. To derive datasets from web archive collections, see: Archives Research Compute Hub (ARCH). |
Tables used by the Internet Archive
Data in these tables is used by Internet Archive staff to maintain the Archive-It web application and public website, and so may not be visible to partners:
/feature_item
/press_item
/research_services_request
/webinar
Response formats
Data is delivered in JSON format by default. To receive a response in XML or CSV instead, add the format parameter to your query, like so:
- https://partner.archive-it.org/api/collection?format=xml
- https://partner.archive-it.org/api/collection?state=ACTIVE&pluck=name&sort=created_date&format=csv
NB: Users of the Firefox web browser might need to manually add the .csv file extension to any downloaded data in order to view it as a spreadsheet.
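The format choice above can also be handled in a script. The following is a rough Python sketch, not an official client: table_url and parse_body are hypothetical helper names, and it assumes that JSON responses are arrays of objects and that CSV responses begin with a header row. It only builds URLs and parses response text, so fetching the data (and any authentication) is left to the tool of your choice.

```python
import csv
import io
import json
from urllib.parse import urlencode

API_ROOT = "https://partner.archive-it.org/api"


def table_url(table: str, fmt: str = "json", **filters) -> str:
    """Build a URL for a data table, requesting the given response format."""
    if fmt not in ("json", "xml", "csv"):
        raise ValueError(f"unsupported format: {fmt!r}")
    params = dict(filters)
    if fmt != "json":  # JSON is the default, so the parameter is omitted
        params["format"] = fmt
    query = urlencode(params)
    return f"{API_ROOT}/{table}" + (f"?{query}" if query else "")


def parse_body(body: str, fmt: str):
    """Parse a response body into a list of rows (JSON or CSV only)."""
    if fmt == "json":
        return json.loads(body)
    if fmt == "csv":
        # Assumption: the first CSV row holds the column names.
        return list(csv.DictReader(io.StringIO(body)))
    raise ValueError(f"no parser for format: {fmt!r}")
```

Calling table_url("collection", "xml") reproduces the first example URL above.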
Advanced use
Filtering and sorting responses
Each attribute and value in a data table can be used to filter API queries.
For instance, the home page of the Archive-It web application includes a current list of active collections within the partner’s account. The collection data table includes the state attribute to indicate whether a collection’s status is ACTIVE or INACTIVE, so a query for data about active collections looks like this:
- https://partner.archive-it.org/api/collection?state=ACTIVE
Use the pluck function to restrict any API response to a single attribute's values. For instance, to retrieve only the name values from the same array of active collections above:
- https://partner.archive-it.org/api/collection?state=ACTIVE&pluck=name
Use the sort function to order any response alphabetically or numerically. For instance, to retrieve the collection names above in the order that each collection was created:
- https://partner.archive-it.org/api/collection?state=ACTIVE&pluck=name&sort=created_date
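The filter, pluck, and sort options described above are ordinary query-string parameters, so they can be assembled programmatically. Here is a minimal Python sketch under that assumption; build_query is a hypothetical helper name, and the attribute names (state, name, created_date) are the ones that appear in the CSV example under Response formats.

```python
from urllib.parse import urlencode

API_ROOT = "https://partner.archive-it.org/api"


def build_query(table, filters=None, pluck=None, sort=None):
    """Compose a query URL from attribute filters plus optional pluck and sort."""
    params = dict(filters or {})  # attribute=value pairs, e.g. {"state": "ACTIVE"}
    if pluck:
        params["pluck"] = pluck   # return only this attribute's values
    if sort:
        params["sort"] = sort     # order results by this attribute
    query = urlencode(params)
    return f"{API_ROOT}/{table}" + (f"?{query}" if query else "")


# Names of active collections, in the order the collections were created.
url = build_query("collection", {"state": "ACTIVE"}, pluck="name", sort="created_date")
```

Building the query string with urlencode rather than by hand keeps values with spaces or special characters correctly escaped.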
Response limits
Archive-It Partner API calls are limited to 100 results by default. You may specify other limits or remove limits entirely.
For instance, an Archive-It Partner API call for all seed URLs in an account, in the order that they were created, returns only the first 100 results when no limit is specified.
It returns all of the matching results when the limit value is set to -1.
And it may be restricted to return only the first five seeds when the limit is set to 5.
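The three limit cases above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a definitive recipe: it assumes the query parameter is named limit, and the example base URL assumes a seed table with a url attribute that can be sorted by created_date. with_limit is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_limit(url: str, limit=None) -> str:
    """Return url with its limit parameter set, replaced, or removed.

    limit=None leaves the API default (100 results) in place;
    limit=-1 asks for every matching result.
    """
    parts = urlsplit(url)
    # Drop any existing limit so the new value replaces it.
    params = [(k, v) for k, v in parse_qsl(parts.query) if k != "limit"]
    if limit is not None:
        params.append(("limit", str(limit)))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Assumed table and attribute names, for illustration only.
BASE = "https://partner.archive-it.org/api/seed?pluck=url&sort=created_date"

default_100 = with_limit(BASE)       # no limit parameter: capped at 100 results
everything = with_limit(BASE, -1)    # all matching results
first_five = with_limit(BASE, 5)     # only the first five results
```

Removing the limit on a large account can produce a very large response, so a specific limit is often the more practical choice for quick checks.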