Archive-It partners may use the Partner API to access the same information about their accounts, collections, and crawls that is available through the Archive-It web application. Credentialed users of the web application may retrieve this data from Archive-It’s back-end database in a web browser or from the command line.
Partners retrieve this information in order to develop their own custom access layers, to manage administrative metadata from disparate modules across the web application, or to preserve technical and descriptive metadata. If you require further assistance with functions or new uses of the API, see: More information. For instructions specific to accessing the Archive-It Wayback index, the full-text search index, or your WARC files in storage, see the further options in: Archive-It APIs and integrations.
Basic use
Data may be retrieved by authenticated users of the Archive-It web application at: https://partner.archive-it.org/api/
This root API endpoint provides access to data stored among the tables described below. Data is presented in JSON format by default, but may be retrieved in other formats. Follow the endpoints provided in order to see and parse each data table’s full contents.
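For command-line or scripted use, a request to the root endpoint might be prepared as in the minimal Python sketch below. It assumes that the API accepts HTTP Basic authentication with the same username and password used for the web application; that assumption is not confirmed by this article. The helper names build_request and fetch_json are hypothetical.

```python
import base64
import json
import urllib.request

API_ROOT = "https://partner.archive-it.org/api/"


def build_request(username, password, table=""):
    """Prepare a GET request for the API root or for one data table."""
    # ASSUMPTION: HTTP Basic authentication with web application credentials.
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(API_ROOT + table)
    request.add_header("Authorization", f"Basic {token}")
    return request


def fetch_json(request):
    """Send the request and decode the JSON body (performs a network call)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling fetch_json(build_request("your-username", "your-password")) would then return the decoded JSON for the root endpoint, provided the authentication assumption holds for your account.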
Data tables
Account
Data table |
Description |
Account-level partner metadata, such as organization name, data budget, and public access URL |
|
Information about the authorized Archive-It user's own account, such as username and account name, creation and update timestamps, and email address |
|
Authorized users of Archive-It account, including user names, privilege levels, and email addresses |
Collections
Data table |
Description |
Information about Archive-It collections, including names, ID numbers, active statuses, and descriptive metadata |
|
Descriptive metadata values and fields added to individual documents in Archive-It collections, such as titles, creators, subjects, etc. |
|
Information about free text “notes” fields edited manually by Archive-It partners, including timestamps, text content, and creator usernames |
|
Metadata about all seeds in an Archive-It account, including URLs, seed types, creation timestamps, and descriptive metadata |
|
Information about partner-defined seed groupings, including group names, unique ID numbers, and relevant collection numbers |
|
Descriptive metadata field names and values, and the unique ID numbers for Archive-It seeds to which they are applied |
|
Log of W/ARC files manually uploaded to Archive-It account, including filenames, collection numbers, timestamps, and checksum values
Crawls
Data table |
Description |
Running log of Archive-It crawl jobs as they are initiated, including timestamps and creator usernames |
|
Information created by partners about Archive-It crawls, including relevant seed URLs and types, frequencies, crawl limits, and scoping rules |
|
Information created by partners about Archive-It crawls, including relevant seed URLs and types, frequencies, and scoping rules |
|
Machine-generated information about Archive-It crawls, such as data and document counts and rates, start and end times, and host crawling machines |
|
Data logged for Archive-It crawls before and after any manual resumption, including data and document counts, time elapsed, and end status |
|
Status information about Archive-It crawling machines, such as host names, total and available memory, and crawl jobs running |
|
List of current automated crawl frequencies available to Archive-It partners |
|
Log of limit modifications made to running Archive-It crawls, such as changes to time, data, or document limits |
|
List of URLs available for patch crawling, including dates retrieved through Wayback QA and source archived pages |
|
Log of missing URL records and their source crawl jobs |
|
List of seeds crawled with the "Standard+" seed type and corresponding log of their crawl jobs |
|
Log of patch crawls run from an Archive-It crawl's Hosts report, including relevant host names, document counts, and originating crawl jobs |
|
/reports/host/<id> |
Summary information shown under the Archive-It post-crawl report's Hosts tab, including host names, document counts, and blocked, queued, and out-of-scope lists |
/reports/mimetype/<id> |
Summary information shown under the Archive-It post-crawl report's Filetypes tab, including MIME types and their respective data and document counts |
/reports/seed/<id> |
Summary information shown under the Archive-It post-crawl report's Seeds tab, including seed URLs and their respective crawling status and data/document counts |
Log of scheduled crawl jobs, including unique ID numbers, timestamps scheduled, and timestamps initiated |
|
List of all collection- and seed-level scoping rules, including rule types, creators, and creation and update timestamps |
|
Log of URLs, HTTP status codes, and unique crawl and seed ID numbers for seeds in Archive-It crawls |
Archive-It Research Services (ARS)
Data table |
Description |
Current status of any ARS data derivation requests made through the Archive-It web application |
For more information about requesting and retrieving Web ARChive (WARC) and derivative data files, see: Find and download your WARC files with WASAPI
Staff
Data in these tables is used by Internet Archive staff for maintenance of the Archive-It web application and public website, and so may not be visible to partners:
/feature_item
/press_item
/research_services_request
/webinar
Advanced use
Filtering and sorting
Each data table’s attributes and values may be used as filters to refine API calls. Adding these attribute names and values as operators (query parameters) to the endpoint URL will return more specific data.
For instance, the home page of the Archive-It web application includes a current list of active collections within the partner’s account. The collection data table includes the state attribute to indicate whether a collection’s status is “ACTIVE” or “INACTIVE,” so an API call for data about active collections only can be constructed as:
https://partner.archive-it.org/api/collection?state=ACTIVE
The pluck function may also be used to further restrict the display of data to specified attributes only. For instance, to retrieve only the name values of the active collections above, as they are seen in the Archive-It web application, an API call can be constructed as:
https://partner.archive-it.org/api/collection?state=ACTIVE&pluck=name
Regardless of the specified display data, these responses may be organized by any data table attribute with a sort operator. For instance, to retrieve the collection names above in the order that each collection was created, an API call can be constructed as:
https://partner.archive-it.org/api/collection?state=ACTIVE&pluck=name&sort=created_date
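The same URLs can be assembled programmatically. The Python sketch below builds an API call from a data table name, a set of attribute filters, and the pluck and sort operators described above; the function name build_query_url is hypothetical and is not part of the API.

```python
from urllib.parse import urlencode

API_ROOT = "https://partner.archive-it.org/api/"


def build_query_url(table, filters=None, pluck=None, sort=None):
    """Return an API URL for one data table with optional operators."""
    # Attribute filters come first, in the order given (e.g. state=ACTIVE).
    params = dict(filters or {})
    if pluck:
        params["pluck"] = pluck  # restrict output to one attribute
    if sort:
        params["sort"] = sort  # order results by an attribute
    query = urlencode(params)
    return API_ROOT + table + ("?" + query if query else "")


print(build_query_url("collection", {"state": "ACTIVE"},
                      pluck="name", sort="created_date"))
# prints https://partner.archive-it.org/api/collection?state=ACTIVE&pluck=name&sort=created_date
```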
Format
The Archive-It Partner API delivers data in JSON format by default. However, an additional operator may be employed to retrieve the same data in XML or CSV format.
For example, to retrieve all of the same data about the Archive-It collections above as XML rather than JSON, an API call can be constructed as:
https://partner.archive-it.org/api/collection?format=xml
Or, to retrieve the fully filtered and sorted data about the collections above and download them in a CSV file:
https://partner.archive-it.org/api/collection?state=ACTIVE&pluck=name&sort=created_date&format=csv
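As a sketch, the format operator can be appended to an existing API call in Python as shown below. The helper name with_format is hypothetical. It accepts only the two alternative formats named in this article (xml and csv), since JSON is returned when no format is given.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

ALTERNATIVE_FORMATS = ("xml", "csv")


def with_format(url, fmt):
    """Return the same API call with a format operator added or replaced."""
    if fmt not in ALTERNATIVE_FORMATS:
        raise ValueError(f"expected one of {ALTERNATIVE_FORMATS}, got {fmt!r}")
    parts = urlsplit(url)
    # Keep the existing operators in order, dropping any earlier format value.
    params = [(k, v) for k, v in parse_qsl(parts.query) if k != "format"]
    params.append(("format", fmt))
    return urlunsplit(parts._replace(query=urlencode(params)))


print(with_format("https://partner.archive-it.org/api/collection", "xml"))
# prints https://partner.archive-it.org/api/collection?format=xml
```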
Please note that users of the Firefox web browser might need to manually add the .csv file extension to any downloaded data in order to view it as a spreadsheet.
Limits
Please note that all Archive-It Partner API calls are limited to 100 results by default. If not specified otherwise, all API calls will include an &limit=100 filter automatically. You may include other limits in API calls manually, or remove them with the &limit=-1 filter.
For instance, an Archive-It Partner API call for all seed URLs in an account, in the order that they were created, will cap the results at the 100th when constructed as:
https://partner.archive-it.org/api/seed?sort=created_date&pluck=url
It will return all of the matching results when constructed as:
https://partner.archive-it.org/api/seed?sort=created_date&pluck=url&limit=-1
And it may be restricted to return only the first five seed URLs created as:
https://partner.archive-it.org/api/seed?sort=created_date&pluck=url&limit=5
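The limit rules above can be summarized in a short Python sketch that inspects an API call and reports how many results it would return at most. The function name effective_limit is hypothetical, and the function only models the behavior described in this section without contacting the API.

```python
from urllib.parse import parse_qs, urlsplit

DEFAULT_LIMIT = 100  # applied automatically when no limit is specified


def effective_limit(url):
    """Return the maximum number of results for a call, or None if uncapped."""
    values = parse_qs(urlsplit(url).query).get("limit")
    if not values:
        return DEFAULT_LIMIT
    limit = int(values[-1])
    # A limit of -1 removes the cap entirely.
    return None if limit == -1 else limit


base = "https://partner.archive-it.org/api/seed?sort=created_date&pluck=url"
print(effective_limit(base))                # prints 100
print(effective_limit(base + "&limit=-1"))  # prints None
print(effective_limit(base + "&limit=5"))   # prints 5
```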
More information
Video demonstration
For a recorded introduction to and demonstration of the API at work, complete with Q&A with Archive-It partners, see: Archive-It Advanced Training: Introduction to the Archive-It Partner API.
Help
Contact Archive-It’s Web Archivists directly if you require further assistance using the Partner API to retrieve specific data from your account.
For instructions specific to accessing the Archive-It Wayback index, the full-text search index, or your WARC files in storage, see the further options in: Archive-It APIs and integrations.