Access your account with the Archive-It Partner API

Overview

The Archive-It Partner API provides access to information about Archive-It partners, collections, and crawls from outside of the Archive-It web application and the pages on archive-it.org. Credentialed Archive-It users can query this data from Archive-It's database in a web browser or from the command line.

Partners can retrieve this information to develop custom access layers, manage administrative or descriptive metadata externally, or run periodic account usage reports.

Prerequisites:

Instructions

Basic use

Access the Archive-It Partner API as a public user or credentialed Archive-It partner at https://partner.archive-it.org/api/
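As a rough illustration of the same request made from a script rather than a browser, the sketch below builds the URL for one data table and decodes the response. The helper names `table_url` and `fetch_table` are hypothetical, the response is assumed to be JSON (the default described under Response formats), and authentication is left out because this article does not describe how credentials are supplied.

```python
import json
from urllib.request import urlopen

# Base URL given in this article.
API_ROOT = "https://partner.archive-it.org/api/"


def table_url(table: str) -> str:
    """Return the URL for a data table, e.g. "collection" or "/collection"."""
    return API_ROOT + table.strip("/")


def fetch_table(table: str, timeout: float = 30.0):
    """Request a table and decode the JSON body (makes a network call).

    Assumption: the table is readable without credentials. Tables that
    require a partner login will need authentication added here.
    """
    with urlopen(table_url(table), timeout=timeout) as response:
        return json.load(response)


print(table_url("/collection"))
# https://partner.archive-it.org/api/collection
```

Calling `fetch_table("collection")` would then return the decoded response for the /collection table, subject to the access assumption noted above.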

Data is stored in tables, which are grouped below into the following categories:

Account
Collections
Seeds
Crawls
Archive-It Research Services (ARS)
Staff

Account

Data table	Description
/account	Account-level partner metadata, such as organization name, data budget, and public access URL
/auth/list	Information about an authorized Archive-It user's own account, such as username and account name, creation and update timestamps, and email address
/user	Authorized users of an Archive-It account, including usernames, access levels, and email addresses

Collections

Data table	Description
/collection	Information about Archive-It collections, including names, ID numbers, active statuses, and descriptive metadata
/document_metadata	Descriptive metadata fields and values added to individual documents in Archive-It collections, such as titles, creators, and subjects
/note	Information about free-text “notes” fields added by Archive-It partners, including timestamps, text, and creator usernames
/warc_upload	Log of W/ARC files manually uploaded to an Archive-It account, including filenames, collection numbers, timestamps, and checksum values

Seeds

Data table	Description
/seed	Metadata about all seeds in an Archive-It account, including URLs, seed types, creation and update timestamps, groups, and descriptive metadata
/seed_metadata	Descriptive metadata field names and values, and the unique ID numbers for Archive-It seeds to which they are applied
/seed_seed_group	Unique ID numbers for seeds, their groups, and their individual group memberships
/seed_group_collection	Unique ID numbers for seed groups and their individual collection memberships
/seed_group	[DEPRECATED - See tables above] Information about partner-defined seed groupings, including group names, unique ID numbers, and relevant collection numbers

Crawls

Data table	Description
/changelog	Running log of Archive-It crawl jobs as they are initiated, including timestamps and creator usernames
/crawl_config_snapshot	Information created by partners about Archive-It crawls, including relevant seed URLs and types, frequencies, crawl limits, and scoping rules
/crawl_definition	Information created by partners about Archive-It crawls, including relevant seed URLs and types, frequencies, and scoping rules
/crawl_job	Machine-generated information about Archive-It crawls, such as data and document counts and rates, start and end times, and host crawling machines
/crawl_job_run	Data logged for Archive-It crawls before and after any manual resumption, including data and document counts, time elapsed, and end status
/crawling_machine	Status information about Archive-It crawling machines, such as host names, total and available memory, and crawl jobs running
/frequency	List of current automated crawl frequencies available to Archive-It partners
/limits_mod	Log of limit modifications made to running Archive-It crawls, such as changes to time, data, or document limits
/missing_url	List of URLs available for patch crawling, including dates retrieved through Wayback QA and source archived pages
/missing_url_patch_crawl_seed	Log of missing URL records and their source crawl jobs
/one_off_seed	List of seeds crawled with the "Standard+" seed type and corresponding log of their crawl jobs
/patch_crawl_host	Log of patch crawls run from an Archive-It crawl's Hosts report, including relevant host names, document counts, and originating crawl jobs
/reports/host/<id>	Summary information shown under the Archive-It post-crawl report's Hosts tab, including host names, document counts, and blocked, queued, and out-of-scope lists
/reports/mimetype/<id>	Summary information shown under the Archive-It post-crawl report's Filetypes tab, including MIME types and their respective data and document counts
/reports/seed/<id>	Summary information shown under the Archive-It post-crawl report's Seeds tab, including seed URLs and their respective crawling status and data/document counts
/scheduled_crawl_event	Log of scheduled crawl jobs, including unique ID numbers, timestamps scheduled, and timestamps initiated
/scope_rule	List of all collection- and seed-level scoping rules, including rule types, creators, and creation and update timestamps
/seed_report_entry	Log of URLs, HTTP status codes, and unique crawl and seed ID numbers for seeds in Archive-It crawls

Archive-It Research Services (ARS) [DEPRECATED]

Data table	Description
/collection_research_services	This feature is deprecated. To derive datasets from web archive collections, see Archives Research Compute Hub (ARCH).

Tables used by the Internet Archive

Data in the following tables is used by Internet Archive staff for maintenance and may not be visible to partners:

/feature_item
/limits_mod
/press_item
/research_services_request
/user
/warc_file
/webinar

Response formats

Data is delivered in JSON format by default. You may alternatively specify XML or CSV as the format of your response.

Advanced use

Filtering and sorting responses

Each attribute and value in a data table can be used to filter API queries.

For instance, the home page of the Archive-It web application includes a current list of active collections within the partner’s account. The collection data table includes the state attribute, which indicates whether a collection’s status is ACTIVE or INACTIVE. A query for data about active collections looks like this:

https://partner.archive-it.org/api/collection?state=ACTIVE
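When queries like the one above are assembled in a script, the attribute filters can be passed as keyword arguments and encoded into the query string. This is a minimal sketch: `filtered_url` is a hypothetical helper, not part of the API, and it only constructs the URL without sending a request.

```python
from urllib.parse import urlencode

# Base URL given in this article.
API_ROOT = "https://partner.archive-it.org/api/"


def filtered_url(table: str, **filters) -> str:
    """Build a table URL whose query string filters on attribute=value pairs."""
    url = API_ROOT + table.strip("/")
    query = urlencode(filters)  # percent-encodes values that need it
    return f"{url}?{query}" if query else url


print(filtered_url("collection", state="ACTIVE"))
# https://partner.archive-it.org/api/collection?state=ACTIVE
```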

Use the pluck function to restrict an API response to the values of a single attribute. For instance, to retrieve only the name values from the same array of active collections above:

https://partner.archive-it.org/api/collection?state=ACTIVE&pluck=name

Use the sort function to order a response alphabetically or numerically. For instance, to retrieve the collection names above in the order that each collection was created:

https://partner.archive-it.org/api/collection?state=ACTIVE&pluck=name&sort=created_date
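To make the combined effect of the three parameters above concrete, the sketch below imitates them locally on a small list of records. It does not call the API. The records are hypothetical sample data (only the attribute names name, state, and created_date come from this article), the date format is invented for illustration, and the exact shape of a real response may differ.

```python
# Hypothetical sample records; only the attribute names come from the article.
collections = [
    {"name": "Collection B", "state": "ACTIVE", "created_date": "2021-03-01"},
    {"name": "Collection A", "state": "INACTIVE", "created_date": "2019-06-15"},
    {"name": "Collection C", "state": "ACTIVE", "created_date": "2018-11-30"},
]


def query(records, pluck=None, sort=None, **filters):
    """Local imitation of attribute filters, sort, and pluck."""
    # Keep only the records whose attributes match every filter.
    rows = [r for r in records if all(r.get(k) == v for k, v in filters.items())]
    # Order the remaining records by the chosen attribute (ascending).
    if sort is not None:
        rows = sorted(rows, key=lambda r: r[sort])
    # Reduce each record to the value of a single attribute.
    if pluck is not None:
        return [r[pluck] for r in rows]
    return rows


names = query(collections, state="ACTIVE", pluck="name", sort="created_date")
print(names)  # ['Collection C', 'Collection B']
```

The inactive collection is filtered out first, the two active ones are ordered by creation date, and only their names remain.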

Response limits

Archive-It Partner API calls are limited to 100 results by default. You may specify other limits or remove limits entirely.

For instance, an Archive-It Partner API call for all seed URLs in an account, in the order that they were created, will return no more than the first 100 results when constructed as:

https://partner.archive-it.org/api/seed?sort=created_date&pluck=url

It will return all of the matching results when the limit value is set to -1:

https://partner.archive-it.org/api/seed?sort=created_date&pluck=url&limit=-1

And it may be restricted to return only the first five seeds when the limit is set to 5:

https://partner.archive-it.org/api/seed?sort=created_date&pluck=url&limit=5
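The three variants above differ only in the limit parameter, so a script can treat it as an optional argument. In this sketch `seed_urls_query` is a hypothetical helper that only builds the URL: leaving the limit unset relies on the default cap of 100 described above, and -1 requests all matching results.

```python
from typing import Optional
from urllib.parse import urlencode

# Seed table URL as used in this article.
SEED_TABLE = "https://partner.archive-it.org/api/seed"


def seed_urls_query(limit: Optional[int] = None) -> str:
    """Build the seed URL query, optionally overriding the default limit."""
    params = {"sort": "created_date", "pluck": "url"}
    if limit is not None:
        # -1 removes the cap; any positive number sets a new cap.
        params["limit"] = str(limit)
    return f"{SEED_TABLE}?{urlencode(params)}"


print(seed_urls_query())    # default: capped at 100 results
print(seed_urls_query(-1))  # all matching results
print(seed_urls_query(5))   # first five results only
```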
