Background

Archive-It’s Wayback CDX is the index of all archived content that the Wayback browsing interface uses to look up and serve the specific captures requested by an end user, such as from the Wayback calendar page. The index format is known as 'CDX' and contains various fields that describe each record, sorted by URL and date. The index's server responds to GET queries and returns the plain text CDX data. The CDX server is deployed as part of the wayback.archive-it.org Wayback browsing interface. It was derived from the CDX server deployed for the general archive at web.archive.org, which is part of the open-source Wayback Machine software: https://github.com/internetarchive/wayback.

For more information on the general CDX file format, see: http://archive.org/web/researcher/cdx_file_format.php

Why "CDX/C"?

Unlike the global Wayback index at archive.org, the CDX/C API enables querying of archived data by collection, meaning that a user may query it to discover records of captures within one of their own, another Archive-It partner’s, or all Archive-It partners’ collections.

Use cases

Querying Archive-It data with the CDX/C API is a way to discover whether, and to what extent, web content has been archived by Archive-It partners. Partners can use the API to find out if and when specific documents were archived and to locate that data in its WARC file storage, among other things. They may also find and filter by various other capture attributes in order to analyze the extent and nature of their collecting of any specified documents or hosts.

To see the CDX/C API in practice, read the Archive-It blog guest post by partner Greg Wiedeman of the University at Albany, SUNY: A Sustainable, Large-Scale, Minimal Approach to Accessing Web Archives. He uses the CDX/C API to dynamically query the index for records to reference in finding aids for collections in which websites are captured on a regular and ongoing basis.

How it works

The CDX/C is effectively a table of plain text data. Each line (“record”) indicates a crawled document. For instance, the first record for the query: https://wayback.archive-it.org/8232/timemap/cdx?url=https://twitter.com/internetarchive/ appears as:

CDX Attributes

The publicly available attributes in the CDX/C index are:

Attribute	Explanation	Example
urlkey	the document captured, expressed as a SURT	com,twitter)/internetarchive
timestamp	time at which the document was captured	20161206224935
original	the document captured, expressed as a URL	https://twitter.com/internetarchive/
mimetype	the document’s file type	text/html; warc/revisit [if de-duplicated]
status code	HTTP response code for the document at the time of its crawling	200; 302; 404
digest	the unique, Base32-encoded SHA-1 checksum value for the document, to distinguish it from others	L5DWB6VD575XTO5QPCCKE7KEQXG4GQ56
-	[deprecated field]	-
flags	indicates presence of login prompt or other crawler obstruction	- [site not password protected]; P [site is password protected]; F [no follow directive]; I [no index directive]; A [no archive]
length	the number of bytes that the document occupies within its WARC file	43776
offset	byte start point for the document in its WARC file	424203786
file name	the name of the WARC file in which the queried data is stored	ARCHIVEIT-8232-CRAWL_SELECTED_SEEDS-JOB253036-20161206224436214-00002.warc.gz
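
To make the field order concrete, here is a minimal sketch that splits one record into named fields. It assumes that records are space-delimited, as in the general CDX file format, and that the fields arrive in the order of the table above. The function name, the "deprecated" label for the unused column, and the sample record (assembled from the Example column, with "-" assumed for the flags value) are illustrative only.

```python
# Field order follows the attribute table above. The key names, including
# "deprecated" for the unused column, are choices made for this sketch.
CDX_FIELDS = [
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "deprecated", "flags", "length", "offset", "filename",
]


def parse_cdx_line(line, fields=CDX_FIELDS):
    """Split one space-delimited CDX record into a dict keyed by field name."""
    values = line.strip().split(" ")
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
    return dict(zip(fields, values))


# Sample record assembled from the Example column of the table above
# (the "-" flags value is an assumption made for this sketch).
sample = (
    "com,twitter)/internetarchive 20161206224935 "
    "https://twitter.com/internetarchive/ text/html 200 "
    "L5DWB6VD575XTO5QPCCKE7KEQXG4GQ56 - - 43776 424203786 "
    "ARCHIVEIT-8232-CRAWL_SELECTED_SEEDS-JOB253036-20161206224436214-00002.warc.gz"
)

record = parse_cdx_line(sample)
print(record["timestamp"], record["filename"])
```

All values are kept as strings; convert length and offset with int() if you need to do arithmetic on them.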

Basic use

Access the CDX/C API by clicking on the green CDX button on the Calendar page for an archived document.

Note: The calendar page collapses captures taken within minutes of each other under a single timestamp. This is why you may see more records in CDX than captures on the Wayback calendar page.

Alternatively, CDX/C API queries may be made by curl command or in a web browser. Using either method, the endpoint is: http://wayback.archive-it.org/all/timemap/cdx. The only required parameter for the CDX server is the url= parameter. Here is an example:

http://wayback.archive-it.org/all/timemap/cdx?url=archive-it.org

The above query will return a portion of Archive-It’s CDX index, one capture per row, for each capture of the URL "archive-it.org" that is available in Archive-It Wayback. To query a different document, change the value of the url= parameter at the end of the query.

⚠️ Queries to the /all CDX endpoint are temporarily blocked due to DDoS activity. All other queries work as expected. Let us know if this impacts you.

Query by collection

To limit your queries to results from a single Archive-It collection, replace ‘all’ in the endpoint’s path with the desired collection’s ID number:

http://wayback.archive-it.org/4399/timemap/cdx?url=archive-it.org

The above example query will return only the portion of the CDX index that includes captures of archive-it.org from Archive-It collection #4399. Please note that for private collections, you have to include the collection ID in order to see results.
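
For scripted use, the same queries can be built and sent from code. The following is a minimal sketch using only the Python standard library; the function names, the timeout value, and the assumption that the response is UTF-8 text are illustrative and are not part of the API.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://wayback.archive-it.org"


def cdx_query_url(url, collection="all", **params):
    """Build a CDX/C query URL for one collection ID, or for "all"."""
    # Keep ",", ":" and "/" unescaped so the URL reads like the examples above.
    query = urlencode({"url": url, **params}, safe=",:/")
    return f"{BASE}/{collection}/timemap/cdx?{query}"


def fetch_cdx(url, collection="all", timeout=30, **params):
    """Send the GET query and return the plain text response as a list of lines."""
    with urlopen(cdx_query_url(url, collection, **params), timeout=timeout) as response:
        # Assumption: the response body is UTF-8 encoded text.
        return response.read().decode("utf-8").splitlines()


# Building a URL makes no network request.
print(cdx_query_url("archive-it.org", collection=4399))
```

Calling fetch_cdx("archive-it.org", collection=4399) would then return one string per capture.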

Advanced use

Filter and order fields

It is possible to customize the fields returned from the CDX server using the fl= parameter. Pass in a comma-separated list of the desired fields and only those fields will be returned, in the order in which you list them. The following example returns only the timestamp and filename fields for captures of archive-it.org in collection #4399:

http://wayback.archive-it.org/4399/timemap/cdx?url=archive-it.org&fl=timestamp,filename
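
Because fl= also sets the column order, the same list of field names can be reused to label the returned columns. This is a minimal sketch under the assumption that rows are space-delimited; the function names are illustrative, and the sample response body is a placeholder that borrows values from the attribute table rather than real output for this query.

```python
from urllib.parse import urlencode


def cdx_fields_url(url, collection, fields):
    """Build a query URL that requests only the given fields, in order."""
    query = urlencode({"url": url, "fl": ",".join(fields)}, safe=",")
    return f"http://wayback.archive-it.org/{collection}/timemap/cdx?{query}"


def parse_rows(text, fields):
    """Label each space-delimited row with the field names that were requested."""
    return [
        dict(zip(fields, line.split(" ")))
        for line in text.splitlines()
        if line.strip()
    ]


fields = ["timestamp", "filename"]
print(cdx_fields_url("archive-it.org", 4399, fields))

# Placeholder response body (values borrowed from the attribute table),
# used here only to show the parsing step.
sample_body = (
    "20161206224935 "
    "ARCHIVEIT-8232-CRAWL_SELECTED_SEEDS-JOB253036-20161206224436214-00002.warc.gz\n"
)
rows = parse_rows(sample_body, fields)
print(rows[0]["timestamp"])
```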

Filter by date range

Results may be filtered by timestamp using the from= and to= parameters. The ranges are inclusive and are specified in the same 1- to 14-digit format used for Wayback captures: yyyyMMddhhmmss. The following example returns all captures of archive-it.org in the CDX index, across all collections, that were made between 2010 and 2011:

http://wayback.archive-it.org/all/timemap/cdx?url=archive-it.org&from=2010&to=2011
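
A short sketch of how a script might build a date-range query and interpret the timestamps it gets back. The function names and the start/end argument names are illustrative ("from" is a reserved word in Python, so it cannot be used as an argument name); the validation simply enforces the 1- to 14-digit format described above.

```python
import re
from datetime import datetime
from urllib.parse import urlencode

TIMESTAMP_RE = re.compile(r"\d{1,14}")


def cdx_range_url(url, collection="all", start=None, end=None):
    """Build a query URL with optional from= and to= timestamp bounds."""
    params = {"url": url}
    for name, value in (("from", start), ("to", end)):
        if value is None:
            continue
        value = str(value)
        if not TIMESTAMP_RE.fullmatch(value):
            raise ValueError(f"{name} must be 1 to 14 digits, got {value!r}")
        params[name] = value
    return f"http://wayback.archive-it.org/{collection}/timemap/cdx?{urlencode(params)}"


def parse_timestamp(timestamp):
    """Convert a full 14-digit capture timestamp (yyyyMMddhhmmss) to a datetime."""
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


print(cdx_range_url("archive-it.org", start=2010, end=2011))
print(parse_timestamp("20161206224935").isoformat())
```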

Collapse results

One useful form of filtering query results is the option to 'collapse' them based on the value of a field, or of a substring of a field. Collapsing compares adjacent CDX lines: within each run of consecutive lines that share the same value, only the first line is kept and the duplicates that follow are filtered out. This is useful for thinning out captures that are 'too dense' or for finding only unique captures.

To use collapsing, add one or more collapse=field or collapse=field:N parameters, where N is the number of leading characters of the field to compare. In the following example, results are collapsed to show only one result per hour of capture time, by comparing the first 10 digits of the yyyyMMddhhmmss timestamp field:

http://wayback.archive-it.org/4399/timemap/cdx?url=archive-it.org&collapse=timestamp:10

The calendar pages at wayback.archive-it.org use this filter by default.
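
The adjacency rule can be illustrated locally. The sketch below imitates the collapsing behavior described above on a small list of records; it is not the server's implementation, the function name is illustrative, and all but the first of the sample timestamps are invented for this example.

```python
from itertools import groupby


def collapse(records, field, n=None):
    """Keep only the first record of each run of adjacent records whose
    `field` value (or its first `n` characters) is the same."""
    def key(record):
        value = record[field]
        return value if n is None else value[:n]

    # groupby only groups consecutive items, which matches the adjacency rule.
    return [next(group) for _, group in groupby(records, key=key)]


# Hypothetical captures, already sorted by timestamp as in the index.
captures = [
    {"timestamp": "20161206224935"},
    {"timestamp": "20161206225901"},  # same hour as the previous capture
    {"timestamp": "20161206230210"},
    {"timestamp": "20161207224500"},
]

hourly = collapse(captures, "timestamp", 10)
print([capture["timestamp"] for capture in hourly])
```

With N set to 10, the second capture is dropped because it shares its first 10 digits with the capture before it. Since only adjacent lines are compared, records with equal values that are separated by a different value are both kept.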

More advanced options

Partners are welcome to experiment with the further advanced querying options described in the general Wayback CDX API documentation here: https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server#advanced-usage
