Archive-It’s Wayback CDX is the index of all archived content that the Wayback browsing interface uses to look up and serve the specific captures requested by an end user, such as from the Wayback calendar page. The index format is known as 'CDX' and contains various fields that describe each record, sorted by URL and date. The index's server responds to GET queries and returns the plain text CDX data. The CDX server is deployed as part of the wayback.archive-it.org Wayback browsing interface and was derived from the CDX server deployed for the general archive at web.archive.org, as part of the open-source Wayback Machine software: https://github.com/internetarchive/wayback.
For more information on the general CDX file format, see: http://archive.org/web/researcher/cdx_file_format.php
Unlike the global Wayback index at archive.org, the CDX/C API enables querying of archived data by collection, meaning that a user may query it to discover records of captures within one of their own, another Archive-It partner’s, or all Archive-It partners’ collections.
Using the CDX/C API to query Archive-It data is a quick and easy way to discover whether, and to what extent, web content has been archived by Archive-It partners. Partners can use the API to find out if and when specific documents were archived, and to locate that data in its WARC file storage, among other things. They may also find and filter by various other capture attributes in order to analyze the extent and nature of their collecting of any specified documents or hosts.
Partner Greg Wiedeman of the University at Albany, SUNY, uses the CDX/C API to dynamically query the index for records to reference in finding aids for collections in which websites are captured on a regular and ongoing basis. To see how, read his guest post on the Archive-It blog: A Sustainable, Large-Scale, Minimal Approach to Accessing Web Archives.
How it works
The CDX/C is effectively a table of plain text data. Each line (“record”) indicates a crawled document. For instance, the first record for the query: https://wayback.archive-it.org/8232/timemap/cdx?url=https://twitter.com/internetarchive/ appears as:
The table below describes the attributes of this record. The publicly available attributes in the CDX/C index are currently as follows, in the order in which they appear by default:
|Attribute||Description||Example|
|urlkey||the document captured, expressed as a SURT||com,twitter)/internetarchive|
|timestamp||time at which the document was captured||20161206224935|
|original||the document captured, expressed as a URL||https://twitter.com/internetarchive/|
|mimetype||the document’s file type|| |
|status code||HTTP response code for the document at the time of its crawling|| |
|digest||the unique, Base32-encoded SHA-1 checksum value for the document, to distinguish it from others||L5DWB6VD575XTO5QPCCKE7KEQXG4GQ56|
|login||indicates whether or not the crawler logged in with credentials; also notes robots blocks discovered in source code||- [no log in]|
|length||the document’s size in bytes within its WARC file||43776|
|offset||byte start point for the document in its WARC file||424203786|
|file name||the name of the WARC file in which the queried data is stored||ARCHIVEIT-8232-CRAWL_SELECTED_SEEDS-JOB253036-20161206224436214-00002.warc.gz|
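As a minimal sketch of how a record like this could be read programmatically, the Python snippet below splits one line into named fields. It assumes that fields are separated by single spaces and arrive in the order listed in the table; the key names (such as `statuscode`) and the helper `parse_cdx_line` are illustrative choices, not part of the API. The sample line is assembled from the example values above, with `text/html` and `200` used as placeholders because the table gives no example for those two fields.

```python
# Field names follow the table above. The order and the single-space
# separator are assumptions about the default response layout.
DEFAULT_FIELDS = [
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "login", "length", "offset", "filename",
]


def parse_cdx_line(line, fields=DEFAULT_FIELDS):
    """Split one CDX line into a dict keyed by field name."""
    values = line.strip().split(" ")
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
    record = dict(zip(fields, values))
    # Byte counts are easier to work with as integers.
    for key in ("length", "offset"):
        if key in record and record[key].isdigit():
            record[key] = int(record[key])
    return record


# Sample line built from the table's example values; "text/html" and "200"
# are placeholders, not values taken from the real record.
sample = (
    "com,twitter)/internetarchive 20161206224935 "
    "https://twitter.com/internetarchive/ text/html 200 "
    "L5DWB6VD575XTO5QPCCKE7KEQXG4GQ56 - 43776 424203786 "
    "ARCHIVEIT-8232-CRAWL_SELECTED_SEEDS-JOB253036-20161206224436214-00002.warc.gz"
)
record = parse_cdx_line(sample)
print(record["timestamp"], record["offset"], record["filename"])
```

If the server returns a different number of columns than assumed here, the function raises a `ValueError` rather than silently mislabeling fields, which makes it easy to adjust `DEFAULT_FIELDS` to match the actual output.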
CDX/C API queries may be made by curl command or in a web browser. Using either method, the endpoint is: http://wayback.archive-it.org/all/timemap/cdx. The only required parameter for the CDX server is the url= parameter. Therefore, the simplest query takes the following form: http://wayback.archive-it.org/all/timemap/cdx?url=archive-it.org
The above query will return a portion of Archive-It’s CDX index, one capture per row, for each capture of the URL "archive-it.org" that is available in Archive-It Wayback.
Query by collection
To restrict your queries to results from a single Archive-It collection, replace ‘all’ in the endpoint’s path with the desired collection’s ID number: http://wayback.archive-it.org/4399/timemap/cdx?url=archive-it.org
The above example query will return only the portion of the CDX index that includes captures of archive-it.org from Archive-It collection #4399.
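The two query forms described so far, across all collections and within a single collection, can be sketched as one small Python helper. The function name `cdx_query_url` and its arguments are hypothetical; only the endpoint and the url= parameter come from the text above.

```python
from urllib.parse import urlencode

BASE = "http://wayback.archive-it.org"


def cdx_query_url(url, collection="all", params=None):
    """Build a CDX/C query URL for all collections or for one collection ID."""
    query = {"url": url}
    if params:
        query.update(params)
    # Keep ':', '/' and ',' readable instead of percent-encoding them.
    return f"{BASE}/{collection}/timemap/cdx?{urlencode(query, safe=':/,')}"


print(cdx_query_url("archive-it.org"))
print(cdx_query_url("archive-it.org", collection=4399))
```

The resulting strings can be pasted into a browser or passed to curl, as described above.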
Filter and order fields
It is possible to customize the fields returned from the CDX server using the fl= parameter. Pass in a comma-separated list of the desired fields, and only those fields will be returned, in the order in which you list them. The following example, for instance, returns only the timestamp and filename fields for captures of archive-it.org in collection #4399: http://wayback.archive-it.org/4399/timemap/cdx?url=archive-it.org&fl=timestamp,filename
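Because fl= fixes both which fields come back and their order, the same list can be reused to label each column of the response. The sketch below assumes space-separated columns and a UTF-8 response; `parse_rows` and `fetch_rows` are hypothetical helper names, and the two-line sample text is made up purely to demonstrate the parsing step without a network call.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def parse_rows(text, fields):
    """Label each space-separated column using the fl= field list."""
    rows = []
    for line in text.splitlines():
        if line.strip():
            rows.append(dict(zip(fields, line.split(" "))))
    return rows


def fetch_rows(url, fields, collection="all"):
    """Query the CDX/C endpoint for the given fields and parse the reply."""
    query = urlencode({"url": url, "fl": ",".join(fields)}, safe=",")
    endpoint = f"http://wayback.archive-it.org/{collection}/timemap/cdx?{query}"
    with urlopen(endpoint, timeout=30) as response:
        return parse_rows(response.read().decode("utf-8"), fields)


# Offline demonstration with a made-up response (not real index data).
sample_text = (
    "20161206224935 EXAMPLE-00002.warc.gz\n"
    "20170101000000 EXAMPLE-00003.warc.gz\n"
)
rows = parse_rows(sample_text, ["timestamp", "filename"])
print(rows[0]["timestamp"], rows[0]["filename"])
```

A live call would look like `fetch_rows("archive-it.org", ["timestamp", "filename"], collection=4399)`.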
Filter by date range
Results may be filtered by timestamp using from= and to= parameters. The ranges are inclusive and are specified in the same 1 to 14 digit format used for Wayback captures: yyyyMMddhhmmss. The following example, for instance, returns all captures of archive-it.org in the CDX index, across all collections, that were captured between 2010 and 2011: http://wayback.archive-it.org/all/timemap/cdx?url=archive-it.org&from=2010&to=2011
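When the range bounds come from a program rather than being typed by hand, it can be convenient to generate full 14-digit timestamps, which leaves no ambiguity about how a shorter value would be interpreted. The following Python sketch converts between `datetime` objects and the yyyyMMddhhmmss format and builds a from=/to= query; the helper names are hypothetical.

```python
from datetime import datetime
from urllib.parse import urlencode

WAYBACK_FORMAT = "%Y%m%d%H%M%S"  # yyyyMMddhhmmss


def to_wayback_timestamp(moment):
    """Format a datetime as a 14-digit Wayback timestamp."""
    return moment.strftime(WAYBACK_FORMAT)


def from_wayback_timestamp(timestamp):
    """Parse a 14-digit Wayback timestamp back into a datetime."""
    return datetime.strptime(timestamp, WAYBACK_FORMAT)


def date_range_url(url, start, end, collection="all"):
    """Build a query limited to captures between start and end, inclusive."""
    query = urlencode({
        "url": url,
        "from": to_wayback_timestamp(start),
        "to": to_wayback_timestamp(end),
    })
    return f"http://wayback.archive-it.org/{collection}/timemap/cdx?{query}"


start = datetime(2010, 1, 1)
end = datetime(2011, 12, 31, 23, 59, 59)
print(date_range_url("archive-it.org", start, end))
```

`from_wayback_timestamp` is also useful in the other direction, for turning the timestamp field of a returned record into a value that can be compared or sorted as a date.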
One useful form of filtering query results is the option to 'collapse' them based on the value of a field, or of a substring of a field. Collapsing operates on adjacent CDX lines: when consecutive lines share the same value, every capture after the first is filtered out as a duplicate. This is useful for thinning out captures that are 'too dense' or for finding only unique captures.
To use collapsing, add one or more collapse=field or collapse=field:N parameters, where N is the number of leading characters of the field to compare. In the following example, for instance, results are collapsed to show only one result per hour of capture time (by comparing the first 10 digits of the yyyyMMddhhmmss timestamp field): http://wayback.archive-it.org/4399/timemap/cdx?url=archive-it.org&collapse=timestamp:10
The calendar pages at wayback.archive-it.org use this filter by default.
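To make the collapsing rule concrete, the Python sketch below reproduces it locally on a handful of made-up timestamps. It is an illustration of the behavior described above (keep the first of each run of adjacent rows whose field prefix matches), not the server's actual implementation; the function name `collapse` and the sample rows are hypothetical.

```python
from itertools import groupby


def collapse(rows, field, length=None):
    """Keep only the first row of each run of adjacent rows that share
    the same value (or the same first `length` characters) in `field`."""
    def key(row):
        value = row[field]
        return value if length is None else value[:length]

    # groupby only groups *adjacent* items, which mirrors the rule above.
    return [next(group) for _, group in groupby(rows, key=key)]


# Made-up captures, already sorted by timestamp as in the CDX index.
rows = [
    {"timestamp": "20161206224935"},
    {"timestamp": "20161206225901"},  # same hour as the previous row
    {"timestamp": "20161206230012"},
    {"timestamp": "20161207224935"},
]
hourly = collapse(rows, "timestamp", 10)
print([row["timestamp"] for row in hourly])
```

Here the second row is dropped because its first 10 digits (2016120622) match those of the row immediately before it, while the remaining rows each begin a new hour.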
More advanced options
Partners are welcome to experiment with the further advanced querying options described in the general Wayback CDX API documentation here: https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server#advanced-usage