How (and why) to use Wayback's back-end index
Ever wonder how the Wayback interface finds the right archival web material to surface on calendar pages, or how it directs link traffic to the right capture and date? It all has to do with the “CDX”: an index and file format specific to web archiving that you can think of as Wayback’s switchboard. To get the most out of this resource, which is available to all Archive-It partners and users of web archives, see the new Help Center guide: Access Archive-It's Wayback index with the CDX/C API
Like an index at the back of a book, this Archive-It index can be used to answer “where” and “when” questions that are increasingly common among web archivists. Want to know how many times and precisely when a given web document or host was archived in your collection? Or by all Archive-It partners? Want to know exactly how many new or duplicate captures have been made without digging through all of those crawl reports? Or how about finding the actual bytes that constitute an archived document precisely where they reside in WARC storage?
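To make those questions concrete, here is a minimal Python sketch that answers them from a handful of index lines. It assumes the common 11-field, space-delimited CDX layout (the one announced by a `CDX N b a m s k r M S V g` header line); the output of the CDX/C API may differ, so check the Help Center guide for the fields it actually returns. The sample lines, digests, and WARC filenames are made up for illustration, and treating a repeated digest as a "duplicate" capture is a simplification of my own, not necessarily how crawl reports count them.

```python
from collections import namedtuple

# Assumed field order for the common 11-field CDX layout
# ("CDX N b a m s k r M S V g"). Real index output may differ.
FIELDS = ["urlkey", "timestamp", "original", "mimetype", "statuscode",
          "digest", "redirect", "metatags", "length", "offset", "filename"]
Capture = namedtuple("Capture", FIELDS)

# Hypothetical sample data: three captures of one URL, two with the same digest.
SAMPLE = """\
 CDX N b a m s k r M S V g
org,example)/ 20190101000000 http://example.org/ text/html 200 AAAA - - 1200 345 EXAMPLE-1.warc.gz
org,example)/ 20190201000000 http://example.org/ text/html 200 AAAA - - 1200 9000 EXAMPLE-2.warc.gz
org,example)/ 20190301000000 http://example.org/ text/html 200 BBBB - - 1300 512 EXAMPLE-3.warc.gz
"""


def parse_cdx(text):
    """Parse space-delimited CDX lines, skipping the header and malformed rows."""
    captures = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("CDX"):
            continue  # blank line or format header
        parts = line.split(" ")
        if len(parts) != len(FIELDS):
            continue  # not the layout this sketch assumes
        captures.append(Capture(*parts))
    return captures


def summarize(captures):
    """Answer "how many times, and when?" plus a new-vs-duplicate count."""
    seen = set()
    new = duplicate = 0
    for c in sorted(captures, key=lambda c: c.timestamp):
        key = (c.urlkey, c.digest)
        if key in seen:
            duplicate += 1  # same content digest as an earlier capture
        else:
            seen.add(key)
            new += 1
    return {
        "total": len(captures),
        "new": new,
        "duplicate": duplicate,
        "timestamps": sorted(c.timestamp for c in captures),
    }


def warc_location(capture):
    """Answer "where?": the WARC file, byte offset, and record length."""
    return capture.filename, int(capture.offset), int(capture.length)


if __name__ == "__main__":
    captures = parse_cdx(SAMPLE)
    print(summarize(captures))
    print(warc_location(captures[0]))
```

The same three functions work on any text in that layout, so you could feed them lines saved from an index query instead of the sample above.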
Put to these kinds of use, the index supports both quality assurance and access. My fellow web archivists and I often use it ourselves to understand whether, and to what degree, desired web content was archived over the course of many different crawls. Archive-It partners, meanwhile, have begun to use it as a reference point for regularly updated, growing collections. In other words, the same index that populates the Wayback calendar page can enrich your finding aids or catalog records as well. For an example experiment in this approach, see Greg Wiedeman’s guest post on the Archive-It blog: A Sustainable, Large-Scale, Minimal Approach to Accessing Web Archives.
We’ve barely scratched the surface of possibilities. We have some enhancements for the CDX/C in mind for the near future. If you have use cases, questions, or ideas, share them with all of us below!