Explaining "waybackfill" to users

Comments

3 comments

  • Alex Thurman

    Hi Sarah

    At Columbia we implemented a Waybackfill last year. We mentioned it in a Libraries blog post:

    https://blogs.cul.columbia.edu/rbml/2022/01/19/now-available-columbia-university-web-archives-1996-2010/

    And we added the following sentence at the end of the collection description on the collection access page:

    "In January 2022, the collection was supplemented with a Waybackfill of all the archived data from the columbia.edu domain between 1996 and 2009 available in the Internet Archive's global Wayback Machine, in order to provide collection users with seamless access to a fuller range of archived Columbia web content."

    We didn't add specific metadata fields/notes regarding the Waybackfill aspect, but are in the process of identifying obsolete URLs captured in the Waybackfill load in order to include them in our public seed list for direct access.

    --Alex

  • Sarah Weeks

    Hi Alex, fantastic blog post! I love how friendly, inviting, and explanatory it is. Thanks for sharing.

    You mention that you're identifying obsolete URLs. We submitted a spreadsheet to Archive-It listing the URLs we were interested in backfilling. Was your process different?

    Actually, I believe we've found a few backfilled URLs that were not on that list. How are you finding your obsolete URLs? With a big university website, there are bound to be lots of little-known subdomains. I don't know of a systematic way to discover them.

  • Alex Thurman

    Our Waybackfill was very broad and simple: all URLs from any columbia.edu subdomain present in the global Wayback Machine for 1996 to June 2010. 

    Since we started archiving the columbia.edu domain using Archive-It in 2010, we've gradually added hundreds of seed URLs for specific subdomains within columbia.edu (or non-columbia.edu URLs from affiliated institutes etc. with vanity URLs). Now, as we browse the pre-2010 archived content, whenever we see URLs that are not already in our seed list (because they are obsolete), we add them to the public seed list for access purposes only, not for further crawling. I'm particularly interested in surfacing archived student publications that were discontinued, changes in department names/structures, obsolete centers & institutes, etc.
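
Editor's sketch (not part of Alex's reply): the check described above, spotting archived URLs whose host is missing from the seed list, amounts to a set comparison. The Python below is a minimal illustration under stated assumptions. The function names, the `example.edu` domain, and the sample URLs are all hypothetical, and the sketch assumes you already have a plain list of captured URLs and a plain list of seed URLs from whatever export is available to you. It compares hostnames only, which is a simplification, since real seeds may be scoped to a path.

```python
from urllib.parse import urlsplit


def hostname(url: str) -> str:
    """Return the lowercased hostname of a URL, tolerating a missing scheme."""
    if "://" not in url:
        url = "http://" + url
    return (urlsplit(url).hostname or "").lower()


def uncovered_hosts(captured_urls, seed_urls, domain):
    """Hostnames under `domain` seen in captures but absent from the seed list."""
    seed_hosts = {hostname(u) for u in seed_urls}
    found = set()
    for u in captured_urls:
        h = hostname(u)
        in_domain = h == domain or h.endswith("." + domain)
        if in_domain and h not in seed_hosts:
            found.add(h)
    return sorted(found)


# Hypothetical sample data, for illustration only.
captured = [
    "http://www.example.edu/news/",
    "http://oldcenter.example.edu/about.html",
    "oldcenter.example.edu/staff.html",
    "http://example.org/page",
]
seeds = ["https://www.example.edu/"]

print(uncovered_hosts(captured, seeds, "example.edu"))
```

Each hostname it reports is only a candidate. Someone would still need to review it before adding it to a public seed list as an access-only entry.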
