Explaining "waybackfill" to users

October 25, 2022 14:10

Hello all - we just had waybackfilling done (Archive-It explains it here, if you're unfamiliar). We were thinking of creating a metadata field on the seeds affected, and also a statement to add to the finding aid for our websites collection. If you've had this done, how have you identified that in your collection, and how have you conveyed the news to users? Thanks!

Comments

3 comments

Alex Thurman October 25, 2022 14:29

Hi Sarah

At Columbia we implemented a Waybackfill last year. We mentioned it in a Libraries blog post:

https://blogs.cul.columbia.edu/rbml/2022/01/19/now-available-columbia-university-web-archives-1996-2010/

And we added the following sentence at the end of the collection description on the collection access page:

"In January 2022, the collection was supplemented with a Waybackfill of all the archived data from the columbia.edu domain between 1996 and 2009 available in the Internet Archive's global Wayback Machine, in order to provide collection users with seamless access to a fuller range of archived Columbia web content."

We didn't add specific metadata fields/notes regarding the Waybackfill aspect, but are in the process of identifying obsolete URLs captured in the Waybackfill load in order to include them in our public seed list for direct access.

--Alex

Sarah Weeks October 25, 2022 16:28

Hi Alex, Fantastic blog post! I love how friendly, inviting, and explanatory it is. Thanks for sharing.

When you say you're identifying obsolete URLs: we submitted a spreadsheet of URLs to A-I that we were interested in backfilling. Was your process different?

Actually... I believe we have found a few backfilled URLs that were not on that list. How are you finding your obsolete URLs? With a big university website, there are bound to be lots of little-known subdomains. I don't know of a systematic way to discover them.

Alex Thurman October 25, 2022 16:48

Our Waybackfill was very broad and simple: all URLs from any columbia.edu subdomain present in the global Wayback Machine for 1996 to June 2010.

Since we started archiving the columbia.edu domain using Archive-It in 2010, we've gradually added hundreds of seed URLs of specific subdomains within columbia.edu (or non-columbia.edu URLs from affiliated institutes etc. with vanity URLs). As we browse around now in the pre-2010 archived content, if we see URLs not already in our seed list (because they are obsolete), we add them to the public seed list just for access purposes, not for further crawling. I'm particularly interested in surfacing archived student publications that were discontinued, changes in department names/structures, obsolete centers & institutes, etc.
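
For anyone wanting something more systematic than browsing, the comparison described above could be roughly sketched in code. This is only an illustration, not the workflow used at Columbia: it assumes you already have a plain list of archived URLs (for example, from an export of your own) and a copy of your seed list. The function names and the sample hostnames are hypothetical.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercase hostname of a URL, without a leading 'www.'."""
    # Seed lists sometimes omit the scheme; add one so urlsplit can parse the host.
    if "://" not in url:
        url = "http://" + url
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host


def uncovered_hosts(archived_urls, seed_urls, domain="columbia.edu"):
    """List hosts under `domain` that appear in archived URLs but in no seed."""
    seed_hosts = {host_of(u) for u in seed_urls}
    found = set()
    for url in archived_urls:
        host = host_of(url)
        in_domain = host == domain or host.endswith("." + domain)
        if in_domain and host not in seed_hosts:
            found.add(host)
    return sorted(found)


# Hypothetical sample data, for illustration only.
archived = [
    "http://www.columbia.edu:80/some/old/page.html",
    "http://www.cs.columbia.edu/",
    "http://oldcenter.columbia.edu/about.html",
    "http://example.org/",
]
seeds = ["https://www.cs.columbia.edu/", "columbia.edu"]

candidates = uncovered_hosts(archived, seeds)
print(candidates)
```

The resulting hosts are only candidates: each one would still need a manual look before being added to a public seed list for access purposes.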
