Content files from old website
Hi,
I am working with our campus's Strategic Communications unit, which has held on to old website files (circa 2009-2012; HTML, PHP, video files, CSS) and is asking if Special Collections can crawl the website they were a part of. They are willing to put these files up on the live web so that we can capture them with Archive-It, but the problem is that they would override our current web presence (this is the main website for our campus, uci.edu). They are willing to change the URL to something like archive.uci.edu so that this is not an issue.
My questions are:
1. Can Archive-It crawl semi-public websites (campus VPN, or would they have to "punch a hole in router to allow Archive-It to pass through")?
2. Once captured, can we manually change crawl dates? (My guess is no, and I talked to them about adding metadata to help reflect the time these websites were created.)
-
Hi Elvia,
Good question. This situation has come up on occasion and yes we have "re-staged" original website files back on a live web server and then crawled them, though it does have some impact on the metadata in the WARC files, as you note. This, uh, "zombie crawl" does facilitate easier access via Wayback as well as integration into existing web archive collections even with these metadata issues. Specific answers:
- Yes, staging them on a temporary subdomain is a good approach.
1. Yes, depending on the technicals of "semi-public." VPN, no. Router hole punching, er, maybe. Archive-It crawlers can get behind basic username/password credentialing systems as discussed in https://support.archive-it.org/hc/en-us/articles/208001306-Archiving-password-protected-sites but it depends on the system. Maybe submit a support ticket and a Web Archivist can give additional guidance.
2. We would not recommend changing crawl dates in the WARC files, as they are considered immutable archival records. "Spoofing" the crawl dates on the Wayback calendar page to reflect the original live date may be possible, but we would have to look into that. In either case, we would suggest adding metadata to the seed/captures/both to denote the unique re-crawl origin and note provenance information on the original files. You may also want to make the original files available for direct download/access in a zip file or dataset or something similar as well in order to supplement the "web" captures. Hope that helps! Someone can assist further via a support ticket if you decide to move forward.
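As a rough illustration of the kind of provenance note suggested in point 2, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `SeedProvenance` class and its field names are hypothetical and are not an Archive-It schema or API, the URLs are examples based on the ones mentioned in this thread, and the crawl date is a placeholder.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SeedProvenance:
    """Hypothetical provenance note for a re-staged crawl (not an Archive-It schema)."""
    seed_url: str        # temporary subdomain the old files were staged on
    original_url: str    # where the site originally lived
    original_years: str  # period when the content was actually live
    crawl_date: date     # when the re-staged copy was captured

    def description(self) -> str:
        # Build a free-text note that could be pasted into a seed's
        # description field so users see the capture date is not the
        # original publication date.
        return (
            f"Re-staged copy of {self.original_url}, originally live "
            f"{self.original_years}. Files were restored to {self.seed_url} "
            f"and crawled on {self.crawl_date.isoformat()}; the capture date "
            f"does not reflect the original publication date."
        )


record = SeedProvenance(
    seed_url="https://archive.uci.edu/",  # example subdomain from the question
    original_url="https://uci.edu/",
    original_years="2009-2012",
    crawl_date=date(2015, 6, 1),          # placeholder, not the real crawl date
)
print(record.description())
```

The resulting text would be entered by hand as seed-level or collection-level metadata; the WARC records themselves stay untouched.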
- Jefferson
-
Hi Jefferson,
I just wanted to close the loop on this conversation. We were able to work with the campus unit, who put the files on a new, temporary subdomain, which allowed us to crawl the websites. We added metadata to reflect the original date: https://archive-it.org/collections/5613?q=communications.uci.edu&page=1&show=Sites
Thank you,
Elvia