Content files from old website

Comments

3 comments

  • Jefferson Bailey

    Hi Elvia,

    Good question. This situation has come up on occasion, and yes, we have "re-staged" original website files on a live web server and then crawled them, though it does have some impact on the metadata in the WARC files, as you note. Even with these metadata issues, this, uh, "zombie crawl" makes access via Wayback easier and lets the captures be integrated into existing web archive collections. Specific answers:

    - Yes, staging them on a temporary subdomain is a good approach.

    1. Yes, depending on the technical details of "semi-public." VPN, no. Router hole punching, er, maybe. Archive-It crawlers can get behind basic username/password credentialing systems, as discussed in https://support.archive-it.org/hc/en-us/articles/208001306-Archiving-password-protected-sites but it depends on the system. It may be worth submitting a support ticket so a Web Archivist can give additional guidance.

    2. We would not recommend changing crawl dates in the WARC files, as they are considered immutable archival records. "Spoofing" the crawl dates on the Wayback calendar page to reflect the original live date may be possible, but we would have to look into that. In either case, we would suggest adding metadata to the seed, the captures, or both to denote the unique re-crawl origin and to note provenance information on the original files. You may also want to make the original files available for direct download/access in a zip file, dataset, or something similar to supplement the "web" captures.

    Hope that helps! Someone can assist further via a support ticket if you decide to move forward.

    - Jefferson

  • Elvia Arroyo-Ramirez

    Hi Jefferson,

    Thank you so much for your response. I am waiting to hear back from my campus contact about how they want to proceed. Karl got in touch as well, so I will follow up with more information as soon as I hear from them.

    -Elvia

  • Elvia Arroyo-Ramirez

    Hi Jefferson,

    I just wanted to close the loop on this conversation. We were able to work with the campus unit, which put the files on a new, temporary subdomain so that we could crawl the websites. We added metadata to reflect the original date:

    https://archive-it.org/collections/5613?q=communications.uci.edu&page=1&show=Sites

    Thank you,
    Elvia
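
For readers following this thread: the suggestion above to offer the original files for direct download, together with provenance notes, could be sketched roughly as follows. This is only an illustration, not an Archive-It tool or workflow. The function name, the MANIFEST.json file name, and the manifest layout are all assumptions made for the example, and it assumes the output zip is written outside the directory being packaged.

```python
import hashlib
import json
import zipfile
from pathlib import Path


def package_original_files(site_dir, zip_path, provenance_note):
    """Zip every file under site_dir and add a manifest of SHA-256 checksums.

    Hypothetical helper for illustration; zip_path should sit outside site_dir
    so the archive does not try to include itself.
    """
    site_dir = Path(site_dir)
    manifest = {"provenance": provenance_note, "files": {}}
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(site_dir.rglob("*")):
            if not path.is_file():
                continue
            # Store paths relative to the site root, with forward slashes.
            relative = path.relative_to(site_dir).as_posix()
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest["files"][relative] = digest
            archive.write(path, arcname=relative)
        # MANIFEST.json is an assumed name, not an established convention.
        archive.writestr("MANIFEST.json", json.dumps(manifest, indent=2))
    return manifest
```

Calling something like package_original_files("old_site_files", "old_site_files.zip", "Original files re-staged on a temporary subdomain for crawling") would produce a zip that could be linked from the seed or collection metadata. Both paths and the note text here are placeholders.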

