Archive-It captures rendered through archive.org Wayback Machine
Hi AIT community,
Our web archiving team was recently alerted to an issue that one of our contributing web admins noticed. She indicated that she prefers to access our captures of her sites through the archive.org wayback machine public user interface (rather than through that of Archive-It). I told her that this should be ok, since everything we capture gets passed to the archive.org wayback machine, where it is also supplemented by additional captures made by others.
However, I had to eat my words last week when she noticed that the crawls are rendering differently across the two platforms. Specifically, the archive.org wayback machine is rendering at least one site without an essential css file.
Example:
Not-so-great capture in archive.org: https://web.archive.org/web/20170606201912/https://innovation.gwu.edu/
Lovely capture in archive-it: https://wayback.archive-it.org/5184/20170606201912/https://innovation.gwu.edu/
Here are some of the theories we've entertained:
1. There is a lag in processing, where some files (perhaps those grabbed in a patch crawl) take a while to make it over to archive.org's wayback machine. If so, it would be a lag of at least a month, as the css files are also missing for a May 22 capture of the page above.
2. Patch crawls don't get passed back to archive.org's wayback machine, and the missing css file was captured in a patch crawl.
3. These are not in fact the same capture, and I am incorrect in thinking that our AIT captures flow back into the archive.org wayback machine public portal.
4. The two platforms render captures differently, so it is expected that captures may appear differently across the two platforms.
Or perhaps there is another more elegant, obvious explanation staring us in the face?
Thanks, community!
-
Thanks for bringing this up, Rachel. It’s a useful example of the differences between web archival replay tools. In short, you’ve got it right with theory #4 :-) The Wayback Machine at archive.org and Archive-It’s Wayback implementation are similar, but are maintained separately and in response to the distinct needs of their users, their scale of content, etc.
In this case, for instance, because Archive-It’s developers and engineers can respond to issues when flagged by our partners, they can improve upon the replay currently achievable through the more “general” Wayback Machine or other tools. Specifically, I can see that the Wayback Machine has trouble interpreting the necessary CSS files on this example page as CSS--an issue that we’ve encountered in the past with other partners’ captures and have thus been able to address through improvements to Archive-It Wayback.
Rest assured in the meantime that the content was indeed captured and is shared between the two access points, and as improvements on the Wayback Machine also continue, replay on its side may improve in the future as well.
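If it helps to sanity-check theory #3 for other pages, here is a minimal Python sketch that pulls the timestamp and original URL out of two replay links and compares them. The path layout is inferred only from the two example links in this thread, and `parse_replay_url` and `same_capture` are hypothetical helper names, not part of any Archive-It or Wayback tooling. Matching values are consistent with a single shared capture, though they do not prove it on their own.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class ReplayUrl(NamedTuple):
    host: str       # e.g. web.archive.org or wayback.archive-it.org
    prefix: str     # "web" on archive.org, the collection ID on Archive-It
    timestamp: str  # capture timestamp, assumed here to be 14 digits
    original: str   # URL of the archived page


def parse_replay_url(url: str) -> ReplayUrl:
    """Split a replay link of the form https://host/<prefix>/<timestamp>/<original>."""
    parts = urlsplit(url)
    pieces = parts.path.split("/", 3)
    if len(pieces) != 4:
        raise ValueError(f"unexpected replay URL layout: {url}")
    _, prefix, timestamp, original = pieces
    if not (timestamp.isdigit() and len(timestamp) == 14):
        raise ValueError(f"no 14-digit timestamp found in: {url}")
    if parts.query:
        # urlsplit moves the archived page's query string out of the path
        original += "?" + parts.query
    return ReplayUrl(parts.netloc, prefix, timestamp, original)


def same_capture(a: ReplayUrl, b: ReplayUrl) -> bool:
    """True when both links name the same timestamp and original URL."""
    return a.timestamp == b.timestamp and a.original == b.original


ia = parse_replay_url(
    "https://web.archive.org/web/20170606201912/https://innovation.gwu.edu/"
)
ait = parse_replay_url(
    "https://wayback.archive-it.org/5184/20170606201912/https://innovation.gwu.edu/"
)
print(same_capture(ia, ait))
```

For the two links in the original post, both the timestamp and the original URL match; only the host and the path prefix differ.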