One date for the site capture but different capture dates as you drill through "pages"
I have a researcher who is going to do an in-depth content analysis of a specific site that we have been capturing since 2010. The captures of the site "Tobacco Issues" span from 2010 to January 2017: https://wayback.archive-it.org/5622/*/http://www.tobaccoissues.com/
I notice that when I click through the May 7, 2010 capture (https://wayback.archive-it.org/5622/20100507212107/http://www.tobaccoissues.com/), some of the pages jump me to a different capture date (December 23, 2011) within the May 7 capture. I suppose that Wayback is somehow filling in gaps, but with content from over a year into the future? I also went to the URL on Wayback, which of course has many more captures of this site than my collection, and the same jump between captures happens there as well. I am just wondering if anyone knows why this happens and what the mechanism is, so that I can communicate it back to the researchers. Unfortunately, this means the web archive is not actually a real "capture" in time, which is problematic for content analysis.
Thanks!
-
Hi, Rachel! In this case you have links off of the archived landing page for this seed that were not captured when it was originally crawled in May 2010, but that were captured during subsequent crawls, for instance starting in December 2011. When you follow a link in Wayback from one archived page to another, it defaults to showing you the most temporally proximal capture of the latter. This is especially important for replay because even a single crawl can run for multiple days and therefore link pages that were archived on different dates. Either way, this is preferable to the “Not in Archive” message when that latter capture exists somewhere in the archive, because it would be misleading to say that the page wasn’t archived. It was, just not necessarily on the same date as the source page.
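To make the "most temporally proximal capture" idea concrete, here is a minimal Python sketch of that kind of selection. It is only an illustration of the behavior described above, not Wayback's actual implementation. The function names and the list of capture timestamps are hypothetical, and the time of day given for the December 23, 2011 capture is made up.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse a 14-digit timestamp (YYYYMMDDhhmmss) as used in replay URLs."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def closest_capture(requested: str, captures: list[str]) -> str:
    """Return the capture timestamp nearest in time to the requested one."""
    target = parse_ts(requested)
    return min(captures, key=lambda ts: abs(parse_ts(ts) - target))


# Hypothetical captures of one linked page that was missed in May 2010.
linked_page_captures = ["20111223000000", "20140615120000", "20170110080000"]

# The reader is browsing the May 7, 2010 capture and follows a link.
print(closest_capture("20100507212107", linked_page_captures))
# prints 20111223000000
```

Because the linked page has no capture from 2010 in this example, the nearest one is from December 2011, which is the jump described in the question.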
Depending upon your researcher, their level of familiarity with web archives, the kinds of analysis that they intend to perform, etc., there should still be a way for them to look at only the captures that were made up to a specified date/time in your archive. Either here or directly, is there anything more you can tell me specifically about their intent?
-
Thanks Karl! I appreciate you clearing up the dates in the preserved website and how they affect replay. I agree that filling in the page with one from a close date is preferable to nothing. I think the only time it poses a problem for our researchers is when they are coding websites for a specific study and need to be able to say with certainty that, for example, an ad campaign had a particular slant within a given period, or that "on a specific date, this particular e-cig advocacy site claimed that the sidestream smoke was only water vapor." When they link to that particular seed in their paper, they don't want jumps in dates within the preserved site, since that may make their statement look less credible. I don't think that's much of a problem when the dates are only off by a few days or a month, but they did look at one site where part of the site was from a different year (my example above), and that was disconcerting. I have noticed that this filling in of pages happens less frequently in the more recent captures, which I think might be due to better crawling? Anyway, mostly I just wanted a good, clear explanation of why this occurs so I could relay it back to the researchers, and they now know to look for date changes in the banner when they click through.
Thanks again Karl!
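For researchers who want to check for date jumps more systematically than by watching the banner, the check can be sketched in Python by comparing the 14-digit timestamps in the replay URLs. This assumes the URL in the address bar reflects the capture actually being shown. The function names, the 30-day tolerance, and the second URL (its path and time of day) are hypothetical choices for illustration.

```python
import re
from datetime import datetime, timedelta

# A replay URL carries a 14-digit timestamp between slashes.
TS_PATTERN = re.compile(r"/(\d{14})/")


def capture_time(replay_url: str) -> datetime:
    """Extract the capture date and time from a replay URL."""
    match = TS_PATTERN.search(replay_url)
    if match is None:
        raise ValueError(f"no 14-digit timestamp in {replay_url!r}")
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


def drifted(start_url: str, visited_url: str,
            tolerance: timedelta = timedelta(days=30)) -> bool:
    """True if the visited page was captured too far from the starting page."""
    gap = abs(capture_time(visited_url) - capture_time(start_url))
    return gap > tolerance


start = "https://wayback.archive-it.org/5622/20100507212107/http://www.tobaccoissues.com/"
# Hypothetical inner page served from the December 23, 2011 crawl.
inner = "https://wayback.archive-it.org/5622/20111223000000/http://www.tobaccoissues.com/about"

print(drifted(start, inner))  # prints True
```

The tolerance is a judgment call for each study: a few days covers a single long-running crawl, while a stricter coding scheme might reject anything outside the same day.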