Impact of bots on Wayback?

Comments

  • Karl Blumenthal

    Thanks for asking, Sarah!

    Can you please tell us more about how and/or where you are trying to access the missing calendar page in this example, so we can retrace your steps? I don't see this URL as a seed on the public collection page, for instance. Is there somewhere else that I can start?
  • Sarah Weeks

    Karl, thanks for replying. I see now that this exact seed is not in our collection. (Wustl.edu has changed to washu.edu for many of our sites, and I'm working on how to handle that.) Here's a different example:

    I started at this collection:
    https://archive-it.org/collections/20310
    Some of those items say 0 captures, some get hung up on "Loading Wayback Info," and some show their correct number of captures.

    Clicking on the second item, titled Jan Castro, I was taken here:
    https://wayback.archive-it.org/4726/20250512205823id_/http://jancastro.com/ 

    Going back to the collection and clicking the item again, I do get to the calendar page. Clicking on a blue dot, I get this:
    https://wayback.archive-it.org/4726/20210625214051id_/http://jancastro.com/
    ...which is a Not in Archive page. Trying again, I get the capture as it's supposed to look. 

    I have also had the experience with this Jan Castro example of being taken straight to a capture, bypassing the calendar page. 
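
For anyone retracing the links above, it can help to split a replay URL into its parts. The sketch below is illustrative only: the regular expression is an assumption based on the URLs quoted in this thread, not a documented Archive-It format, and the names `parse_wayback_url` and `archive_id` are hypothetical. It reads the 14 digits as a YYYYMMDDhhmmss timestamp and treats the trailing `id_` as an optional modifier.

```python
import re
from datetime import datetime

# Assumed shape, inferred from the URLs quoted in this thread:
# https://wayback.archive-it.org/<number>/<14-digit timestamp or *><optional modifier>/<original URL>
WAYBACK_RE = re.compile(
    r"^https://wayback\.archive-it\.org/(?P<archive_id>\d+)/"
    r"(?P<timestamp>\d{14}|\*)(?P<modifier>[a-z]{2}_)?/(?P<original>.+)$"
)


def parse_wayback_url(url):
    """Split a replay or calendar URL into its parts, or return None if it does not match."""
    match = WAYBACK_RE.match(url.strip())
    if match is None:
        return None
    parts = match.groupdict()
    stamp = parts["timestamp"]
    # "*" marks the calendar page, which has no single capture time.
    parts["captured_at"] = (
        None if stamp == "*" else datetime.strptime(stamp, "%Y%m%d%H%M%S")
    )
    return parts


if __name__ == "__main__":
    for link in (
        "https://wayback.archive-it.org/4726/20250512205823id_/http://jancastro.com/",
        "https://wayback.archive-it.org/4726/20210625214051id_/http://jancastro.com/",
        "https://wayback.archive-it.org/19943/*/https://sites.wustl.edu/transgenderspectrumconference/",
    ):
        print(parse_wayback_url(link))
```

Comparing the parsed pieces side by side makes it easier to report exactly which capture and which modifier each failed link pointed at.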

    Another, separate example:
    In trying to access this item: https://wayback.archive-it.org/19943/*/https://sites.wustl.edu/transgenderspectrumconference/
    I see a blue dot, and I can click it and get to the capture. Clicking around within the capture, I get: https://partner.archive-it.org/missing_url_record?reason=INVALID_UNKNOWN&referrer=https%3A%2F%2Fwustl.app.box.com%2Fs%2Fgn2vbea8q9i2g7ay0sbhcg50zwfein8t&mime=&status=404&size=0&collId=4726&timestamp=20250512205823&url=https%3A%2F%2Fsites.wustl.edu%2Ftransgenderspectrumconference%2F

    Now, that one I've never seen before - and I'm not sure how wustl.app.box.com got mixed up in there, but I know Box pages to be uncapturable. 
    I have also had this missing_url_record thing happen on other seeds I was trying to access. 
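
The missing_url_record link is easier to read once its query string is decoded. Here is a minimal sketch using only Python's standard library; the helper name `decode_missing_record` is hypothetical, and nothing here is an Archive-It tool. It simply percent-decodes the parameters of the link quoted above so the referrer and requested URL can be read directly.

```python
from urllib.parse import parse_qs, urlsplit

# The link quoted in the post, with its query string written out in full.
MISSING_RECORD_URL = (
    "https://partner.archive-it.org/missing_url_record?reason=INVALID_UNKNOWN"
    "&referrer=https%3A%2F%2Fwustl.app.box.com%2Fs%2Fgn2vbea8q9i2g7ay0sbhcg50zwfein8t"
    "&mime=&status=404&size=0&collId=4726&timestamp=20250512205823"
    "&url=https%3A%2F%2Fsites.wustl.edu%2Ftransgenderspectrumconference%2F"
)


def decode_missing_record(url):
    """Return the query parameters of a missing_url_record link as a plain dict."""
    query = urlsplit(url).query
    # keep_blank_values=True so empty fields such as "mime=" are still listed.
    decoded = parse_qs(query, keep_blank_values=True)
    # Each parameter appears once in this link, so take the first value.
    return {name: values[0] for name, values in decoded.items()}


if __name__ == "__main__":
    for name, value in decode_missing_record(MISSING_RECORD_URL).items():
        print(f"{name}: {value!r}")
```

Decoded this way, the `referrer` field is the wustl.app.box.com address and the `url` field is the conference site itself, which may be useful detail to include when reporting the error.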

    Reloading, and backing out and trying again, seem to do a lot towards addressing both problems. 
  • Karl Blumenthal

    Gotcha! Thank you for that context. May I ask if the issues persist even after you 1) disable Wayback QA, and 2) clear your web browser of any old data from wayback.archive-it.org?

    1. To turn off Wayback QA, click the link in the banner.

    2. To remove the old data from Archive-It by web browser:

    • Chrome: Go to chrome://settings/content/all?searchSubpage=archive-it and click the trash can icon.
    • Brave: Go to brave://settings/content/all?searchSubpage=archive-it and click the trash can icon.
    • Safari: From the menu, select Safari > Settings... > Privacy > Manage Website Data... and search for archive-it. Press the "Remove" button, followed by "Done."
    • Edge: From the menu, select Settings and more > Settings > Cookies and site permissions. Under "Cookies and data stored," select Manage and delete cookies and site data > See all cookies and site data. Search for "archive-it." Click on the "Delete" button with the trash icon.
    • Firefox: Go to about:preferences#privacy and click on the "Manage data" button. Search for and select "archive-it." Press the "Remove selected" button, followed by "Save changes."