Beyond the archived content itself, you can learn a lot about your collections from the URLs and messages that they generate in our Wayback Machine browsing tool. The explanations below describe what each of them means:
How to read and interpret Wayback URLs
Archived URLs from your collections are always formatted in a specific way. In order, they display: the Archive-It host information, the collection ID number, the date on which the page was captured (broken down as yyyymmddhhmmss and recorded in GMT), and lastly the address for the archived URL itself. For example:
http://wayback.archive-it.org/194/20080414172354/http://www.governor.state.nc.us/
is a capture of www.governor.state.nc.us in collection 194 as it appeared on April 14, 2008, at 5:23 PM GMT (date code 20080414172354, or 17:23:54).
When a * appears in place of the date code above, the Wayback Machine returns all dates on which the URL was archived, in the form of a calendar. You can limit the dates shown on this calendar by adding a year or month before the *. There are some examples of this in the URL Date Queries section below.
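As a rough illustration of the URL layout described above, the following Python sketch splits a dated archived URL into its parts. The function name `parse_wayback_url` and the regular expression are assumptions made for this example, not part of any Archive-It tool, and the pattern only covers URLs that carry a full 14-digit date code.

```python
import re
from datetime import datetime, timezone

# Assumed pattern for the layout described above:
# host / collection ID / 14-digit date code / original URL
WAYBACK_PATTERN = re.compile(
    r"^https?://wayback\.archive-it\.org/"
    r"(?P<collection>\d+)/"
    r"(?P<datecode>\d{14})/"
    r"(?P<original>.+)$"
)


def parse_wayback_url(url):
    """Split an archived URL into collection ID, capture time (GMT), and original URL."""
    match = WAYBACK_PATTERN.match(url)
    if match is None:
        raise ValueError(f"Not a dated Wayback URL: {url}")
    # The date code is yyyymmddhhmmss, recorded in GMT.
    captured = datetime.strptime(match.group("datecode"), "%Y%m%d%H%M%S")
    return {
        "collection": int(match.group("collection")),
        "captured": captured.replace(tzinfo=timezone.utc),
        "original": match.group("original"),
    }


info = parse_wayback_url(
    "http://wayback.archive-it.org/194/20080414172354/http://www.governor.state.nc.us/"
)
print(info["collection"])            # 194
print(info["captured"].isoformat())  # 2008-04-14T17:23:54+00:00
print(info["original"])              # http://www.governor.state.nc.us/
```

URLs that contain a * instead of a full date code are queries rather than single captures, so this sketch deliberately rejects them.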
How to query the Wayback Machine with URLs
URL Queries
You can search Archive-It collections by URL from within the web application or from our public website (for example, http://www.archive-it.org/collections/194). When you enter a specific URL, such as http://www.governor.state.nc.us/news/pressreleases/, into the Wayback Machine search bar, your results will display as a list of dates on which the URL was archived: http://wayback.archive-it.org/194/*/http://www.governor.state.nc.us/news/pressreleases/
URL Prefix Queries
This query will display all archived links for a given domain. To search using this query method, add a * (or wildcard) to the end of a URL query, for example: http://wayback.archive-it.org/194/*/http://www.governor.state.nc.us/*
The total number of captured documents will display at the top of the screen.
Please note that this number reflects the total number of archived links, but only unique URLs are listed. For example, you could have 1,000 links archived but see only 800 links listed, because some links have been captured more than once. Next to each listed link you will see a number of versions; this is the number of different captures of that link.
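The difference between the total and the listed links can be illustrated with a small Python sketch. The capture records below are hypothetical sample data (only the first date code appears elsewhere in this article), and `summarize_captures` is a name invented for this example.

```python
from collections import Counter

# Hypothetical capture records: one (URL, date code) pair per archived link.
captures = [
    ("http://www.governor.state.nc.us/", "20070913204539"),
    ("http://www.governor.state.nc.us/", "20071201120000"),
    ("http://www.governor.state.nc.us/news/", "20070913204601"),
]


def summarize_captures(records):
    """Return the total number of captures and a per-URL count of versions."""
    versions = Counter(url for url, _datecode in records)
    return len(records), versions


total, versions = summarize_captures(captures)
print(total)          # 3 captured documents in total
print(len(versions))  # 2 unique URLs would be listed
for url, count in versions.items():
    print(url, count)  # each listed link with its number of versions
```

Here three captures collapse into two listed links, one of which has two versions.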
URL Date Queries
This is a search by specific date or date range. This query relies on the 14-digit date code in the middle of each archived URL (yyyymmddhhmmss). You can combine dates and *s to control which capture dates appear in your results, for example:
- http://wayback.archive-it.org/194/20070913204539/http://www.governor.state.nc.us/ – displays www.governor.state.nc.us/ as it looked on September 13, 2007 at 20:45:39 GMT
- http://wayback.archive-it.org/194/2007*/http://www.governor.state.nc.us/ – displays all 2007 captures of www.governor.state.nc.us/. You can view the results for any other year of crawling by changing the year in the date code.
- http://wayback.archive-it.org/194/200712*/http://www.governor.state.nc.us/ – displays all dates on which www.governor.state.nc.us/ was captured in December 2007. You can limit the results even further, to a specific date, by adding more digits to the date code.
You can switch back and forth among these queries at any time by changing the web address at the top of your browser window.
Note that URL and URL date queries only show results for the exact URL you are looking up. When you look up www.governor.state.nc.us/, you are only seeing captures for that precise page. However, if you were viewing a page deep inside this host site and you wanted to see the other dates on which that page was captured, just manually change the date code in that page's URL to *.
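The query patterns in this section can also be assembled programmatically. The Python sketch below is an illustration only: the helper names `build_query` and `all_captures_url` are assumptions made for this example, and the functions simply reproduce the URL shapes shown above rather than calling any Archive-It API.

```python
import re

WAYBACK_HOST = "http://wayback.archive-it.org"


def build_query(collection, url, date_prefix="", prefix_query=False):
    """Build a Wayback query URL.

    date_prefix: "" for all dates, or leading digits of the date code
    (for example "2007" or "200712"); a full 14-digit code selects one capture.
    prefix_query: when True, append * to the URL for a URL prefix query.
    """
    if not re.fullmatch(r"\d{0,14}", date_prefix):
        raise ValueError("date_prefix must be 0 to 14 digits")
    # A complete date code needs no wildcard; anything shorter gets a *.
    date_part = date_prefix if len(date_prefix) == 14 else date_prefix + "*"
    suffix = "*" if prefix_query else ""
    return f"{WAYBACK_HOST}/{collection}/{date_part}/{url}{suffix}"


def all_captures_url(archived_url):
    """Swap the 14-digit date code of an archived URL for * to list all captures."""
    return re.sub(
        r"^(https?://wayback\.archive-it\.org/\d+/)\d{14}/",
        r"\1*/",
        archived_url,
        count=1,
    )


site = "http://www.governor.state.nc.us/"
print(build_query(194, site))                     # URL query, all dates
print(build_query(194, site, prefix_query=True))  # URL prefix query
print(build_query(194, site, "2007"))             # all 2007 captures
print(build_query(194, site, "200712"))           # December 2007 captures
print(all_captures_url(
    "http://wayback.archive-it.org/194/20070913204539/http://www.governor.state.nc.us/"
))
```

The last call mirrors the manual step described above: it takes the URL of a single capture and returns the query that lists every date on which that same page was captured.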
How to interpret Wayback error messages
Many sites can be difficult to capture, including those that use passwords or robots exclusions and those that are heavy on dynamic/responsive elements. In turn, the archived versions of these sites may replay incompletely through the Wayback Machine. For this reason, we strongly recommend reviewing your crawls and performing quality assurance. When browsing or querying your archives in the Wayback Machine, you may see specific kinds of capture and/or replay problems described by the following error messages:
- Not in Archive: The page you are looking for has not been archived.
- Blocked Site Error: Site owners, copyright holders, and others who fit Internet Archive's exclusion policy have requested that the site be excluded from the Wayback Machine.
- Robots.txt: A robots.txt file is something that a site owner puts on their site in order to keep crawlers like ours from accessing it. The Internet Archive retroactively respects all robots.txt files, but Archive-It partners have resources to avoid them.
- Redirect Error: If the page redirects more than five times, the Wayback Machine will stop following and display this error message. This can happen particularly in sites that have lots of responsive scripts.
- Failed Connection: You should usually see this message only when you are trying to access seeds that have not yet been indexed for Wayback Machine use (typically right after you set up a new collection). If you see this message under other circumstances, please contact an Archive-It Web Archivist for assistance.