When we talk about the web, we talk about pages. But really, a website is more like a jigsaw puzzle; all kinds of files fit together to complete the experience. There are some powerful tools built right into web browsers that allow you to watch this process of puzzle-solving in real time. Named differently in each browser, they are commonly known as Developer Tools. While they are (unsurprisingly) aimed at web developers, these tools can help Archive-It partners as well. You can use them to identify specific elements of a website (images, scripts, documents, and more), to either block them or scope them in, and to ensure that all files are playing back the way you expect them to. It’s a user-friendly way of reading source code!
Developer Tools can be found in different places in different browsers; the easiest way to access them is to use the keyboard shortcuts in Chrome, Firefox, or Safari (typically Ctrl+Shift+I on Windows and Linux, or Cmd+Option+I on a Mac; in Safari you may first need to enable the Develop menu in the browser's settings). This will open a console in your web browser while a page loads in the background. The console includes tabs that can help to debug different issues. For the purposes of troubleshooting a Wayback page, however, the Network tab is the most helpful to use first. As you load and/or reload a page, the Network tab will show you all of the requests and responses being made in order to build the page. It's pretty neat!
The file names and their domains identify the pieces of the page and where each originates. If you’re thinking about how to manage your data on a specific seed, or want a leg up on scoping, taking a look at these before running a test crawl is a helpful way to identify which host domains or even specific files to scope in or out. It also gives you insight into any potential robots.txt blockers.
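If you would rather tally the hosts than scan the list by eye, most browsers let you export the Network tab's contents as a HAR file, which is JSON. The following is a minimal Python sketch, not an Archive-It tool: it assumes the standard HAR layout (a "log" object holding a list of "entries", each with a "request" URL), and the function name and sample URLs are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def hosts_in_har(har: dict) -> Counter:
    """Count how many requests went to each host in a HAR-style dict."""
    counts = Counter()
    for entry in har["log"]["entries"]:
        host = urlsplit(entry["request"]["url"]).hostname
        if host:  # skip data: URLs and anything else without a host
            counts[host] += 1
    return counts


# Hypothetical sample standing in for an exported HAR file.
# For a real export you would load it with json.load() instead.
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "https://example.org/index.html"}},
            {"request": {"url": "https://example.org/css/site.css"}},
            {"request": {"url": "https://cdn.example.net/js/app.js"}},
            {"request": {"url": "https://fonts.example.com/font.woff2"}},
        ]
    }
}

# Print each host with its request count, busiest host first.
for host, count in hosts_in_har(sample_har).most_common():
    print(f"{count:3d}  {host}")
```

Each host that shows up here is a candidate for a scoping rule, whether you want to bring it into the crawl or keep it out.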
For a quick indication of whether all of your page elements are successfully loading (be it from the live web or Wayback), check out the status column. In the far left-hand column of the example above, each file has a numeric status code and a colored dot. Anything starting with a 2 (such as 200 or 204) was found and loaded successfully. Anything starting with a 4 was not. In particular, a 404 status means that the file was not found. When looking at an archived page, scanning through these URLs and identifying the 404s is effectively what Wayback QA does, so you can expect anything that turns up as a 404 in the network console to be available for patch crawling in the web application. This can be helpful when trying to pinpoint a specific element’s source, or its precise URL, to determine whether it was captured. In fact, it’s usually the first thing that we look at when we analyze the pages in your support tickets! Use the file types along the top of the console to narrow the list. For example, if you’re looking for a specific image, limit the list to only image files.
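The same status check can be run over an exported HAR file when a page has too many requests to scan comfortably. This is only a rough Python sketch under the assumption that the export follows the standard HAR layout, where each entry carries a request URL and a response status; the function names and sample URLs are hypothetical, and it is not how Wayback QA itself is implemented.

```python
def status_summary(har: dict) -> dict:
    """Group request URLs by status class, e.g. '2xx' or '4xx'."""
    summary = {}
    for entry in har["log"]["entries"]:
        status = entry["response"]["status"]
        key = f"{status // 100}xx"
        summary.setdefault(key, []).append(entry["request"]["url"])
    return summary


def not_found(har: dict) -> list:
    """Return the URLs that came back with a 404 status."""
    return [
        entry["request"]["url"]
        for entry in har["log"]["entries"]
        if entry["response"]["status"] == 404
    ]


# Hypothetical sample standing in for an exported HAR file.
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "https://example.org/index.html"},
             "response": {"status": 200}},
            {"request": {"url": "https://example.org/ping"},
             "response": {"status": 204}},
            {"request": {"url": "https://example.org/img/banner.jpg"},
             "response": {"status": 404}},
            {"request": {"url": "https://cdn.example.net/private.js"},
             "response": {"status": 403}},
        ]
    }
}

# Show how many requests fell into each status class.
for status_class, urls in sorted(status_summary(sample_har).items()):
    print(status_class, len(urls))

# List the missing files, which are the likely patch crawl candidates.
for url in not_found(sample_har):
    print("missing:", url)
```

The list returned by not_found corresponds to the 404 rows you would otherwise pick out by hand in the status column.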
The network console gives a bird’s eye view of all the files needed to render a page, and provides a scalpel-like tool for identifying troublesome files.