This is a question that I’ve seen pop up from time to time (OK, usually it’s worded a lot more helpfully than that, but I think you get the idea!). The scenario is this: a test or full-production crawl completes, an archived web page becomes visible in Wayback, but the fonts used for some or all of that page’s text on the live web don’t seem to match those in the archived version.
Take for instance this live site:
And how it appears after an initial crawl:
Here’s what happened, and how to fix or avoid it altogether:
The typical website hosts its fonts on the site itself (usually as .ttf or .woff files), where they can be crawled and played back as part of an archival capture. However, some sites pull in their fonts from third-party vendors through APIs (such as Adobe’s Typekit and the increasingly common Google Fonts). In that case, the chain of API links can be too long for our crawler to consider “in scope” by default. And with the fonts therefore missing from the capture, the playback can look pretty wonky.
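To make that chain of links a little more concrete, here is a minimal Python sketch of the last hop: pulling font file URLs out of a stylesheet of the kind a font API returns. The stylesheet text, the font name, and the file path are made up for illustration, and the `font_urls` helper is hypothetical; this is not how our crawler itself extracts links.

```python
import re

# Matches url(...) references in CSS, with or without quotes around the URL.
URL_IN_CSS = re.compile(r"url\(\s*['\"]?([^'\")\s]+)['\"]?\s*\)")

def font_urls(css_text):
    """Return every URL referenced with url(...) in the given CSS text."""
    return URL_IN_CSS.findall(css_text)

# Illustrative stylesheet: the page links to a stylesheet on the font API,
# and that stylesheet in turn points at the font file on a second host.
sample_css = """
@font-face {
  font-family: 'Example Sans';
  src: url(https://fonts.gstatic.com/s/examplesans/v1/example.woff2) format('woff2');
}
"""

found = font_urls(sample_css)
```

The font file here sits two links away from the page (page, then stylesheet, then font file), which is the kind of distance that can put it out of scope by default.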
When discrepancies in fonts appear, there are a couple of ways to quickly diagnose the issue and respond.
First of all, we can always use the developer tools that Jillian recently introduced to reload the archived page and see if any files from a likely source, like fonts.gstatic.com, appear with a 404 (File not found) status code. Doing so with the above example shows that the browser is looking for, but can’t find, the file it needs from that host:
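If you would rather not scan the network panel by eye, most browsers' developer tools can export the network log as a HAR file (a JSON document). The sketch below, in Python, lists the 404 responses that mention a likely font host. The `FONT_HOSTS` set, the `missing_font_requests` helper, and the sample playback URLs are assumptions for illustration only; adjust them to the hosts you suspect.

```python
import json

# Hosts assumed to serve fonts; extend this set for your own sites.
FONT_HOSTS = {"fonts.gstatic.com", "fonts.googleapis.com"}

def missing_font_requests(har):
    """Return the URLs of 404 responses that mention a known font host."""
    missing = []
    for entry in har["log"]["entries"]:
        url = entry["request"]["url"]
        status = entry["response"]["status"]
        # Playback URLs usually embed the original URL in their path, so
        # search the whole string instead of comparing the hostname only.
        if status == 404 and any(host in url for host in FONT_HOSTS):
            missing.append(url)
    return missing

# Tiny hand-made HAR fragment with hypothetical playback URLs.
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "https://wayback.example.org/1234/2017/https://fonts.gstatic.com/s/example/v1/example.woff2"},
             "response": {"status": 404}},
            {"request": {"url": "https://wayback.example.org/1234/2017/https://www.example.com/style.css"},
             "response": {"status": 200}},
        ]
    }
}

missing = missing_font_requests(sample_har)

# With a real export, load the file first:
#   with open("archived-page.har", encoding="utf-8") as f:
#       missing = missing_font_requests(json.load(f))
```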
Alternatively, we can always rely on our post-crawl reports to pinpoint these kinds of missing URLs. If our crawler was able to find them, but they were just too many “hops” away from the seed page for it to consider in scope, they’ll be listed in the “Out of scope” column for their host, in this case fonts.gstatic.com.
Having identified the missing piece, we can adjust our crawling strategy to retrieve and fit it into the rest of our puzzle. Let’s expand the scope of our future crawls to include, in this common case, all URLs that contain the text “fonts.gstatic.com”:
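A rule like this amounts to a simple "URL contains this text" test. The Python sketch below shows the idea only; `SCOPE_IN_TEXT` and `in_expanded_scope` are hypothetical names, the sample URLs are made up, and this is not the crawler's actual implementation.

```python
# Text fragments that bring a URL into scope whenever they appear in it.
SCOPE_IN_TEXT = ["fonts.gstatic.com"]

def in_expanded_scope(url, fragments=SCOPE_IN_TEXT):
    """Return True if the URL contains any of the scoped-in text fragments."""
    return any(fragment in url for fragment in fragments)

# A font file on the scoped-in host matches; an unrelated URL does not.
font_file = "https://fonts.gstatic.com/s/examplesans/v1/example.woff2"
other_file = "https://www.example.com/images/logo.png"
results = (in_expanded_scope(font_file), in_expanded_scope(other_file))
```

Because the match is on text anywhere in the URL, a short fragment can pull in more than intended, so it is worth keeping the text as specific as the host name itself.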
After making this little enhancement, our subsequent archives look more like their live equivalents:
Here are the hosts that I’ve found it necessary to “scope in” in past cases:
Feel free to comment with more examples if you find them, and we at Archive-It will do the same!