Each crawl report includes a Hosts tab, located between the tabs for Seeds and File Types.
For a detailed walkthrough of the host report, check out our Understanding Your Host Report video.
What's inside the report
The hosts report includes information on every distinct host site to which your crawl was led, which includes your seed URLs in addition to all other sites considered within the default or selected scope.
Data and documents
The host report's charts indicate how much data and how many documents were archived from each host site encountered by our crawler in the course of your crawl. Hover over any segment to discover its host site and data or document counts.
Precise figures for total and new documents, and for the volumes of total and new data, can be browsed on a host-by-host basis in the Hosts List. For information on the difference between total and new documents and data, see About data de-duplication.
Hosts
In addition to summaries of total and new documents and data, the Hosts List enumerates the specific URLs recognized by our crawler on a host-by-host basis. To review the URLs, click the hyperlinked number in the list, which opens that list in a browser window. URLs can be listed as:
- Docs: All URLs encountered by the crawler.
- New docs: All URLs encountered by the crawler for the first time and therefore archived.
- Blocked: URLs that were not archived because they were excluded from crawling by the robots.txt protocol. These URLs may be immediately patch crawled or "scoped-in" so that our crawler can archive them in the future. While robots.txt exclusions can prevent our crawler from archiving desired content, they are also used to protect less desirable, more administrative sections of websites. When you see URLs listed as "Blocked," it is therefore good practice to review the list and decide whether or not the URLs are indeed valuable to your collection.
- Queued: URLs that were not archived because our crawler was not able to reach them before hitting a predetermined time, data, or document limit. These are effectively the next URLs in line to be crawled from each host if the crawl were to continue. Especially high numbers of queued URLs typically indicate a crawler trap, which can be addressed by modifying the scope of your crawl.
- Out of Scope: URLs that the crawler did not archive because it deemed them to be outside the scope of your collection. If you find URLs in this list that you do want included in your collection, they can be added either by selective patch crawling in Wayback QA or, when the list is especially long, by modifying the scope of your collection's crawls to explicitly include URLs of their type in future crawls.
Note: The crawler may discover and collect a few documents from a host before determining that the rest are out of scope.
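If you would like to see why particular URLs end up in the "Blocked" list, the following is a minimal sketch (not part of Archive-It itself) that checks a list of URLs against a site's robots.txt rules using Python's standard library. The robots.txt content, the URLs, and the user agent name "example-crawler" are all hypothetical placeholders; substitute the rules published by the host you are reviewing.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent):
    """Return the URLs that the given robots.txt rules disallow for user_agent."""
    parser = RobotFileParser()
    # parse() accepts the robots.txt content as a list of lines,
    # so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt that protects an administrative area and an images folder.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /images/
"""

# Hypothetical URLs, as they might appear in a host's URL lists.
URLS = [
    "https://example.org/index.html",
    "https://example.org/admin/login",
    "https://example.org/images/banner.png",
]

blocked = blocked_urls(ROBOTS_TXT, URLS, "example-crawler")
for url in blocked:
    print(url)
```

In this made-up example, the administrative login page is probably not worth collecting, while the blocked image may be needed for the archived page to display properly. That is the kind of judgment to apply when reviewing a host's "Blocked" list.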
Further information
What are these screenshot:, thumbnail:, and youtube-dl: hosts in my crawl report?
What are all these other hosts listed in my crawl's Hosts report?
How can I exclude individual hosts within a domain from archiving?
What is the difference between all and new documents/data?
If you have questions or concerns about anything that you see in the Hosts report, please reach out to an Archive-It Web Archivist for help.
Related content
Reading your crawl's seeds report
Reading your crawl's file types report