Each crawl report includes a "Hosts" tab, located between the Seeds and File Types tabs.
For a detailed walkthrough of the Hosts report, along with information on how to use it, check out our Understanding Your Host Report video.
What's inside the report
The Hosts report includes information on every distinct host site that your crawl reached, which can include the hosts of your seed URLs as well as all other sites considered or directed to be in scope.
Data and documents
The graphics at the top of the Hosts report indicate how much data and how many documents were archived from each host site encountered by our crawler in the course of your crawl. Hover over any segment of these charts to see which host it represents and how it compares to the others.
Precise figures for total and new documents, and for total and new data volumes, can be browsed host by host in the table below. (For the difference between total and new documents and data, refer to our guidance on data de-duplication.)
Hosts
In addition to summarizing total and new documents and data, the table at the bottom of the Hosts report lists all URLs recognized by our crawler, host by host. URLs are grouped into the categories below; click the hyperlinked number in the table to open the corresponding list in a browser window. URLs may be listed as:
- Docs: All URLs encountered by the crawler.
- New docs: All URLs encountered by the crawler for the first time and therefore archived.
- Blocked: Meaning that the URLs were not archived because they were excluded from crawling by the robots.txt protocol (see the sketch after this list for one way to check which URLs a site's robots.txt excludes). These URLs may be immediately patch crawled or "scoped-in" so that our crawler may archive them in the future. While robots.txt exclusions can prevent our crawler from archiving desired content, they are also used to protect less desirable, more administrative sections of websites. When you see URLs listed as "Blocked," it is therefore good practice to review the list and decide whether or not the URLs are indeed valuable to your collection.
- Queued: Meaning that the URLs were not archived because our crawler was not able to reach them before hitting a predetermined time, data, or document limit. These URLs are effectively the next URLs in line to be crawled from each host if the crawl were to continue. Especially high numbers of queued URLs are typically an indication of a crawler trap (for example, calendar pages that generate an endless series of date-based URLs), which can be addressed by modifying the scope of your crawl in order to avoid expending your document/data budget on excessive and undesired content.
- Out of Scope: Meaning that the crawler did not archive these URLs because it deemed them to be outside of the scope of your collection. If you find URLs in this list that you do indeed want included in your collection, they can be added either by way of selective patch crawling in Wayback QA, or, when the list is especially long, by modifying the scope of your collection's crawls to explicitly include URLs of that type in the future.
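If you want to preview which URLs a site's robots.txt rules would exclude before reviewing your "Blocked" list, the sketch below shows one way to do so with Python's standard-library urllib.robotparser. The host, URLs, and user-agent string are hypothetical placeholders, not values taken from your crawl report, and this is only an illustration of how robots.txt exclusions work; our crawler performs this check for you during the crawl itself.

    # Sketch: check which URLs a site's robots.txt would exclude from crawling.
    # The host, user-agent string, and URLs below are hypothetical placeholders.
    from urllib import robotparser

    parser = robotparser.RobotFileParser()
    parser.set_url("https://www.example.org/robots.txt")
    parser.read()  # fetch and parse the site's live robots.txt file

    candidate_urls = [
        "https://www.example.org/exhibits/index.html",
        "https://www.example.org/admin/login.php",
    ]

    for url in candidate_urls:
        # can_fetch() returns False for URLs the robots.txt rules disallow,
        # roughly the URLs a crawl report would count as "Blocked".
        allowed = parser.can_fetch("example-crawler", url)
        print(f"{url}: {'crawlable' if allowed else 'blocked by robots.txt'}")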
Further information
If you have questions or concerns about anything that you see in the Hosts report, please don't hesitate to reach out to an Archive-It Web Archivist for help.
Related content
Reading your crawl's seeds report
Reading your crawl's file types report