Institutions often want to archive only the PDF files that are part of a site. With our "PDF only crawl" feature, you can crawl entire sites, but ONLY the PDF files will actually be archived and counted toward your account's data budget.
Please keep in mind that Brozzler is not yet configured for PDF-only crawls. When running a PDF-only crawl, please use our Standard crawling technology (Heritrix and Umbra).
To use this feature, begin by navigating to the list in the "Crawls" tab for your chosen collection. For any crawl that you wish to limit to PDFs only, click on its corresponding "Edit Limits" button.
Then, in the dialog box that this button opens, click the check-box next to the "PDF only crawl" option.
End by clicking the "Modify Limits" button. The next time this crawl runs, it will archive only PDFs.
When a PDF-only crawl is finished, you will be able to access the archived PDF documents by navigating to the File Types tab and clicking on the PDF file type. The Wayback link on the Seeds tab will not function because the HTML seed page is not captured.
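The behavior described above (the whole site is crawled, but only PDFs are kept and counted) can be pictured with a small conceptual sketch in Python. This is not the crawler's actual implementation: the `FetchedResource` class, the function names, the example URLs and sizes, and the idea of identifying PDFs by their `application/pdf` content type are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FetchedResource:
    """Hypothetical record of one resource visited during a crawl."""
    url: str
    content_type: str
    size_bytes: int


def archive_pdf_only(fetched):
    """Keep only resources whose content type is PDF (assumed rule)."""
    return [
        r for r in fetched
        # Drop any parameters such as "; charset=utf-8" before comparing.
        if r.content_type.split(";")[0].strip().lower() == "application/pdf"
    ]


def budget_used(archived):
    """Total bytes that would count toward the data budget."""
    return sum(r.size_bytes for r in archived)


# Made-up crawl: the HTML seed is visited so links can be discovered,
# but it is not among the archived resources.
fetched = [
    FetchedResource("https://example.org/", "text/html; charset=utf-8", 12000),
    FetchedResource("https://example.org/reports/a.pdf", "application/pdf", 250000),
    FetchedResource("https://example.org/style.css", "text/css", 3000),
    FetchedResource("https://example.org/reports/b.pdf", "application/pdf", 100000),
]

archived = archive_pdf_only(fetched)
for resource in archived:
    print(resource.url)
print(budget_used(archived))
```

In this sketch the seed page `https://example.org/` is fetched but filtered out, which mirrors why the Wayback link on the Seeds tab has nothing to replay while the PDFs remain reachable through the File Types tab.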