Omeka is a popular platform for hosting collection-based websites, especially among partners in libraries and archives. In general, our crawling technology can reliably archive these sites without special scoping modifications. However, each Omeka site can be unique, so it is best practice to run a test crawl and review the results before archiving a site permanently.
Possible issues with Omeka sites
Robots.txt
As with other seeds in Archive-It collections, an Omeka site might block crawling technology from accessing part or all of its contents. For instance, Omeka site templates sometimes block crawlers from the /files/ directory that contains downloadable items and thumbnail images. For complete capture of these and/or other blocked resources, see Archive-It’s guide to avoiding robots exclusions.
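If you are unsure whether a specific directory such as /files/ is excluded, you can check the site's robots.txt yourself before crawling. The sketch below is a rough illustration (not part of Archive-It's tooling) that uses Python's standard-library robots.txt parser to test whether a hypothetical Omeka URL would be allowed; the example.org domain and path are placeholders.

```python
# Rough sketch: check whether a URL would be blocked by a site's robots.txt.
# The domain and path below are hypothetical placeholders, not a real Omeka site.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.org/robots.txt")
robots.read()  # download and parse the live robots.txt

# Omeka templates sometimes disallow /files/, which holds items and thumbnails.
url = "https://example.org/files/original/sample-item.jpg"
print("Allowed for generic crawlers:", robots.can_fetch("*", url))
```

If the check returns False for content you want preserved, follow the guide to avoiding robots exclusions linked above.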
Crawler traps
Like sites made with Drupal or other content management systems (CMS), Omeka sites can create “crawler traps,” which may endlessly generate new documents for the crawling technology to capture. If a crawler trap appears to distract crawlers from the desired material or generate too much data, see Archive-It’s directions for identifying and avoiding crawler traps.
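To make the symptom easier to recognize in a crawl report, the sketch below flags URLs with unusually deep paths or long strings of query parameters, two common signs of endlessly generated documents. The thresholds and sample URLs are illustrative assumptions only, not Archive-It functionality.

```python
# Rough sketch: flag URLs that look like crawler traps in a list of crawled URLs.
# Thresholds and sample URLs are illustrative assumptions, not Archive-It logic.
from urllib.parse import urlsplit, parse_qsl

def looks_like_trap(url, max_depth=8, max_params=6):
    parts = urlsplit(url)
    depth = len([seg for seg in parts.path.split("/") if seg])
    params = parse_qsl(parts.query)
    # Very deep paths or long chains of query parameters (for example, endless
    # calendar pages or sort/filter combinations) often signal a trap.
    return depth > max_depth or len(params) > max_params

sample = [
    "https://example.org/items/show/42",
    "https://example.org/items/browse?sort_field=added&sort_dir=a&page=1984&tags=a&collection=3&output=json&view=grid",
]
for url in sample:
    print(looks_like_trap(url), url)
```

URLs flagged this way are good candidates for the scoping rules described in the crawler-trap directions linked above.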