Partners can archive streaming audio and video files from the Internet Archive (archive.org) that are embedded in the pages they collect. To capture these archive.org embeds, apply the following crawl scope modifications:
- Ignore robots.txt (either entirely at the seed level or on the host archive.org at the collection level)
- Expand scope to include URL if it matches the regular expression:
^(https?:)?\/\/[a-z0-9.-]*archive.org\/(.*\/|)(items|download|includes)\/.*$
With these rules in place, audio and video on captured pages should play normally.
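Before starting a crawl, it can help to check which embed URLs the expression above would bring into scope. The following is a minimal sketch in Python, not part of the crawler itself: the function name `in_expanded_scope` and all sample URLs are hypothetical, and it assumes that Python's `re` module treats this particular pattern the same way the crawler's regular expression engine does.

```python
import re

# Pattern copied verbatim from the scope rule above.
ARCHIVE_EMBED = re.compile(
    r"^(https?:)?\/\/[a-z0-9.-]*archive.org\/(.*\/|)(items|download|includes)\/.*$"
)


def in_expanded_scope(url: str) -> bool:
    """Return True if the URL matches the expand-scope expression.

    Hypothetical helper for local testing only; the crawler applies
    the rule itself once it is saved in the collection or seed scope.
    """
    return ARCHIVE_EMBED.match(url) is not None


# Hypothetical sample URLs, chosen only to illustrate the pattern.
samples = [
    "https://archive.org/download/example-item/example.mp3",
    "//ia800000.us.archive.org/12/items/example-item/example.mp4",
    "https://archive.org/includes/player.js",
    "https://archive.org/details/example-item",
    "https://example.com/download/file.mp3",
]

for url in samples:
    label = "in scope" if in_expanded_scope(url) else "not matched"
    print(f"{label:12} {url}")
```

In this sketch the first three URLs match, because each contains an `items`, `download`, or `includes` path segment on an archive.org host, with or without a scheme. The `details` page and the non-archive.org URL do not match, so this rule alone would not add them to scope.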