**Active**
Description of a collection or seed that can have, or be included in, scheduled crawls.

**Archive**
The process of copying digital information into a repository for storage, preservation, and access purposes. In web archiving, often synonymous with capture.

**Brozzler**
A distributed web crawler that uses a real browser (Chrome or Chromium) to fetch pages and embedded URLs and to extract links. It also uses youtube-dl to enhance media capture capabilities.

**Collection**
A group of archived web documents curated around a common theme, topic, or domain.

**Crawl**
A web archiving (or "capture") operation conducted by an automated agent, called a crawler, a robot, or a spider. Crawls identify materials on the live web that belong in your collections, based upon your choice of seed URLs and scope. Crawl can also refer to the archived content associated with the operation.

**Crawl budget**
The amount of data that may be collected at a given subscription level.

**Crawl frequency**
The rate at which you set your seeds to be crawled. The frequency is set on a per-seed basis and can be one time, twice daily, daily, weekly, monthly, bi-monthly, quarterly, semiannual, or annual.

**Crawler**
An automated agent that explores the web and collects data about its contents. A crawler can also be configured to capture web-based resources. It starts a capture process from a seed list of entry-point URLs (EPUs).
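The basic loop is easy to sketch. Below is a minimal, illustrative Python crawler, not Archive-It's crawling software: it works through a queue of URLs starting from a seed list, fetches each page, and extracts links using only the standard library.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=10):
    """Breadth-first crawl starting from a seed list of entry-point URLs."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip pages that fail to load
        fetched += 1
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen

# crawl(["https://example.com/"])
```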
**Crawler Trap**
Part of a site that can generate an infinite number of (often invalid) URLs, for example a calendar that links endlessly to the next day or month.
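One common defense, sketched below with an assumed threshold, is to reject URLs whose paths repeat the same segment suspiciously often; production crawlers such as Heritrix ship more sophisticated rules.

```python
from urllib.parse import urlparse

def looks_like_trap(url, max_repeats=3):
    """Flag URLs whose path repeats a segment suspiciously often
    (max_repeats is an illustrative threshold, not a real setting)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return any(segments.count(s) > max_repeats for s in set(segments))

print(looks_like_trap("http://example.com/cal/next/next/next/next/page"))  # True
print(looks_like_trap("http://example.com/about/team.html"))               # False
```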
**Curator**
Anyone responsible for building a collection or collections of web-based resources, including those who specify seed lists for specific crawls.

**Data De-duplication**
The in-crawl process of checking for identical content, in an effort to avoid collecting the same data more than once. For specific information on how this works, see About data de-duplication.
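At its simplest, de-duplication hashes each fetched payload and skips storage when the digest has been seen before. The sketch below illustrates that idea only; it is not Archive-It's exact mechanism (see About data de-duplication for that).

```python
import hashlib

class DedupStore:
    """Track content digests so identical payloads are stored only once."""
    def __init__(self):
        self.seen_digests = set()

    def should_store(self, payload: bytes) -> bool:
        digest = hashlib.sha1(payload).hexdigest()
        if digest in self.seen_digests:
            return False  # duplicate: record a pointer, not the bytes again
        self.seen_digests.add(digest)
        return True

store = DedupStore()
assert store.should_store(b"<html>page</html>") is True
assert store.should_store(b"<html>page</html>") is False  # de-duplicated
```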
**Directory**
A segment of a host domain in which individual files and/or further directories can be found, similar to how folders organize content in your computer's local file system. Most, but not all, websites use a directory structure. More information: http://www.linfo.org/directory.html.

**Document**
Any file with a unique URL: HTML, image, PDF, video, etc.

**Domain**
The root of a host name, for example .com, .gov, or .org.

**Dublin Core**
The metadata standard used by Archive-It. This standard has 15 fields that can be used to describe any kind of digital artifact, in this case an archived web page. More information: http://www.dublincore.org/documents/dces/.
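As an illustration only, a few of the 15 elements applied to an archived page might look like this (all values here are hypothetical):

```python
# A handful of the 15 Dublin Core elements, with hypothetical values
# describing an archived web page.
dublin_core = {
    "title": "Example City Council Homepage",
    "creator": "Example City Council",
    "date": "2024-01-15",
    "format": "text/html",
    "identifier": "https://example.gov/",
    "language": "en",
}
```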
**Dynamic**
Description of web-based content created automatically by software at the web server end. May be (a) personalized for the user based on identification via login or on cookies stored on the user's computer, (b) tailored to fulfill a specific request made by the user, or (c) code-generated (e.g., using PHP, JSP, ASP, or XML). Information used for personalization or tailoring of pages may be retrieved in real time from a database or other data store.

**Elasticsearch**
An open-source search engine used by Archive-It to make archived websites full-text searchable. More information: https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html.

**Hadoop**
A computing framework used by Archive-It to process, index, and distribute storage of our partners' archived data.

**Heritrix**
The name of Internet Archive's open-source, extensible, web-scale, and archival-quality web crawler project. An archaic word for heiress (a woman who inherits). More information: https://github.com/internetarchive/heritrix3/wiki.

**Host**
Where web content is stored, or a single networked machine, usually designated by its Internet host name (e.g., archive.org). The host name can be identical to a URL's domain name, but not always.

**Inactive**
Description of a collection or seed that does not undergo regular, scheduled crawling, but which may, at the partner's discretion, remain publicly visible and searchable.

**Internet Archive**
A non-profit digital library seeking to provide universal access to all knowledge. Archive-It is a subscription service of the Internet Archive: https://archive.org.

**MIME**
Stands for Multipurpose Internet Mail Extensions, a specification for formatting content to be sent over the Internet. A MIME type can describe just about any kind of file, e.g., GIF, JPG, HTML, etc. Archive-It provides a MIME report of all the different types of files archived during each crawl.
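Python's standard library can map file names to MIME types, which gives a feel for what a MIME report summarizes:

```python
import mimetypes

for name in ["logo.gif", "report.pdf", "index.html", "clip.mp4"]:
    mime_type, _ = mimetypes.guess_type(name)
    print(f"{name}: {mime_type}")
# logo.gif: image/gif
# report.pdf: application/pdf
# index.html: text/html
# clip.mp4: video/mp4
```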
**One Page**
A crawling protocol that directs our crawler to archive a given seed URL as a single web page, including content necessary to render that page faithfully but not following links to other web pages.

**One Page Plus**
A crawling protocol that captures one document external to your crawl's default scope if there is a link to it from an in-scope page scheduled to be crawled.

**Patch Crawl**
A crawl to capture and patch in documents that were not captured in your original crawl.

**Persistent name**
A unique name assigned to a web-based resource that will remain unchanged regardless of the resource's movement from one location to another or changes to its URL. Persistent names are resolved by a third party that maintains a map of the persistent name to the current URL of the resource.

**Quality Assurance (QA)**
Tools and practices for reviewing and improving the quality of captures.

**Regular Expression (Regex)**
A pattern used to match character combinations in strings in order to find and replace strings that follow a defined format.
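For example, a regular expression can match every URL under a date-based path, the kind of pattern used in scoping rules. The pattern and URLs below are hypothetical:

```python
import re

# Match /YYYY/MM/ date paths, as a scoping rule might.
pattern = re.compile(r"^https?://example\.com/\d{4}/\d{2}/")

urls = [
    "https://example.com/2023/07/post.html",
    "https://example.com/about.html",
]
for url in urls:
    print(url, "matches" if pattern.match(url) else "does not match")
```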
**Repository**
The physical storage location and medium for one or more digital archives. A repository may contain an active copy of an archive (i.e., one that is accessed by end users) or a mirror copy for disaster recovery.

**robots.txt**
A file that a site owner can add to their site to keep crawlers from accessing all or parts of it.
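Below is a minimal, hypothetical robots.txt and how a well-behaved crawler can check it using Python's standard library:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks one directory for all crawlers.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("MyBot", "http://example.com/private/page.html"))  # False
print(parser.can_fetch("MyBot", "http://example.com/public/page.html"))   # True
```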
**Scope**
What the crawler will and will not capture. Scoping refers to the options for telling the crawler how much or how little of a seed URL to capture. Archive-It options include seed-level and collection-level scoping.
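Conceptually, a seed-level scope test can be as simple as a prefix check on the seed URL, as in this illustrative sketch (real scoping rules are far richer):

```python
def in_scope(candidate_url, seed_url):
    """Illustrative seed-level scope test: in scope if the candidate
    falls under the seed URL's path prefix."""
    return candidate_url.startswith(seed_url)

seed = "https://example.com/blog/"
print(in_scope("https://example.com/blog/2023/post.html", seed))  # True
print(in_scope("https://example.com/shop/item", seed))            # False
```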
**Seed**
An item in Archive-It with a unique ID number. The seed URL tells the crawler where to go on the live web and acts as an access point to archived content.

**Seed Type**
A crawling protocol that tells the crawler how many links to follow off of a seed URL. Options are Standard, Standard Plus, One Page, or One Page Plus.

**Seed URL**
The starting-point URL for a crawler and the access point to archived collections.

**SOLR**
An open-source search platform that enables metadata-based search for Archive-It.

**Standard**
A crawling protocol (seed type) that directs our crawler to archive your seed URLs with its default scoping rules. Crawling technology: Heritrix (H3) and Umbra.

**Standard Plus**
A crawling protocol that directs our crawler to archive your seed URLs with default scoping rules, plus any otherwise "out of scope" external content directly linked from those seed URLs.

**Sub-domain**
A segment of a host name that precedes the root web address, for example crawler.archive.org, in which crawler is the sub-domain.

**Umbra**
A browser-based technology that Archive-It uses during the crawl process to navigate the web more as human viewers experience it.

**URL**
Stands for Uniform Resource Locator: the location of a resource on the web.
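Python's urllib.parse splits a URL into the parts defined above: the host (sub-domain plus domain) and the path (directories and document):

```python
from urllib.parse import urlparse

parts = urlparse("https://crawler.archive.org/docs/guide/index.html")
print(parts.scheme)  # https
print(parts.netloc)  # crawler.archive.org  (host: sub-domain + domain)
print(parts.path)    # /docs/guide/index.html  (directories + document)
```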
**WARC File**
An open format developed by the Internet Archive and an ISO standard (CD 28500) for web archives. Made up of disaggregated WARC records (captures coming from different hosts). Typically 1 GB or less in size each.

**WARC Record**
Represents the capture of a distinct URL within a larger WARC file container. Records the archive date, content type, and archive length, as well as the raw byte stream.
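Outside the Archive-It application itself, open-source libraries can read these records; the sketch below uses Webrecorder's warcio library (an assumption here, not part of the Archive-It service):

```python
# Requires: pip install warcio
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            date = record.rec_headers.get_header("WARC-Date")
            print(date, url)
```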
**Wayback Machine**
Internet Archive's general/global web archive. The Wayback Machine is a piece of software that makes archived websites viewable as if they were on the live web.

**Web archive**
A collection of web-published materials for which an institution has either made arrangements or accepted long-term responsibility for preservation and access, in keeping with the archive's user access policies. Some of these materials may also exist in other forms, but the web archive captures the web versions for posterity.

**Web Archiving Service**
A service that enables curators to build collections of web-published materials stored in local and/or remote repositories. It includes a set of tools for the selection, curation, and preservation of the archives, as well as repositories for storage, preservation services (e.g., replication, emulation, and persistent naming), and administrative services (e.g., templates for collection strategies, content provider agreements, and repository provider agreements). Archive-It is a web archiving service.

**Website**
A collection of related web resources, usually grouped by some common addressing, as when all resources on a single host, or group of related hosts, are considered a "website".