The ARCH guide introduces many technical terms and concepts. This glossary defines some of those terms to facilitate your research journey.
Alternative text (images): A short written description of non-text content (images, multimedia, etc.) on a web page. Inserted as an attribute in an HTML document, alternative text conveys the purpose of multimedia content to people using screen readers or browsers that block images, and it also supports search engine optimization. (learn more)
Anchor text: The visible, clickable text in an HTML hyperlink. Also known as ‘link label’ or ‘link’ text. (learn more)
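As an illustration of the definition above, the following minimal sketch uses Python's standard html.parser module to pull anchor text out of a small HTML snippet. The class name AnchorTextParser and the sample markup are hypothetical and are not part of ARCH.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished (href, text) pairs
        self._inside = False  # True while an <a> element is open
        self._href = None     # href of the currently open <a>, if any
        self._chunks = []     # text pieces seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._inside = True
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._chunks).split())
            self.links.append((self._href, text))
            self._inside = False


# Hypothetical sample markup, for illustration only.
sample_html = (
    '<p>Visit <a href="https://archive-it.org/">the Archive-It '
    "<b>home page</b></a>.</p>"
)

parser = AnchorTextParser()
parser.feed(sample_html)
parser.close()
print(parser.links)
```

The text nested inside the bold tag is still counted as anchor text, because it is part of what a reader sees and clicks.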
Apache Tika: Content type detection and extraction software. This framework parses text from over a thousand different file types, which is useful for content analysis. (learn more)
ARCH (Archives Research Compute Hub): A collaboration between the Archives Unleashed Project and Archive-It to meet web archival research needs at scale.
Collocation: A series of words or terms that co-occur and become established through repeated context-dependent use. Full text datasets provided by ARCH make it possible to extract and analyze collocations from a collection’s web documents. (learn more)
Command line: A user interface that’s navigated by typing commands at prompts instead of using a mouse. The command line is one way that a computer’s operating system represents the computer’s files, directories, and programs. wget and curl are two examples of command line programs that can retrieve content from web servers.
CSV format: ‘Comma-separated values’ file format, represented by filename extension .csv. ARCH datasets can be extracted in this format, presenting data attributes and values as plain text. (learn more)
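A minimal sketch of reading CSV data with Python's standard csv module follows. The column names (domain, count) and the values are invented for illustration; the columns in an actual ARCH dataset may differ.

```python
import csv
import io

# Hypothetical sample data standing in for a downloaded .csv file.
sample_csv = io.StringIO(
    "domain,count\n"
    "archive-it.org,120\n"
    "example.com,45\n"
)

# DictReader uses the first row as column names and yields one dict per row.
rows = list(csv.DictReader(sample_csv))

# Values are read as plain text, so convert them before doing arithmetic.
total = sum(int(row["count"]) for row in rows)

for row in rows:
    print(row["domain"], row["count"])
print("total:", total)
```

To read a file on disk instead, the StringIO object could be replaced with open("your-file.csv", newline="", encoding="utf-8"), where the file name is a placeholder.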
Derivative datasets: Data built from metadata, provenance information, entities, links, and other key elements. Common examples of derivative datasets built from Web Archive (WARC) files include WAT (Web Archive Transformation), LGA (Longitudinal Graph Analysis), and WANE (Web Archive Named Entities). (learn more)
Domain (including Source Domain & Destination Domain): As part of a URL, the domain is the text-based label used to identify hosts for Internet resources, composed of an optional subdomain (e.g. www), the host name (e.g. archive-it), and the top-level domain (e.g. .org, .com, .edu). Network datasets generated by ARCH enable visualizations of domain-to-domain links, where the ‘source’ domain represents the URL of a specific web page, and the ‘destination’ domain represents the URLs of the web pages linked from the source. (learn more)
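The source and destination idea can be sketched with Python's standard urllib.parse module. The helper name domain_of, the choice to strip a leading "www.", and the sample links are all assumptions made for illustration; they do not describe how ARCH itself derives domains.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url):
    """Return the host part of a URL, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


# Hypothetical (source page, destination page) link pairs.
links = [
    ("https://www.archive-it.org/blog/post", "https://archive.org/about"),
    ("https://www.archive-it.org/home", "https://archive.org/web"),
    ("https://www.archive-it.org/home", "https://example.com/"),
]

# Count how often each source domain links to each destination domain.
edge_counts = Counter((domain_of(src), domain_of(dst)) for src, dst in links)

for (src, dst), count in edge_counts.items():
    print(src, "->", dst, count)
```

Each distinct (source, destination) pair becomes one weighted edge, which is the shape of data a network visualization tool typically expects.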
Geocoding: The process of linking a text-based description of a location (such as an address, coordinates, or name of a place) to a location on the Earth’s surface. (learn more)
Geoparsing: The process of linking free-text descriptions of places (such as colloquial or relative directions) to geographic identifiers (including coordinates, addresses, etc.). Where geocoding analyzes structured location references, geoparsing handles ambiguous references using special software or services. (learn more)
grep: A command-line utility for searching plain-text datasets for lines that match a regular expression. grep can be used to create a more manageable sample of data when working with large datasets. (learn more)
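For readers working in Python rather than at the command line, a rough equivalent of grep's line filtering can be sketched with the standard re module. The function name grep and the sample lines are hypothetical; this is not the grep utility itself and supports none of its options.

```python
import re


def grep(pattern, lines):
    """Return the lines that contain a match for the regular expression."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


# Hypothetical sample lines standing in for a large plain-text dataset.
sample_lines = [
    "2019,archive-it.org,text/html",
    "2020,example.com,application/pdf",
    "2020,archive-it.org,image/png",
]

# Keep only the lines that start with the year 2020.
matches = grep(r"^2020,", sample_lines)
print(matches)
```

Passing an open file object as the lines argument filters a file line by line in the same way.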
Hash (including MD5 and SHA1): Useful for information security, authentication, and data indexing, cryptographic hash functions are algorithms that map data of an arbitrary size (referred to as the ‘message’) to a bit array of a fixed size (called the ‘hash value’ or ‘message digest’). MD5 and SHA1 are two examples of cryptographic hash functions, and both are available outputs of ARCH Binary Information Files. (learn more)
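The mapping from a message of any size to a fixed-size digest can be seen with Python's standard hashlib module. This is a minimal sketch; the sample message is arbitrary and the variable names are illustrative.

```python
import hashlib

# The 'message' can be any sequence of bytes, of any length.
message = b"hello"

# hexdigest() returns the hash value as a hexadecimal string.
md5_digest = hashlib.md5(message).hexdigest()
sha1_digest = hashlib.sha1(message).hexdigest()

print(md5_digest)   # 32 hex characters (128 bits)
print(sha1_digest)  # 40 hex characters (160 bits)
```

Two files with identical contents produce identical digests, so comparing hash values is a common way to spot duplicate files in a dataset.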
MIME (Multipurpose Internet Mail Extensions): An Internet standard that extends the SMTP (Simple Mail Transfer Protocol) to support text in character sets beyond ASCII (American Standard Code for Information Interchange), as well as audio, video, and image attachments and other applications. ARCH Full Text and Binary Information datasets return the detected MIME type for collected documents. (learn more)
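To show what a MIME type looks like, the sketch below uses Python's standard mimetypes module. Note the assumption: this module guesses a type from the file name extension only, which is a simpler approach than detecting the type from a document's content. The file names are hypothetical.

```python
import mimetypes

# Hypothetical file names, for illustration only.
file_names = ["report.pdf", "index.html"]

# guess_type() returns a (type, encoding) tuple; the type is None
# when the extension is not recognized.
guessed = {name: mimetypes.guess_type(name)[0] for name in file_names}

for name, mime_type in guessed.items():
    print(name, "->", mime_type)
```

A MIME type has the form type/subtype, such as application/pdf or text/html.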
N-gram: A contiguous sequence of n co-occurring words, symbols, or tokens, where n represents the number of items in the sequence. N-grams are useful for natural language processing and text mining and can be generated using text analysis methods and ARCH Full Text datasets. (learn more)
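A minimal sketch of generating n-grams in plain Python follows. The function name ngrams and the sample sentence are hypothetical, and splitting on whitespace is a deliberately simple stand-in for a real tokenizer.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample text, split on whitespace.
tokens = "the quick brown fox jumps over the quick dog".split()

bigrams = ngrams(tokens, 2)

# Count how often each bigram occurs in the sample.
bigram_counts = Counter(bigrams)
print(bigram_counts.most_common(1))
```

Counting n-grams across a larger body of text is one simple way to surface frequently repeated word pairs.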
NER (Named-entity recognition): An information extraction task that classifies named entities into predefined categories, such as personal names, organizations, locations, etc. (learn more)
NLTK (Natural Language Toolkit): A leading platform for building Python programs to work with human language data. (learn more)
Node: In the context of ARCH Network datasets, nodes represent the webpages, images, domains, and documents that populate graphed networks, where edges are created via their hyperlinks.
NodeXL: A network analysis and visualization package for Microsoft Excel. (learn more)
OCR (Optical character recognition): The electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text. Software like Tesseract and Poppler can be used in conjunction with ARCH PDF datasets to perform OCR and text extraction required for further analysis. (learn more)
OpenRefine: An open-source application for data cleanup and transformation to other formats. This tool can help simplify tasks required to work with messy data. (learn more)
pandas: An open source data analysis and manipulation software library, built on top of the Python programming language. pandas can be used to create a more manageable sample of data when working with large datasets. (learn more)
Python: A programming language widely used to build websites and software and conduct data analysis. Python-supported software services can parse ARCH output files to facilitate research. (learn more)
R: A programming language widely used to develop statistical software and data analysis tools. R-supported software services can parse ARCH output files to facilitate research. (learn more)
Seed: Within Archive-It, a seed is identified by a URL (a website, a specific directory, or a specific document) and acts as 1) the starting place for a web crawl and 2) an access point for archived web documents within a collection. Seeds and the scoping rules applied when crawling will impact the precision and recall of documents available for archival or research purposes. (learn more)
Sentiment analysis: The process of identifying the subjective information in a collection of text, such as opinions, appraisals, emotions, or attitudes towards a specific topic.
Topic modeling: A type of statistical modeling for discovering the semantic topics or structures that occur in a body of text. (learn more)
Voyant: A free web-based text analysis platform, useful for quickly and easily visualizing data and exporting visualizations. (learn more)