The 2021 Archive-It Partner Meeting was held virtually on September 29, 2021. For presentation materials and recordings from the event, see: Archive-It Partner Meeting Presentations, 2021. Live talks, speaker panels, and Q&A were followed by self-led rotating discussion groups on a variety of web archiving topics. Notes from some of these discussions are below.
- Access and Use
- Archive-It APIs and Integrations
- Collection Development
- Metadata and Description
- Quality Assurance
- Small Organizations and Lone Arrangers
Access and Use
- Many partners agree that access to and use of Archive-It collections could be improved. There isn't enough awareness, especially among researchers, that web archives are a valuable resource they can consult. Public service/liaison librarians can help promote the use of web archives to their users.
- Many collections are so recent that there will not be much demand for them for a few years yet. Tracking how collections are accessed and used, including data mining, is of great interest to partners and an overarching goal for many. Archives Unleashed was discussed as a tool that might be of help with this.
Archive-It APIs and Integrations
- One partner has had some success downloading WARCs directly to their storage server, even from their remote work location. However, this process can be error-prone over the long term. Archive-It could help by building some resiliency guidance into its Help Center documentation.
- University of North Texas Libraries’ WASAPI client was endorsed as a useful tool.
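One way to make long-running WARC downloads more resilient is to wrap each transfer in a retry loop with exponential backoff. The sketch below is illustrative only — the helper names are hypothetical and it is not part of the UNT WASAPI client or any Archive-It tool; the actual fetch function would wrap an HTTP client hitting a WASAPI file URL:

```python
import time

def download_with_retry(fetch, url, attempts=3, backoff=0.1):
    """Call fetch(url) until it succeeds, retrying transient failures
    with exponential backoff. `fetch` is any callable that returns the
    file's bytes and raises OSError on a network error.
    (Hypothetical helper for illustration, not an Archive-It API.)"""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the error
            time.sleep(backoff * 2 ** attempt)

# Demo with a flaky fetcher that fails twice, then succeeds.
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("connection reset")
    return b"WARC/1.0 ..."

data = download_with_retry(flaky_fetch, "https://example.org/file.warc.gz")
```

In practice the backoff base would be a few seconds rather than a fraction of one; the small value here just keeps the demo fast.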
- There was some good discussion around imagining an ArchivesSpace integration:
- Noted problem: the data models of the two systems don't align at the top level, requiring many layers or stages of migration. Consider, for instance, how collections, seeds, etc., do or don't map neatly to ASpace item-level descriptions. What if we had another kind of entity that stood on its own, not tied to seeds or collections in the AIT web app?
- UAlbany uses the CDX API to harvest URLs and then describe them at their own item level.
- The question of DACS compliance was raised.
- The biggest use case is updating extents and dates.
- When describing at the seed level, changing URLs could conceivably break the continuity of an integration. Another case for having another entity to represent “sites” or “items”, etc.
- There was some discussion around scheduled updates (e.g. cron jobs) rather than user-initiated pushes or pulls.
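UAlbany's CDX-based approach could be sketched roughly as follows. The endpoint pattern follows Archive-It's CDX/C API as commonly documented, but the exact URL, parameters, and field layout should be verified against current documentation; the collection ID and helper names here are hypothetical:

```python
def cdx_query_url(collection_id, seed_url, match_type="prefix"):
    """Build a CDX query URL for one Archive-It collection.
    (Endpoint pattern per Archive-It's CDX/C API; confirm against
    current documentation before relying on it.)"""
    return (f"https://wayback.archive-it.org/{collection_id}/timemap/cdx"
            f"?url={seed_url}&matchType={match_type}")

def parse_cdx(text):
    """Parse space-delimited CDX response lines into
    (timestamp, original-URL) pairs; in the usual 11-field CDX
    format these are the second and third columns."""
    rows = []
    for line in text.splitlines():
        fields = line.split(" ")
        if len(fields) >= 3:
            rows.append((fields[1], fields[2]))
    return rows

# One sample CDX line (urlkey, timestamp, original, mimetype, status,
# digest, redirect, robotflags, length, offset, filename).
sample = ("org,example)/ 20210929120000 https://example.org/ text/html "
          "200 ABC - - 1234 5678 file.warc.gz")
pairs = parse_cdx(sample)
```

The harvested (timestamp, URL) pairs could then feed item-level description in ArchivesSpace, whether triggered manually or from a scheduled job.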
- Experiences with ARCH, our add-on service that is currently in its beta stage, have been positive so far, with data cited as easier to work with than before (e.g. at datathons).
- Brozzler was cited as being more successful than Standard (Heritrix) for collecting the functionality of pages that employ POST requests, like "Load More" buttons. In most cases, features that employ these requests still don't work, or at least don't replay in Wayback.
- Some partners cited external web archiving tools (such as Conifer) as being useful for collecting sites that need a "human touch" while crawling, such as dynamic content that is highly dependent on human interaction. One partner had success using Conifer to capture and replay the functionality of a POST request in Wayback.
- There was some debate around which capture technology works best for collecting Wix sites: Brozzler, or Standard (Heritrix) with additional scoping rules. Many student groups use Wix to build their sites, and some institutions have to archive many Wix domains (e.g. Harvard archives 20). Replay of Wix sites is inconsistent; although there has been some improvement with our recent Python Wayback deployment, the upgrade has not helped in all cases. Most sites built with Wix use custom URLs, so they aren't identified as Wix when added as seed URLs, and Archive-It's default scoping rules are not automatically applied.
Collection Development
- Partners underlined the various strategies for approaching collection development in web archives. Some are archiving institutional websites in the absence of a records schedule for traditional institutional records. Many web archiving decisions are guided by the pre-existing collection development policies for traditional archives (or, where it already exists, web archiving policies). They follow the administrative hierarchy of the institution to decide which sites they should collect. Regardless of the approach, it’s good practice to create detailed collection policies and notes as legacy documentation for future web archivists.
- Ethical considerations for archiving certain sites were also raised. Event-based web archiving is more urgent, leaving less time for permission-seeking, so collection decisions have to become more strategic to protect privacy. When crawling the sites of small grassroots organizations or individuals, it’s important to establish relationships with them. Documenting the Now and Project Stand were mentioned as useful resources for learning more about collecting content generated by important social movements.
- Barriers to collection development include limited resources, such as data budgets. One partner keeps a list of seeds that they decided not to crawl for future reference. Other challenges include trying to collect different, even opposing, perspectives within the same archives.
Metadata and Description
- Discussions included comparing practices, strategies for improving old collections that don't have any existing metadata, and current OCLC recommendations such as adding metadata at the seed level. Many partners are using Dublin Core for seed level metadata and one partner described using the API to pull the descriptive metadata out to do an audit. Other topics included:
- Accessioning seeds
- The process of manually updating metadata
- How to describe content drift and redirects
- Using ArchivesSpace
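The API-driven metadata audit mentioned above could be sketched like this. The record shape (a list of seed dicts with a `metadata` mapping) and the required-field list are assumptions for illustration, not the actual Archive-It Partner Data API response format, which should be checked against current documentation:

```python
# Illustrative Dublin Core elements to require; adjust to local policy.
REQUIRED_DC_FIELDS = ["title", "creator", "date", "description"]

def audit_seed_metadata(seeds, required=REQUIRED_DC_FIELDS):
    """Return {seed_url: [missing fields]} for every seed whose
    'metadata' dict lacks (or has an empty value for) a required
    Dublin Core element. The record shape here is assumed."""
    gaps = {}
    for seed in seeds:
        metadata = seed.get("metadata", {})
        missing = [f for f in required if not metadata.get(f)]
        if missing:
            gaps[seed["url"]] = missing
    return gaps

# Two hypothetical seed records, as they might be pulled via the API.
seeds = [
    {"url": "https://example.org/",
     "metadata": {"title": "Example", "creator": "Org",
                  "date": "2021", "description": "A site"}},
    {"url": "https://example.com/",
     "metadata": {"title": "Example 2"}},
]
report = audit_seed_metadata(seeds)
```

A report like this makes it easy to spot older collections with no existing metadata and prioritize them for description.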
Quality Assurance
- Quality assurance was cited as a big time investment, especially if it is to be performed routinely. Many partners only have time for cursory quality assurance but will spend extra time on more important content. One question that always comes up is what counts as "good enough," especially when communicating expectations to stakeholders.
- It's very difficult to QA social media sites when what a given platform is "expected" to look like in Wayback is always changing. Archive-It's System Status page helps but there remain some difficulties and unpredictability.
- Other external web archiving tools, such as Conifer, are helpful for filling in some gaps that are identified during quality assurance.
Small Organizations and Lone Arrangers
- One of the biggest challenges for small organizations seems to be lack of time. Web archiving is just one responsibility of many for them, and they need to decide how much time to spend on web archiving (e.g. is 1 day a week enough?). Another big challenge is the "maintenance inflation" issue that crops up once collections grow to a certain size over time.
- Other challenges include: maintaining sustainable collections; thoughtfully developing new collections to be sustainable; justifying time spent on web archiving to higher-ups in order to get more staff/students/interns involved; managing others' expectations to collect more (especially social media); allocation of budgets and personnel; and setting up a coalition or working group in the organization in order to train others.
- Some questions that were raised are:
- How to get more resources allocated to web archiving?
- How to get supervisors to understand how time is spent while web archiving?
- Top priorities for small organizations doing web archiving were cited as setting up the initial crawls, scoping, and adding metadata. Another top priority is fixing broken captures as soon as problems arise (e.g. COVID dashboards).
- Next in the priority pipeline for many partners are tasks like scheduling crawls, performing quality assurance, and linking web archives to public databases for researchers.