The 2020 Archive-It Partner Meeting was held virtually on October 7, 2020. For presentation materials and recordings from the event, see: Archive-It Partner Meeting Presentations, 2020. Live talks and Q&A were followed by rotating discussion groups on topics chosen by the attendees and organizers. Notes from these discussions are below.
- AMA: Jefferson Bailey and James Kafader
- Archive-It APIs and Integrations
- Collaborative Web Archiving
- COVID-19 Collecting
- Metadata and Description
- New Web Archivists
- Quality Assurance
- Urgent and Spontaneous Collections
Ask Me Anything with Director of Archive-It & Web+Data Services, Jefferson Bailey, and Archive-It Engineering Manager, James Kafader
- The Archive-It team will develop support for exporting collections to DPLA. There are concerns about metadata crosswalks, but those are easing. This could include a new GUI layer for partners to curate their exports from Archive-It to DPLA.
- There is also desire for a community-wide integration layer supported on ArchivesSpace's end, to build upon the few custom integrations that already exist.
Archive-It APIs and Integrations
- Most participants are interested in access-oriented integrations and APIs: opportunities to provide end-user access beyond the default pages on archive-it.org. Where most get stuck is early in the implementation: they need more time to learn the existing options and more openly documented precedents to build upon. For ArchivesSpace in particular, the Bentley Historical Library’s integration was cited multiple times as a good starting point to learn from. More precedents are needed for implementations of the OpenSearch full-text search API.
- The University of Georgia’s presentation today was cited as a good lesson in using WASAPI and a homemade script to ingest WARC files into local preservation repository storage. A GUI for the Archive-It web application might lower the bar significantly for others who want to download WARCs without interacting with WASAPI from the command line, though relying on a browser for that volume of data movement is tricky. In the meantime, UNT’s and Stanford’s own WASAPI clients were cited as good examples or tools to adopt.
- To improve upon WASAPI, Archive-It should make it easier to correlate the WARC files retrieved to their collections’ and seeds’ descriptive metadata. As it is, this has to be done on-the-fly by looking up the seed numbers in WARC filenames and performing more Partner API queries.
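The filename-based correlation described above can be sketched roughly as follows. This is a minimal illustration, not Archive-It's own tooling: the filename layout in the regular expression (`ARCHIVEIT-<collection>-<crawl type>-JOB<job>[-SEED<seed>]-...`) and the shape of the file entries (dicts with `filename` and `locations` keys, as in a WASAPI-style listing) are assumptions, and `parse_warc_name`, `group_by_seed`, and the sample data are hypothetical names and values made up for the example. No network calls are made.

```python
import re
from collections import defaultdict

# ASSUMPTION: WARC filenames follow this layout; adjust the pattern to match
# the filenames actually returned for your account.
WARC_NAME = re.compile(
    r"^ARCHIVEIT-(?P<collection>\d+)-(?P<crawl_type>[A-Z_]+)-JOB(?P<job>\d+)"
    r"(?:-SEED(?P<seed>\d+))?-"
)


def parse_warc_name(filename):
    """Extract collection, job, and (optional) seed IDs from a WARC filename."""
    match = WARC_NAME.match(filename)
    if match is None:
        return None
    seed = match.group("seed")
    return {
        "collection": int(match.group("collection")),
        "job": int(match.group("job")),
        "seed": int(seed) if seed else None,
    }


def group_by_seed(files):
    """Group download locations by (collection ID, seed ID).

    `files` is an iterable of WASAPI-style entries (hypothetical shape):
    dicts with a 'filename' string and a 'locations' list of URLs.
    Entries whose filenames do not match the assumed layout are skipped.
    """
    grouped = defaultdict(list)
    for entry in files:
        ids = parse_warc_name(entry["filename"])
        if ids is None:
            continue
        grouped[(ids["collection"], ids["seed"])].append(entry["locations"][0])
    return dict(grouped)


# Made-up sample data for illustration only.
sample_files = [
    {
        "filename": "ARCHIVEIT-1234-TEST-JOB567-SEED890-20201007120000000-00000.warc.gz",
        "locations": ["https://example.org/a.warc.gz"],
    },
    {
        "filename": "ARCHIVEIT-1234-MONTHLY-JOB568-20201007130000000-00000.warc.gz",
        "locations": ["https://example.org/b.warc.gz"],
    },
]

grouped = group_by_seed(sample_files)
```

The resulting keys could then drive the follow-up metadata lookups, one per collection or seed rather than one per file, which is where most of the on-the-fly work described above would be saved.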
Collaborative Web Archiving
- Several attendees collaborate across institutions in order to prevent duplications of effort, especially around special projects and collecting areas. This is also seen as a way to support more focused work, such as item level description of resources like seeds.
- These collaborations have different but overlapping approaches to nominations and permissions. Participants wonder how much to engage nominators vs. automate the process for the public. One suggestion is to clearly mark some collections as open to nominations on their public-facing portals. Others suggest a dedicated forum/platform that all Archive-It partners can see and choose nominations from. The UCLA-led Cobweb project is cited as an example here, but it is not clear if that platform is supported now that its grant funding period is over.
- Similarly, some participants see value in making it easier to connect seeds or archived documents to the partner/s who already collect them before they are added to additional Archive-It accounts and collections.
- How can collaborative projects better interlink their collections on archive.org to their web archives on archive-it.org?
COVID-19 Collecting
- Archive-It saw a huge jump in interest in web archiving at the beginning of the COVID-19 lockdowns in the spring. Most partners are academic librarians and archivists from universities and colleges, but increased interest also came from historical societies, municipal archives, corporate archives, and individual researchers doing specific research on COVID-19.
- After the increase in interest, Archive-It announced the COVID-19 Web Archiving Campaign to help institutions sign up to create COVID-19 collections with a discounted rate. It also offered to increase data for existing partners with significant cost sharing.
- The Community Webs program for public libraries and related historical societies also continues to grow around this topic. Community Webs partners have the opportunity to attend program-specific group calls, meetings, and events.
Metadata and Description
- Most partners said they describe both collections and seeds, but not to the extent they would like, especially at the seed level. One said they try to provide an important subset of the elements for each seed; another said that they use at least a baseline including Title, Date range, and Description elements for each seed URL. Another partner said that they are interested in describing some items at a document level, particularly video content. Others said they have integrated this description with the archival finding aids on their own institutions’ websites.
- One partner said they follow the OCLC WAM's recommendations pretty tightly since they have a WAM working group member on staff. Others said they follow other standards to guide their metadata, such as DACS or ISAD(G). Some crosswalk them from DACS. Others said their application of metadata is pretty ad hoc, since it is often a manual process.
- The question of user reactions and user needs came up. Most reported that they haven’t heard any user reactions to their added descriptions; one said that they only know of their colleagues' reactions to the occasional item. Another responded that they try to provide transparency for tracking the selection of URLs using Custom Fields (a User Need mentioned in the OCLC WAM recommendations).
- This ignited some in-depth discussion on the use of Custom Fields. Some used them to track the changes in seed URLs over time. Others used these fields for reaccreditation notes or tied them to institutional records. One mentioned they recently used them to track boutique subjects, like COVID. Another mentioned they would like to use them to give end users special instructions for using the archived platforms, particularly for replaying media items (but had concerns about it resulting in facets on the public site, and wished for flexibility not to facet). This brought up the challenge of ongoing maintenance of metadata when replay improves on a seed, and how to signal that both to the metadata practitioner and the end users.
- A strong desire to automate more of the descriptive process was mentioned often here, with the challenge of ongoing maintenance of description cited most often as the reason. Some have succeeded with integrating parts of their processes with ArchivesSpace and the CDX API; others expressed a great interest in making strides in this direction. Another mentioned that their institution bulk uploads descriptions and tries to put in the work up front to minimize maintenance later.
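As one small illustration of the automation discussed above, a seed's "Date range" element could be derived from capture records rather than typed by hand. The sketch below assumes CDX lines in the common space-separated layout, where the second field is a 14-digit `YYYYMMDDhhmmss` timestamp. The function name `capture_date_range` and the sample lines are hypothetical, and fetching the records from a CDX endpoint is deliberately left out.

```python
from datetime import datetime


def capture_date_range(cdx_lines):
    """Return (first, last) capture dates as ISO strings, or None if no captures.

    ASSUMPTION: each line is space-separated and its second field is a
    14-digit capture timestamp (YYYYMMDDhhmmss). Lines that do not fit,
    such as a CDX header line, are skipped.
    """
    stamps = []
    for line in cdx_lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        stamp = fields[1]
        if len(stamp) != 14 or not stamp.isdigit():
            continue
        stamps.append(datetime.strptime(stamp, "%Y%m%d%H%M%S"))
    if not stamps:
        return None
    return min(stamps).date().isoformat(), max(stamps).date().isoformat()


# Made-up sample records for illustration only.
sample_cdx = [
    " CDX N b a m s k r M S V g",
    "org,example)/ 20201007093000 http://example.org/ text/html 200 DEF456 2048",
    "org,example)/ 20200315120000 http://example.org/ text/html 200 ABC123 1024",
]

date_range = capture_date_range(sample_cdx)
print(date_range)
```

The resulting pair could feed a bulk metadata upload, so the date range stays current as new crawls run instead of needing manual upkeep.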
New Web Archivists
- Selection and curation: How to decide what seeds go into what collections? How best to organize? Participants stress that individual collections have value even if the same content is also being captured by other organizations.
- Partners underline the importance of going through the test crawl process before scheduling crawls to run automatically.
- There is interest generally in using custom metadata fields for description.
- The Archive-It web application's UI for uploading images to collections is confusing and could use better documentation/instruction.
Quality Assurance (QA)
- The majority of partners work in small teams or individually. Few partners have the resources to delegate QA work. The amount of resources needed for adequate QA is a blocker to archiving workflows for some partners, especially with regard to time. Multiple partners mentioned that QA is often not feasible given how many seeds their organization crawls.
- The most common QA documentation tool is a spreadsheet. Most partners take a holistic approach towards the archiving process and track seeds from initiation to QA. (One has taken it a step further and created a database for this process.) Consensus is that spreadsheets are a good way to keep track of issues, dates, amount of data, patch crawls, comments (important part of the QA metadata), ticket responses, etc.
- Pros: Documentation! More documentation allows archiving staff to perform QA at a later date, and partners have their own records if Archive-It tech has issues.
- Cons: Partners need to devote resources (time, personnel, etc.) up-front, and the process involves technical knowledge of spreadsheets (or databases).
- Discussion suggestions for improving QA workflows in the meantime:
- Build a decision tree for the QA processes.
- Include site snapshots/screenshots so QA can be done at a later date.
- Download WARCs and load sites in other archiving platforms (e.g. Conifer, Webrecorder software) or archive the relevant sites using another web archiving service in order to check if Wayback replay has issues beyond Archive-It's platform.
- How to report back to supervisors/stakeholders when a capture doesn’t look perfect, and how to explain the complexity of the QA process?
- Explaining why a capture, although imperfect, is as good as it can currently be, and why there is trouble with capture/replay.
- Social media issues: trying to communicate that a platform has been captured, even when it doesn’t seem like it.
- For anyone not on the ground floor doing web archiving work, it can be hard to grasp the level of item level processing that is done for each seed. Item level processing is usually not done to such a great extent in traditional archives.
- Documentation can help! This includes taking screenshots of the process from beginning to end, to reflect its complexity and length.
Relevant feature requests:
- The ability to automate and run Wayback QA on the entire website at once, instead of page by page.
- Develop a Proxy Mode Add-On for browsers other than Firefox.
- Develop a QA checklist that partners can use within the web app as they move through the process.
Urgent and Spontaneous Collections
- There is a public expectation that social media will be archived, which can lead to headaches for partners - be ready! There are QA and privacy considerations to make here before going too far. At least one partner uses a public nomination form for these reasons, including the option to contribute anonymously.
- The point was raised that archivists don’t necessarily want to capture survivors and victims at their worst moments. You want to capture raw emotion, but also respect people. Mediated access is still possible with digital collections--it could help to mitigate privacy and confidentiality concerns, especially in the immediate term.
- Having a good collecting policy is really important before asking for submissions, donations, etc.
- Suggestion: Develop teacher guides for these collections so that they may be more useful for distance learning.
- Partners raised how important data-driven interfaces like dashboards have become to news content on the web, and the need to improve capture and replay capability with Archive-It's tools accordingly.
- Breaking news and event-related collections remind us to coordinate and collaborate across institutions, lest efforts be duplicated or important materials fall through the cracks. Peer support can also broaden and strengthen the corpus.