Scaling Digital Preservation with Homegrown Workflow Tools

Automating Preservation Processes for Digitized and Born-Digital Assets

As a seasoned construction professional and interior designer, I’ve seen firsthand how innovative tools and integrated workflows can transform the efficiency and scalability of complex projects. The same principles hold true when it comes to digital preservation – a critical challenge facing cultural heritage institutions like the NC State University Libraries.

Over the past decade, the Libraries’ Special Collections Research Center (SCRC) and Digital Library Initiatives (DLI) departments have collaborated to develop a suite of homegrown applications that automate many previously manual tasks in the digital curation and preservation lifecycle. These applications, including the Special Collections Preservation System (SCPS), Wonda, and DAEV, have enabled the Libraries to ingest over 87 terabytes of digitized and born-digital special collections assets into local and distributed preservation storage.

In this article, I’ll provide an in-depth look at the Libraries’ strategic approach to scaling digital preservation through the use of these custom-built, open-source-based tools. I’ll explore the specific functionalities of each application, the integration between them, and how they have streamlined workflows for both digitized and born-digital materials. Additionally, I’ll discuss the sustainability considerations and challenges associated with managing a homegrown ecosystem of digital preservation applications.

Establishing a Digital Preservation Program

The journey towards the Libraries’ robust digital preservation ecosystem began over a decade ago, as the institution recognized the growing need to actively manage its expanding digital special collections. Between 2010 and 2018, the Libraries’ Digital Collections Technical Oversight Committee and Digitization and Digital Curation Working Group collaborated on key initiatives, including the development of a “value matrix” to determine appropriate preservation levels for different digital assets, and a self-assessment using the National Digital Stewardship Alliance’s Levels of Digital Preservation framework.

With this foundational work in place, the DLI and SCRC departments began developing specifications for a comprehensive digital preservation system in 2017, completing the initial development in 2018. This system, known as the Special Collections Preservation System (SCPS), serves as the central hub for managing the long-term preservation of the Libraries’ digitized and born-digital special collections materials.

Integrating Homegrown Tools for Workflow Automation

To support the SCPS and automate various preservation-related tasks, the Libraries have developed two additional in-house applications: Wonda and DAEV.

Wonda: Orchestrating Digitization Workflows

Wonda is an “orchestration” system that brings together multiple tools used for digitization, automating the creation of access derivatives and the publication of digital collections. Prior to Wonda, digitization workflows were semi-automated, requiring manual intervention at various stages.

With Wonda, the process has become significantly more streamlined:

  1. ArchivesSpace Integration: Wonda provides an interface for selecting archival object records from the Libraries’ ArchivesSpace instance, establishing a direct link between the digitization project and the corresponding archival metadata.
  2. Automated Derivative Creation: After a technician completes the scanning and uploads the preservation-quality files, Wonda automatically generates access derivatives, including video captions, and triggers OCR processing using the Libraries’ custom-built AVPD tool.
  3. Publication Workflows: Once the derivative processing is complete, Wonda notifies the digital collections platform to make the resource publicly available, and generates corresponding digital object records in ArchivesSpace.
  4. Preservation Ingest: Wonda then initiates the process of ingesting the digitized assets into the SCPS for long-term preservation, including replicating the content in the APTrust digital preservation service.

The modular design of Wonda allows for the easy integration of new components as needed, providing the Libraries with a flexible and extensible digitization workflow solution.
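
The modular pattern described here can be illustrated with a minimal Python sketch. This is not Wonda's actual code: the `DigitizationJob` class, the step functions, and the identifiers are all hypothetical, and each step only records a log line where a real system would call out to other services.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DigitizationJob:
    """Hypothetical record of one digitization job moving through the pipeline."""
    archival_object_id: str
    preservation_files: List[str]
    log: List[str] = field(default_factory=list)


# A step is any callable that takes a job and records what it did.
Step = Callable[[DigitizationJob], None]


def create_derivatives(job: DigitizationJob) -> None:
    # Placeholder: a real step would generate access copies, captions, OCR.
    job.log.append(f"derivatives created for {len(job.preservation_files)} file(s)")


def publish(job: DigitizationJob) -> None:
    # Placeholder: a real step would notify the digital collections platform.
    job.log.append(f"published resource linked to {job.archival_object_id}")


def preservation_ingest(job: DigitizationJob) -> None:
    # Placeholder: a real step would hand the package to the preservation system.
    job.log.append("submitted to preservation system")


def run_pipeline(job: DigitizationJob, steps: List[Step]) -> DigitizationJob:
    """Run each step in order; adding a component means adding a callable."""
    for step in steps:
        step(job)
    return job


job = run_pipeline(
    DigitizationJob("archival_object_123", ["page_0001.tif", "page_0002.tif"]),
    [create_derivatives, publish, preservation_ingest],
)
print("\n".join(job.log))
```

Because the pipeline is just an ordered list of callables, a new component can be added by appending another function without touching the existing steps.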

DAEV: Born-Digital Archival Processing

The Digital Assets of Enduring Value (DAEV) application supports the assessment, packaging, and description of born-digital materials, as well as their preservation ingest into SCPS. DAEV guides technicians through the use of open-source command-line tools, recording the actions taken during processing and generating preservation metadata.

Key features of DAEV include:

  1. ArchivesSpace Integration: Similar to Wonda, DAEV allows technicians to search for and retrieve relevant archival object records from ArchivesSpace, ensuring consistent metadata association.
  2. Automated Workflows: DAEV defines processing workflows in YAML files, with instructional text stored in Markdown. This enables the application to customize and automate various tasks, such as disk imaging, file format identification, and virus scanning, based on the selected workflow.
  3. Ingest Automation: In the latest version of DAEV, the application can automatically initiate the ingest of processed born-digital materials into SCPS, further streamlining the preservation process.
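
To make the idea of a declaratively defined workflow concrete, here is a small Python sketch. It is an illustration under stated assumptions, not DAEV's implementation: the `WORKFLOW` dictionary stands in for what a YAML loader might return, and the step identifiers, command names (`image-tool`, `format-tool`, `scan-tool`), and paths are placeholders rather than real tools. The sketch only renders and records the commands; it does not execute them.

```python
from datetime import datetime, timezone
from typing import Dict, List

# Stand-in for a workflow parsed from a YAML file.
# Step ids and command templates are placeholders, not real CLI tools.
WORKFLOW = {
    "name": "example-media-workflow",
    "steps": [
        {"id": "disk_image", "command": "image-tool {source} {workdir}/disk.img"},
        {"id": "identify_formats", "command": "format-tool {workdir}"},
        {"id": "virus_scan", "command": "scan-tool {workdir}"},
    ],
}


def plan_workflow(workflow: Dict, params: Dict[str, str]) -> List[Dict[str, str]]:
    """Fill in each step's command template and return one record per step."""
    records = []
    for step in workflow["steps"]:
        records.append({
            "step": step["id"],
            # str.format ignores parameters a template does not use.
            "command": step["command"].format(**params),
            "planned_at": datetime.now(timezone.utc).isoformat(),
        })
    return records


records = plan_workflow(
    WORKFLOW, {"source": "/dev/sdb", "workdir": "/tmp/accession_001"}
)
for record in records:
    print(record["step"], "->", record["command"])
```

Keeping the steps in data rather than code means a different workflow can be selected by loading a different definition, and the per-step records give a simple trail of the actions planned for each accession.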

By leveraging DAEV, the Libraries’ SCRC staff, including full-time employees and student workers, can effectively manage the growing volume of born-digital archival materials, even if they have limited experience with command-line tools.

SCPS: The Preservation Hub

At the heart of the Libraries’ digital preservation ecosystem is the Special Collections Preservation System (SCPS), a web application that manages the long-term preservation of digital assets in both the Libraries’ local primary storage and the APTrust distributed preservation service.

SCPS serves as the central point of ingestion for digital assets, receiving packages directly from Wonda and DAEV. The application performs regular checksum validations to ensure file fixity, and prepares the assets for ingest into APTrust using the BagIt for Ruby library.
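
A checksum-based fixity check of the kind described above can be sketched in a few lines of Python. This is a generic illustration, not SCPS code and not the BagIt for Ruby library: the function names are hypothetical, SHA-256 is an assumed algorithm choice, and the manifest is a plain dictionary of relative paths to digests.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large preservation files are not read at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Record a checksum for every file under root, keyed by relative path."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def failed_fixity(root: Path, manifest: Dict[str, str]) -> List[str]:
    """Return the paths that are missing or whose checksum no longer matches."""
    failures = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel_path)
    return failures


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "page_0001.tif").write_bytes(b"original bytes")
    manifest = build_manifest(root)          # checksums captured at ingest
    clean_check = failed_fixity(root, manifest)
    (root / "page_0001.tif").write_bytes(b"corrupted bytes")  # simulate damage
    damaged_check = failed_fixity(root, manifest)

print(clean_check, damaged_check)
```

Running the check on a schedule and comparing against the manifest captured at ingest is what allows silent corruption to be detected and the affected file restored from another copy.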

Key capabilities of SCPS include:

  1. Flexible SIP Definition: SCPS allows for a relatively flexible definition of what a Submission Information Package (SIP) can contain, accommodating the varied needs of digitized and born-digital materials.
  2. Dissemination Support: SCPS enables staff to retrieve Dissemination Information Packages (DIPs) in a flexible manner, allowing for the retrieval of entire packages or individual components.
  3. Reporting and Monitoring: The application provides reporting capabilities, such as the total number of ingests by date range and the number of ingests by asset type, as well as monitoring the status of APTrust ingests.
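
The two report types mentioned above, ingests within a date range and ingests by asset type, reduce to simple aggregations. The following Python sketch shows one way to compute them; the record layout, function names, and sample rows are invented for illustration and do not reflect SCPS's data model or the Libraries' actual ingest figures.

```python
from collections import Counter
from datetime import date
from typing import Dict, List, Tuple

# Invented sample rows: (ingest date, asset type). Not real ingest data.
INGESTS: List[Tuple[date, str]] = [
    (date(2023, 1, 15), "digitized"),
    (date(2023, 2, 3), "born-digital"),
    (date(2023, 2, 20), "digitized"),
    (date(2023, 4, 1), "digitized"),
]


def ingests_in_range(records: List[Tuple[date, str]], start: date, end: date) -> int:
    """Count ingests whose date falls within start and end, inclusive."""
    return sum(1 for ingest_date, _ in records if start <= ingest_date <= end)


def ingests_by_type(records: List[Tuple[date, str]]) -> Dict[str, int]:
    """Count ingests per asset type."""
    return dict(Counter(asset_type for _, asset_type in records))


first_quarter = ingests_in_range(INGESTS, date(2023, 1, 1), date(2023, 3, 31))
by_type = ingests_by_type(INGESTS)
print(first_quarter, by_type)
```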

SCPS has played a critical role in the Libraries’ digital preservation strategy, serving as the central hub for managing the long-term stewardship of the institution’s growing digital collections.

Sustainability Considerations and Challenges

The Libraries’ approach to digital preservation, which relies heavily on homegrown applications, presents both benefits and challenges in terms of sustainability.

Benefits of the Homegrown Ecosystem:
– Ability to customize applications to meet the specific needs of the SCRC and DLI departments
– Seamless integration between the various components, enabling efficient workflows
– Opportunity for continuous improvement and iteration based on evolving preservation requirements

Sustainability Challenges:
– Staffing Levels: The maintenance and evolution of the homegrown applications rely on a small team of developers, typically fewer than two full-time equivalents. This can create a single point of failure if key staff members leave the organization.
– Lack of External Community: As the applications are not publicly available, they do not benefit from a broader community of users and contributors. This can lead to duplicated efforts and a potential lack of project memory over time.
– Dependency on APTrust: The Libraries’ digital preservation strategy is heavily dependent on the ongoing sustainability of the APTrust digital preservation service. While APTrust has demonstrated a strong commitment to stewardship continuity, any disruption to this service could have significant implications for the Libraries’ preservation efforts.

To address these challenges, the Libraries continue to assess the long-term viability of their homegrown preservation ecosystem and explore potential alternatives, such as adopting open-source tools like DART for ingest into APTrust. The institution also remains actively involved in the APTrust community, contributing to the development of the service and supporting its sustainability.

Conclusion: Scaling Digital Preservation with Homegrown Tools

The NC State University Libraries’ strategic approach to digital preservation, centered around the development of custom-built applications like Wonda, DAEV, and SCPS, has enabled the institution to effectively scale its digital curation and preservation efforts. By automating many previously manual tasks, these homegrown tools have streamlined workflows, improved efficiency, and allowed the Libraries to ingest a rapidly growing volume of digitized and born-digital special collections materials into local and distributed preservation storage.

While managing a homegrown ecosystem of digital preservation applications presents its own set of sustainability challenges, the Libraries’ iterative development approach and close collaboration between the SCRC and DLI departments have allowed the institution to continuously refine and adapt these tools to meet evolving preservation requirements.

As cultural heritage institutions increasingly grapple with the challenges of managing and preserving digital assets, the NC State University Libraries’ experience offers a model for leveraging custom-built, open-source-based applications to enhance the scalability and efficiency of digital preservation workflows. By embracing innovation and automation, institutions can better position themselves to safeguard their valuable digital collections for generations to come.