As the techniques facilitated by BitCurator and other forensic applications have entered regular usage, practitioners are identifying bottlenecks, breakdowns, and other problem areas in need of further software development. In this session, Forum participants will hear about software development projects from a range of institutional contexts, including individual-driven automation around disk image and logical file processing and documentation, cross-institutionally funded work to support HFS disk images in Archivematica, and grant-funded projects to provide access to, redact, and leverage natural language processing tools on the contents of disk images.
Automated Processing of Disk Images and Directories in BitCurator
As a means to more efficiently process large-scale digital archives and with inspiration from Jess Whyte's scripting work at the University of Toronto, the Canadian Centre for Architecture (CCA) is developing a set of software tools for automating triage, SIP creation, and description of born-digital archives within BitCurator. These tools -- collectively known as "CCA Tools" -- create consistent SIPs packaged for Archivematica from digital files, directories, or disk images, and generate pre-populated description spreadsheets containing information such as extent and date statements and a scope and content note. This talk will discuss why BitCurator is an ideal environment for automated processing, give an introduction to the CCA Tools, and discuss potential use cases and next steps.
Tim Walsh, Canadian Centre for Architecture
BitCurator Access and BitCurator NLP - Updates and Future Directions
The BitCurator environment supports a variety of digital curation activities. The BitCurator Access project extended this to the point of interaction with end users, providing and supporting a variety of access mechanisms. We developed tools that support access to disk images through three exploratory approaches: (1) building tools to support web-based services, (2) enabling the export of file systems and associated metadata, and (3) using emulation environments. We’ll highlight two BitCurator Access software products: BitCurator Access Webtools, which supports browser-based search and navigation over data from disk images, and a set of scripts to redact sensitive data from disk images. Members of the BitCurator user community have said that they would like tools to help in identifying and exploring information based on specific entities (e.g. people, places, organizations, events) associated with collections. The BitCurator NLP project aims to address this need by layering existing natural language processing (NLP) and visualization tools on top of the BitCurator environment and BitCurator Access Webtools. Disk images are internally complex and require the sorts of underlying software that are available through the BitCurator environment and BCA Webtools, adapted for this purpose. Disks can also contain a variety of data and document types, requiring considerable pre-processing to extract content for NLP tools to process. We’ll report on the BitCurator NLP project, which is building on and extending a variety of tools and initiatives to provide services that can be run independently or be called by existing software environments used by LAMs.
Christopher Lee, School of Information and Library Science, University of North Carolina at Chapel Hill
Kam Woods, School of Information and Library Science, University of North Carolina at Chapel Hill
Developing Improved Disk Image Support in Archivematica: A Project Update
As digital archivists at New York Public Library and the University of California, Los Angeles, we both have a large number of HFS floppy disks in our collections. Our repositories have a focus on collecting in the humanities, and writers and artists in the late 1980s and early 1990s gravitated toward using early Apple computers. This would not be a problem in and of itself, but NYPL and UCLA also both use Archivematica, which was unable to identify and support work on HFS disk images. Previously, Archivematica relied solely on tsk_recover to identify file systems, but tsk_recover does not recognize HFS and many other file systems used by early computers. Together, we sponsored a project to develop functionality in Archivematica to ingest, characterize, and extract files from HFS disk images. This talk will discuss the impetus for the project, give a report on this work, which began in early February, and provide details on the specific development steps that make up the project. These include a pre-ingest script, usable in BitCurator or in Automation tools, that identifies and records file system information; support for generating DFXML metadata; and an extraction tool to extract files from the disk image. Presenters will also suggest possible next steps and potential development tasks that might build on the groundwork that this project has laid.
Susan Malsbury, New York Public Library
Shira Peltzman, UCLA Library