National Archives Digitization Tools Now on GitHub
As part of our open government initiatives, the National Archives has begun to share applications developed in-house on GitHub, a social coding platform. Software developers use GitHub to share code and collaborate on software projects, including many open source projects.
Over the last year and a half, our Digitization Services Branch has developed a number of software applications to facilitate digitization workflows. These applications have significantly increased our productivity and improved the accuracy and completeness of our digitization work.
We shared our experiences with these applications with colleagues at other institutions, such as the Library of Congress and the Smithsonian Institution, and they expressed interest in trying the applications in their own digitization workflows. We have now made two of them, “File Analyzer and Metadata Harvester” and “Video Frame Analyzer,” available on GitHub for use by other institutions and the public.
The File Analyzer and Metadata Harvester functions like a digitization Swiss army knife. It allows a user to analyze the contents of a file system or external drive and generates statistics about the contents of each directory. It can generate checksum values to confirm the bit-level integrity of files after they have been copied to a new device. After a collection of files has been converted from one digital format to another, it can verify that there is a one-to-one match between the before and after files. For the 1940 Census project, NARA’s Digitization Services scanned and indexed 3.9 million images for online publication. This application was critical to ensuring that each original file was accounted for in the final set of published files!
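To illustrate the two checks described above, here is a minimal Python sketch. It is not the File Analyzer's own code: the function names, the choice of SHA-256, and the rule of matching converted files by relative path without the extension are all assumptions made for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_copies(source_root, copy_root):
    """Report files that are missing from, extra in, or altered in the copy."""
    source = checksum_manifest(source_root)
    copy = checksum_manifest(copy_root)
    return {
        "missing": sorted(source.keys() - copy.keys()),
        "extra": sorted(copy.keys() - source.keys()),
        "altered": sorted(
            name for name in source.keys() & copy.keys()
            if source[name] != copy[name]
        ),
    }


def unmatched_conversions(before_root, after_root):
    """Compare two trees by relative path without extension.

    Assumes (hypothetically) that each original has exactly one converted
    file with the same name and a different extension.
    Returns (originals with no conversion, conversions with no original).
    """
    def stems(root):
        root = Path(root)
        return {
            p.relative_to(root).with_suffix("").as_posix()
            for p in root.rglob("*")
            if p.is_file()
        }

    before, after = stems(before_root), stems(after_root)
    return sorted(before - after), sorted(after - before)
```

Calling `compare_copies("originals", "backup_drive")` on two hypothetical folders would return three lists; all three being empty suggests the copy is complete and bit-for-bit identical.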
The File Analyzer can also import data created in an external spreadsheet. File Analyzer results can be matched and merged with auxiliary data from an external spreadsheet or finding aid.
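As a rough illustration of matching file results against spreadsheet data, the Python sketch below merges a list of per-file records with rows from a CSV export. The function name, the `filename` key column, and the `matched` flag are assumptions for this example and do not describe the File Analyzer's actual interface.

```python
import csv


def merge_with_spreadsheet(file_rows, spreadsheet, key="filename"):
    """Merge per-file records with spreadsheet rows that share a key column.

    file_rows   -- list of dicts, one per analyzed file
    spreadsheet -- an open CSV file (or any iterable of CSV lines) with a header row
    key         -- hypothetical name of the column both sources share
    """
    extra = {row[key]: row for row in csv.DictReader(spreadsheet)}
    merged = []
    for row in file_rows:
        combined = dict(row)
        combined.update(extra.get(row[key], {}))
        # Flag records with no spreadsheet entry so they can be reviewed.
        combined["matched"] = row[key] in extra
        merged.append(combined)
    return merged
```

Records flagged with `matched` set to `False` are the ones a reviewer would want to look at, since they exist on disk but have no entry in the finding aid.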
The GitHub repository for the “File Analyzer and Metadata Harvester” contains additional information about this application.
The Video Frame Analyzer examines the technical properties of individual frames in a video file in order to detect quality issues in digitized video. The quality issues that arise vary from collection to collection, so the application allows the user to configure the tests to be performed against a file and to tailor those settings to a specific collection. Staff in the AV Preservation Lab saw a 50% reduction in the time needed to perform quality checks. The quality checks changed from purely subjective criteria to objective criteria plus a manual review of suspect files. NARA shared a prior version of this application with the Smithsonian Institution, which saw similar results.
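The idea of configurable, per-collection frame tests can be sketched in Python as follows. This is only an illustration: the source does not say which properties the Video Frame Analyzer measures, so the brightness-based tests, the threshold values, and all names here are hypothetical. Each frame is represented simply as a list of pixel brightness values from 0 to 255.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class FrameTestConfig:
    """Hypothetical thresholds that could be tuned for each collection."""
    min_mean_luma: float = 16.0    # darker frames are flagged
    max_mean_luma: float = 235.0   # brighter frames are flagged
    max_luma_jump: float = 60.0    # largest allowed change between frames


def find_suspect_frames(frames, config):
    """Return (frame index, reasons) for each frame that fails a test."""
    suspects = []
    previous = None
    for index, pixels in enumerate(frames):
        level = mean(pixels)
        reasons = []
        if level < config.min_mean_luma:
            reasons.append("too dark")
        if level > config.max_mean_luma:
            reasons.append("too bright")
        if previous is not None and abs(level - previous) > config.max_luma_jump:
            reasons.append("abrupt change")
        if reasons:
            suspects.append((index, reasons))
        previous = level
    return suspects
```

Only the frames returned by `find_suspect_frames` would need a manual look, which mirrors the shift described above from purely subjective checks to objective tests followed by review of suspect files.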
Courtney Egan from NARA’s Digitization Services is scheduled to give a presentation on the use of this application to the Association of Moving Image Archivists in November.
The GitHub repository for the “Video Frame Analyzer” contains additional information about this application.
Both the “File Analyzer and Metadata Harvester” and the “Video Frame Analyzer” were developed by Terry Brady, an Information Technology Specialist at the National Archives, in consultation with staff from the Digitization Services Branch. Terry has recently left the National Archives, but we would like to thank him for his important work in developing these applications and making them available on GitHub. The National Archives hopes that these applications will not only be useful to the larger community of cultural institutions, but will also be enhanced by it.