Searching 40 TB of Electronic Records with the Swipe of a Finger
Imagine that you want to find electronic records related to a particular geographic location in a very large collection (40 TB and about 70 million files) of archival electronic records. Wouldn’t it be cool if you could pick up an iPad, have a map pop up on the screen, run your finger over the area on the map you were interested in, and have a list of relevant record collections show up on the screen next to the map? Wouldn’t it be really cool if you could then drill down through that list and see metadata about records in each collection?
Our Research Partners at the RENaissance Computing Institute (RENCI) and the University of North Carolina – Chapel Hill have demonstrated prototype tools that can carry out just such a search, and more. The development of these tools is part of the NARA and National Science Foundation (NSF)-supported Cyberinfrastructure for Billions of Electronic Records (CI-BER) Project. They recently presented their interim results at the 2011 Large Data Analysis and Visualization (LDAV) symposium in Providence, RI.
They have developed a set of tools that:
- Searches across a large collection
- Identifies files in the collection that contain geospatial information (i.e., GIS datasets)
- Identifies applications that can open the identified files
- Opens the files
- Extracts metadata from the files
- Determines the geographic coverage of the files
- Adds the metadata and the geographic extent (latitude and longitude range) of each file to an index
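The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CI-BER implementation: it handles GeoJSON files alone as a stand-in for the many GIS formats the real tools open, and the function names (`iter_positions`, `geojson_bbox`, `build_index`) and the shape of the index entries are hypothetical.

```python
import json
import os


def iter_positions(coords):
    """Yield (lon, lat) pairs from arbitrarily nested GeoJSON coordinates."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords[0], coords[1]
    else:
        for part in coords:
            yield from iter_positions(part)


def geojson_bbox(obj):
    """Return (min_lon, min_lat, max_lon, max_lat), or None if no coordinates."""
    kind = obj.get("type")
    if kind == "FeatureCollection":
        geometries = [f.get("geometry") for f in obj.get("features", [])]
    elif kind == "Feature":
        geometries = [obj.get("geometry")]
    else:
        geometries = [obj]  # assume a bare geometry object

    points = []
    for geometry in geometries:
        # GeometryCollection objects are skipped in this simplified sketch.
        if geometry and "coordinates" in geometry:
            points.extend(iter_positions(geometry["coordinates"]))
    if not points:
        return None
    lons = [lon for lon, _lat in points]
    lats = [lat for _lon, lat in points]
    return (min(lons), min(lats), max(lons), max(lats))


def build_index(root):
    """Walk `root` and return one index entry per readable GeoJSON file."""
    index = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Identification step: here, simply by file extension (an assumption).
            if not name.lower().endswith(".geojson"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as handle:
                    data = json.load(handle)
            except (OSError, ValueError):
                continue  # unreadable or malformed file: leave it out of the index
            if not isinstance(data, dict):
                continue
            bbox = geojson_bbox(data)
            if bbox is not None:
                index.append({"path": path, "type": data.get("type"), "bbox": bbox})
    return index
```

Calling `build_index("/path/to/collection")` would return a list of dictionaries, each holding a file path, a small piece of metadata, and the latitude and longitude range the file covers.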
They have also developed other tools that produce two types of visualization: top-down and bottom-up.
The top-down visualization is the one described above. Using an iPad, an iPhone, or a web browser, you select the area on the map you are interested in, and a list of the relevant geospatial records appears.
Here is what it looks like on the screen:
Draw a box on the area of the map you are interested in and a list of relevant record collections shows up on the left side of the screen.
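At its core, a top-down search of this kind amounts to testing the box drawn on the map against the geographic coverage stored for each record. The sketch below illustrates that idea only. It is not the project's code: the sample index entries and paths are made up, the function names are hypothetical, it scans the index linearly where a production system would more likely use a spatial index, and it ignores boxes that cross the 180° meridian.

```python
def boxes_intersect(a, b):
    """True if two (min_lon, min_lat, max_lon, max_lat) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(index, query_box):
    """Return every index entry whose coverage intersects the drawn box."""
    return [entry for entry in index if boxes_intersect(entry["bbox"], query_box)]


# Hypothetical index entries; paths and extents are invented for illustration.
SAMPLE_INDEX = [
    {"path": "collection_a/roads.shp", "bbox": (-80.0, 35.0, -78.0, 36.5)},
    {"path": "collection_b/landcover.tif", "bbox": (-123.0, 37.0, -121.0, 38.5)},
]

if __name__ == "__main__":
    # A box drawn over part of the first entry's coverage.
    for hit in search(SAMPLE_INDEX, (-79.5, 35.5, -78.5, 36.2)):
        print(hit["path"])
```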
In the bottom-up visualization you start with the records and end up with the map. Here is how our Research Partners describe it:
The user is presented with a tree-map containing grey, red, yellow, and blue boxes. Each box corresponds to a directory in the collection, which may contain a number of subdirectories. Each box is scaled to the number of files it contains. The colors correspond to entries containing vector records only (red), raster only (blue), both raster and vector (yellow), and no geographic files (gray). Next to the tree-map is a physical map, provided by OpenLayers. Tapping (on a touch-enabled mobile device) or clicking on a box in the tree-map shows the bounding box of all the files in that box and shows a listing of all the geographic metadata for its directory, or in the case of a single record, the metadata for that record.[i]
Here is what it looks like:
Click on a box in the treemap on the right and a box indicating the geographic coverage of that collection shows up on the map at the left.
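The per-directory summary behind each treemap box can be sketched as follows, using the color scheme from the description quoted above. This is an assumption-laden illustration rather than the tools' actual logic: the entry format (a `kind` of `"vector"`, `"raster"`, or `None`, plus an optional `bbox`) and all function names are hypothetical.

```python
def directory_color(has_vector, has_raster):
    """Map a directory's contents to the treemap color described in the text."""
    if has_vector and has_raster:
        return "yellow"
    if has_vector:
        return "red"
    if has_raster:
        return "blue"
    return "gray"


def union_bbox(boxes):
    """Smallest (min_lon, min_lat, max_lon, max_lat) box enclosing all `boxes`."""
    if not boxes:
        return None
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def summarize_directory(entries):
    """Summarize one directory: box size, box color, and combined coverage."""
    kinds = {entry.get("kind") for entry in entries}
    boxes = [entry["bbox"] for entry in entries if entry.get("bbox") is not None]
    return {
        "file_count": len(entries),  # the treemap box is scaled to this
        "color": directory_color("vector" in kinds, "raster" in kinds),
        "bbox": union_bbox(boxes),  # drawn on the map when the box is selected
    }
```

Selecting a treemap box would then correspond to looking up that directory's summary and drawing its `bbox` on the map.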
Three other things worth noting about these tools:
- The tools are specifically designed to work with archival collections. The collections used for developing these tools were a subset of the testbed collection of Federal and Presidential records maintained by NARA’s Applied Research Division. The tools are also designed to work with the hierarchical nature of records collections and their associated metadata.
- The collections used with these tools are housed in a repository built using iRODS data management software.[ii] NARA and the NSF have supported the development of the open source iRODS software and its predecessor, SRB, for over a decade.
- The indexing and visualization tools being developed here are open source and will be available on GitHub.
For more information about the CI-BER Project, check out their blog at http://ci-ber.blogspot.com/
You can download a copy of their poster from the LDAV symposium at http://www.slideshare.net/richardjmarciano/a-system-for-scalable-visualization-of-geographic-archival-records (Hint: It is much easier to read if you download the file rather than trying to read it on the SlideShare site.)