Hypatia data loading update as of Saturday, September 17, 2011

Executive Summary

I have a first pass at a complete data load for Gould checked into GitHub, and it's ready for loading onto a server as soon as Naomi gets a chance. The load covers both FTK file items (modeled in the hypatia software as "HypatiaFtkItems" and also known as "processed collections") and FTK disk images (modeled in the hypatia software as "HypatiaDiskImageItems" and also known as "unprocessed collections"). The objects we're creating use atomistic modeling for their Fedora objects (with a couple of shortcuts, which I'll name here), and there is an isMemberOf relationship between a file and the disk it came from. I had to take my best guess at some of the metadata mapping, and Peter Chan has already found some problems that I'm going to correct, but in the meantime I'd love some feedback on the object data. Here is some real-world data, pulled hot and fresh from my local hypatia instance.

HypatiaFtkItem

This represents a file from a processed collection. This example is for a WordPerfect 4.2 file called "BURCH2," which came off of a 5.25 inch floppy disk known as "CM005."

descMetadata for HypatiaFtkItem
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
<mods:titleInfo>
<mods:title>The Burgess Shale and the Nature of History</mods:title>
</mods:titleInfo>
<mods:location>
<mods:physicalLocation type="disk">CM005</mods:physicalLocation>
<mods:physicalLocation type="filepath">CM005.001/NONAME [FAT12]/[root]/BURCH2</mods:physicalLocation>
</mods:location>
<mods:originInfo>
<mods:dateCreated>n/a</mods:dateCreated>
<mods:dateOther type="last_accessed">n/a</mods:dateOther>
<mods:dateOther type="last_modified">10/20/1988 10:44:46 AM (1988-10-20 17:44:46 UTC)</mods:dateOther>
</mods:originInfo>
<mods:typeOfResource></mods:typeOfResource>
<mods:physicalDescription>
<mods:form></mods:form>
</mods:physicalDescription>
</mods:mods>

Mark Matienzo pointed me toward a good MODS reference (http://library.princeton.edu/departments/tsd/metadoc/mods), and I used that to include file created / accessed / modified dates. However, there is no MODS authority for "last_accessed" or "last_modified"; I made those up. Is it worth establishing some controlled vocabulary here? (Maybe something like what Princeton did in the reference I've been using: http://library.princeton.edu/departments/tsd/metadoc/mods/dates.html) Is there a better place to include these dates?
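
Whatever vocabulary we settle on for the type attribute, the date values themselves could be normalized. As a rough sketch only: the function below pulls the UTC portion out of a string shaped like the last_modified value in the sample and rewrites it in W3CDTF form, which MODS date elements can declare with encoding="w3cdtf". The function name and the decision to return None for "n/a" are my assumptions, not anything the current loader does.

```python
import re
from datetime import datetime, timezone

# Matches the parenthesized UTC portion seen in the sample date string.
UTC_PART = re.compile(r"\((\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC\)")


def ftk_date_to_w3cdtf(raw):
    """Return a W3CDTF UTC timestamp, or None for 'n/a' or unrecognized input.

    Hypothetical helper: assumes dates look like the last_modified sample.
    """
    match = UTC_PART.search(raw or "")
    if match is None:
        return None
    parsed = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(ftk_date_to_w3cdtf("10/20/1988 10:44:46 AM (1988-10-20 17:44:46 UTC)"))
print(ftk_date_to_w3cdtf("n/a"))
```

Returning None for "n/a" would let the loader omit an empty date element instead of storing a placeholder string, but that is a design choice we would need to agree on.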

contentMetadata for HypatiaFtkItem
<contentMetadata objectId="hypatia:423" type="born-digital">
<resource data="metadata" id="analysis-text" objectId="hypatia:424" type="analysis">
<file format="WordPerfect 4.2" id="BURCH2" size="58715 B">
<location type="filesystem">files/BURCH2</location>
<checksum type="md5">0DDF3CB211DECC768500E008BD181949</checksum>
<checksum type="sha1">F93D649ED1DA6D3EDE0679DB1EF39490C6FDA4BE</checksum>
</file>
</resource>
</contentMetadata>

Note that we're describing a file that is contained in the object "hypatia:424". One concern here is whether location is meaningful. It currently contains the location to which FTK exported the file in question, not the file's location on its source disk. Should it instead read like this?

<location type="filesystem">"CM005.001/NONAME [FAT12]/[root]/BURCH2"</location>
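
Whichever value ends up in location, the checksums in contentMetadata give us a way to confirm that the exported file is the one being described. The sketch below is illustrative only: it reads the recorded checksums and compares them against a file's bytes. The function names are hypothetical, and it assumes the checksum type values ("md5", "sha1") match algorithm names that Python's hashlib accepts.

```python
import hashlib
import xml.etree.ElementTree as ET

SAMPLE = """<contentMetadata objectId="hypatia:423" type="born-digital">
<resource data="metadata" id="analysis-text" objectId="hypatia:424" type="analysis">
<file format="WordPerfect 4.2" id="BURCH2" size="58715 B">
<location type="filesystem">files/BURCH2</location>
<checksum type="md5">0DDF3CB211DECC768500E008BD181949</checksum>
<checksum type="sha1">F93D649ED1DA6D3EDE0679DB1EF39490C6FDA4BE</checksum>
</file>
</resource>
</contentMetadata>"""


def expected_checksums(content_metadata_xml):
    """Map each file id to {algorithm: lowercase hex digest}."""
    root = ET.fromstring(content_metadata_xml)
    result = {}
    for file_el in root.iter("file"):
        result[file_el.get("id")] = {
            c.get("type"): (c.text or "").strip().lower()
            for c in file_el.findall("checksum")
        }
    return result


def verify(data, checksums):
    """True only if at least one checksum is recorded and all of them match."""
    return bool(checksums) and all(
        hashlib.new(algorithm, data).hexdigest() == digest
        for algorithm, digest in checksums.items()
    )


print(expected_checksums(SAMPLE)["BURCH2"])
```

In use, the bytes would come from wherever the exported file lives, for example verify(open(path, "rb").read(), expected_checksums(xml)["BURCH2"]).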

RELS-EXT for HypatiaFtkItem
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="info:fedora/hypatia:423">
<isMemberOf xmlns="info:fedora/fedora-system:def/relations-external#" rdf:resource="info:fedora/hypatia:165"></isMemberOf>
<hasModel xmlns="info:fedora/fedora-system:def/model#" rdf:resource="info:fedora/afmodel:HypatiaFtkItem"></hasModel>
</rdf:Description>
</rdf:RDF>

You can see that it has a model of "HypatiaFtkItem" and an isMemberOf relationship to hypatia:165, which is a HypatiaDiskImageItem.
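
For anyone who wants to check these relationships outside the hypatia application, here is a minimal sketch that reads the RELS-EXT above with Python's standard library. The function name and the shape of the returned dictionary are my own choices for illustration.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RELS_EXT = "info:fedora/fedora-system:def/relations-external#"
MODEL = "info:fedora/fedora-system:def/model#"
PREFIX = "info:fedora/"

SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="info:fedora/hypatia:423">
<isMemberOf xmlns="info:fedora/fedora-system:def/relations-external#" rdf:resource="info:fedora/hypatia:165"></isMemberOf>
<hasModel xmlns="info:fedora/fedora-system:def/model#" rdf:resource="info:fedora/afmodel:HypatiaFtkItem"></hasModel>
</rdf:Description>
</rdf:RDF>"""


def to_pid(uri):
    """Strip the info:fedora/ prefix so only the pid remains."""
    return uri[len(PREFIX):] if uri.startswith(PREFIX) else uri


def read_rels_ext(xml_text):
    """Return the object's pid, its isMemberOf targets, and its models."""
    description = ET.fromstring(xml_text).find(f"{{{RDF}}}Description")

    def targets(namespace, name):
        return [
            to_pid(el.get(f"{{{RDF}}}resource"))
            for el in description.findall(f"{{{namespace}}}{name}")
        ]

    return {
        "pid": to_pid(description.get(f"{{{RDF}}}about")),
        "member_of": targets(RELS_EXT, "isMemberOf"),
        "models": targets(MODEL, "hasModel"),
    }


print(read_rels_ext(SAMPLE))
```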

FileAsset for HypatiaFtkItem

This contains the file "BURCH2" in the content datastream, plus an HTML display derivative (so you don't have to have a copy of WordPerfect 4.2 in order to read the file's content) in the "derivative_html" datastream.

HypatiaDiskImageItem

This represents the FTK-generated and analyzed disk image for a 5.25 inch floppy disk known as CM005. Collections that have undergone disk-image-level analysis but not file-level analysis are referred to locally as "unprocessed collections," but it is worth noting that processed collections, like the Gould collection, still have disk-level data objects.

descMetadata for HypatiaDiskImageItem hypatia:165
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
<mods:titleInfo>
<mods:title>CM005</mods:title>
</mods:titleInfo>
<mods:physicalDescription>
<mods:extent>5.25 inch Floppy Disk</mods:extent>
<mods:digitalOrigin>Born Digital</mods:digitalOrigin>
</mods:physicalDescription>
<mods:identifier type="local">CM005</mods:identifier>
</mods:mods>

contentMetadata for HypatiaDiskImageItem hypatia:165
<contentMetadata objectId="hypatia:165" type="born-digital">
<resource data="content" id="disk-image" objectId="hypatia:166" type="disk-image">
<file format="BINARY" id="CM005" size="368640 B">
<checksum type="md5">9914f13d38333d43369b636a54ec1368</checksum>
</file>
</resource>
</contentMetadata>

contentMetadata is showing that the actual "CM005" disk image is in object hypatia:166, which is a FileAsset.
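
To make that item-to-FileAsset link concrete, here is a small sketch that walks a contentMetadata document and lists which object holds each file, along with the size as an integer. It assumes the size attribute always looks like "368640 B" (a number, a space, then a unit), which is only what the two samples here show; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<contentMetadata objectId="hypatia:165" type="born-digital">
<resource data="content" id="disk-image" objectId="hypatia:166" type="disk-image">
<file format="BINARY" id="CM005" size="368640 B">
<checksum type="md5">9914f13d38333d43369b636a54ec1368</checksum>
</file>
</resource>
</contentMetadata>"""


def file_assets(content_metadata_xml):
    """List one row per file: owning item, FileAsset pid, file id, size in bytes."""
    root = ET.fromstring(content_metadata_xml)
    rows = []
    for resource in root.findall("resource"):
        for file_el in resource.findall("file"):
            size_text = file_el.get("size", "")
            rows.append({
                "item": root.get("objectId"),
                "file_asset": resource.get("objectId"),
                "file": file_el.get("id"),
                # Assumes "<number> B"; None when the attribute is missing.
                "bytes": int(size_text.split()[0]) if size_text else None,
            })
    return rows


print(file_assets(SAMPLE))
```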

FileAsset for HypatiaDiskImageItem

A HypatiaDiskImageItem has a payload file that we store in the "content" datastream of a FileAsset object. However, it also has several other files. Here you can see a datastream called "front" that contains a photo of the disk in question. If the back of the disk had also been photographed, there would also be a datastream called "back." I'm also planning to put the .csv and .txt files that FTK generates as datastreams here. That seems more appropriate (and easier) to me than handling them as separate FileAssets, but we may want to re-examine this in later phases of the project.

Next steps:

- Adding relationships to the collection object, so we can easily see all of the files connected to a collection.

- Running this processing against other FTK collections. 

- Creating HypatiaItem objects for files described in EAD, which will follow the same datastream patterns described here, but will probably need some customization. 
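
For the first of these steps, one possible shape is an extra RELS-EXT statement on each item pointing at the collection object. The sketch below is an assumption on my part, not a decision: it uses the isMemberOfCollection predicate from Fedora's relations-external vocabulary, and the collection pid in the example is a made-up placeholder.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RELS_EXT = "info:fedora/fedora-system:def/relations-external#"
ET.register_namespace("rdf", RDF)  # keep the rdf: prefix when serializing

SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="info:fedora/hypatia:423">
<isMemberOf xmlns="info:fedora/fedora-system:def/relations-external#" rdf:resource="info:fedora/hypatia:165"></isMemberOf>
</rdf:Description>
</rdf:RDF>"""


def add_collection_membership(rels_ext_xml, collection_pid):
    """Return RELS-EXT XML with an isMemberOfCollection statement added once."""
    root = ET.fromstring(rels_ext_xml)
    description = root.find(f"{{{RDF}}}Description")
    tag = f"{{{RELS_EXT}}}isMemberOfCollection"
    target = f"info:fedora/{collection_pid}"
    already_present = any(
        el.get(f"{{{RDF}}}resource") == target
        for el in description.findall(tag)
    )
    if not already_present:
        ET.SubElement(description, tag, {f"{{{RDF}}}resource": target})
    return ET.tostring(root, encoding="unicode")


# "hypatia:999" is a hypothetical collection pid used only for illustration.
print(add_collection_membership(SAMPLE, "hypatia:999"))
```

The function is idempotent, so rerunning the load would not pile up duplicate statements. ElementTree chooses its own prefix for the relations-external namespace when serializing, so the output is equivalent RDF but not byte-for-byte in the style shown above.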

Questions:

- Do we want to pull information from the collection into any of these objects? I notice there's nothing in these that says they're from the Stephen Jay Gould archive. Of course, that can show up in the interface without the data being in the object. On the one hand, including that information is more work and creates a dependency on having a collection object before the FTK objects are created. On the other hand, it seems like a good idea to provide a bit of context in these objects, but I don't have a sense of how important that is.
