Background
Panel |
---|
Born-digital collections present a challenge to traditional ways of describing content and making it discoverable. Disks, drives, directories, etc. may contain many thousands of files which would be difficult to describe in detail in an EAD. The approach described here starts with an EAD that contains only a single reference to the born-digital portion of a collection, expressed as a specific Series within the larger collection. The Forensic Toolkit (FTK) software is used to produce both disk images and related analysis files, plus detailed technical information about individual files on the media. The output of this process is augmented, transformed, and eventually translated into a set of digital objects in DOR (Stanford's Fedora-based Digital Object Registry) and shared objects for Hypatia. A separate process is used for converting the Collection EAD into metadata objects representing the full context of the born-digital and other materials, with links made between the containers and the detailed objects. |
Section | |||||
---|---|---|---|---|---|
|
...
|
...
Sample of transformed output:
|
Each <file> segment will be the basis of a separate DOR or Hypatia Digital Object. The differences are:
- Stanford DOR (Digital Object Registry) objects are metadata-only, with content externally managed. Objects at Stanford will not have "content" datastreams.
- Stanford objects have an identityMetadata datastream that may or may not be present in Hypatia demo objects. Regardless, it is not a standard part of Hydra-compliant objects.
Collection and Series objects
The Collection and Born-Digital Series objects themselves are created first, ahead of FTK processing. All FTK-processed materials for a collection are processed together and are members of the Born-Digital Series set. Media objects must be linked to the appropriate series via an isMemberOf relationship.
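As an illustration of the isMemberOf link, here is a minimal sketch of a RELS-EXT datastream built with Python's standard library. The two PIDs are hypothetical placeholders, and the function name `build_rels_ext` is ours; only the Fedora relations-external namespace and the `isMemberOf` predicate come from Fedora itself.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
REL_NS = "info:fedora/fedora-system:def/relations-external#"


def build_rels_ext(object_pid: str, series_pid: str) -> str:
    """Return RDF/XML asserting that object_pid isMemberOf series_pid."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("rel", REL_NS)
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        rdf,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": f"info:fedora/{object_pid}"},
    )
    ET.SubElement(
        description,
        f"{{{REL_NS}}}isMemberOf",
        {f"{{{RDF_NS}}}resource": f"info:fedora/{series_pid}"},
    )
    return ET.tostring(rdf, encoding="unicode")


# Hypothetical PIDs, for illustration only.
xml_text = build_rels_ext("hypatia:media_CM006", "hypatia:series_6")
print(xml_text)
```

The same function could be reused for the file-to-media link if that relationship also ends up being expressed as isMemberOf; that choice is still open in these notes.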
Media (e.g. Disk Image) objects
The FTK processing must first create a set of media objects representing the physical media (hard drive, diskette, etc) on which the files were found. This has been described as a view of the "unprocessed" collection, meaning it has not been processed down to the individual units of content, the separate files.
Note that a Media object has characteristics of
- an "item" -- it represents a unit of meaning and has content "parts" as separate objects
- a "set" -- it has objects related to it as members ... should we consider a specialized relationship for this?
Sample of the starting lines of the .txt file describing the media object.
Panel |
---|
Created By AccessData® FTK® Imager 3.0.1.1467 110406 Case Information: |
Note that some of this information is repeated as header information in the file-level FTK report below.
From: Disk Image // CMnnn.001.txt | maps to | notes |
---|---|---|
Evidence Number: CM004 | descMetadata | Would correspond to EAD <c><unittitle> |
Evidence Number: CM004 | descMetadata | Would correspond to EAD <c><unitid> |
Case Number: M1437 | DC |
|
Notes: 5.25 inch Floppy Disks | descMetadata | Would correspond to EAD <physdesc> in a node describing the media. |
(implied) | RELS-EXT | A link to the Collection object |
(implied) | RELS-EXT | A link to the Series object |
identityMetadata – label = collection context...
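The mapping above could be applied roughly as follows. This is a sketch only: it assumes the .txt file carries one "Label: value" pair per line, and the sample text below is reconstructed from the values in the table rather than copied from a real report. The function names and the output dictionary keys are illustrative.

```python
def parse_case_information(text: str) -> dict:
    """Collect 'Label: value' pairs from the text of a disk image report."""
    fields = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        # Skip lines without a colon and section headers with no value.
        if sep and label.strip() and value.strip():
            fields.setdefault(label.strip(), value.strip())
    return fields


def media_mapping(fields: dict) -> dict:
    """Apply the table above: which value feeds which target."""
    return {
        "descMetadata/unittitle": fields.get("Evidence Number"),
        "descMetadata/unitid": fields.get("Evidence Number"),
        "descMetadata/physdesc": fields.get("Notes"),
        "DC": fields.get("Case Number"),
    }


# Assumed layout; values taken from the table above.
SAMPLE = """Case Information:
Case Number: M1437
Evidence Number: CM004
Notes: 5.25 inch Floppy Disks
"""

fields = parse_case_information(SAMPLE)
mapped = media_mapping(fields)
print(mapped)
```

The two implied RELS-EXT links (to the Collection and Series objects) are not derivable from the .txt file and would have to be supplied separately.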
File objects
File objects are the node objects representing individual files. The atomistic model would have these objects constructed as a parent (metadata) object and a child (content) object. For simplicity, we will create these File objects as a single object, combining the Hydra commonMetadata and genericContent models.
Sample of transformed FTK file (the conversion modules will not use this file as input; they will work from the same mapping rules that produce this output).
Panel | |
---|---|
<ftk_report | |
Panel | |
|
Each <file> segment will be the basis of a separate Hypatia Digital Object. This is an example where the atomistic model adds overhead (separate metadata and content objects) and an integrated object combining commonMetadata and genericContent could be considered.
|
Information from: FTK xml // Report_transformed.xml | maps to (within item objects) | notes | |||
---|---|---|---|---|---|
<filename>BU3A5</filename> | descMetadata | this is the original file name as it appeared on the original media. | |||
<Item_Number>1004</Item_Number> | descMetadata | internal FTK reference only, to disambiguate references in the FTK report | |||
<filepath>CM006.001/NONAME [FAT12]/[root]/BU3A5</filepath> | descMetadata | ||||
Information from: Disk Image // CMnnn.001.txt | maps to | notes | |||
<collection_title>Stephen J. Gould Papers | Collection object | Equivalent to EAD <archdesc><title> | |||
<series>Series 6: Born Digital Materials | Series set object | Equivalent to EAD series <c><unittitle> | |||
<note>5.25 inch Floppy Disks</note> | Series object | Corresponds to EAD <physdesc> | |||
<callnumber>M1437</callnumber> | Collection object | Corresponds to EAD <archdesc><did><unitid> | |||
Information from: FTK xml // Report_transformed.xml | maps to | notes | |||
<filename>BU3A5</filename> |
|
| |||
<Item_Number>1004</Item_Number> |
|
| |||
<filepath>CM006.001/NONAME [FAT12]/[root]/BU3A5</filepath> | <mods:physicalLocation> | |||
<disk_image_no>CM006</disk_image_no> |
|
| |||
location of file on original media | |||||
<disk_image_no>CM006</disk_image_no> | Naomi thinks this should be handled with a link to the media (disk image) object | This token, taken from the head of the <filepath>, is the only data link between the FTK output for a file object and the corresponding media object. We want a data link in descriptive metadata as well as an RDF link to the corresponding object. | |||
<filesize>35654</filesize> | descMetadata - human friendly | This should be a human friendly version of the file size. The machine friendly version is in contentMetadata. | <filesize>35654</filesize> |
| |
<filesize_unit>B</filesize_unit> |
| use to determine filesize in bytes (convert to bytes if nec) | Needed to correctly interpret <filesize>, if used | ||
<file_creation_date>n/a</file_creation_date> | descMetadata |
| |||
<file_accessed_date>n/a</file_accessed_date> | descMetadata |
| |||
<file_modified_date>12/8/1988 6:48:48 AM (1988-12-08 14:48:48 UTC)</file_modified_date> | descMetadata |
| |||
<MD5_Hash>976EDB782AE48FE0A84761BB608B1880</MD5_Hash> | contentMetadata | Used for checksum validation of a file during processing. This value will eventually be part of contentMetadata, but probably not as a value transferred from here. | |||
<restricted>False</restricted> |
| true=visible staff only, not discoverable .... Hypatia only | |||
<label name="Medium">5.25 inch Floppy Disks</label> |
| Part of <location> (1) | |||
<label name="Type">Books</label> |
| tag | |||
<type>Books</type> | Naomi says: the MODS documentation says this is a controlled vocabulary, so this won't work (see http://www.loc.gov/standards/mods/mods-outline.html#typeOfResource) | ||||
<title>The Burgess Shale and the Nature of History</title> | descMetadata | This is not the title of the file or the file content directly, but the author's title to which the file relates. | |||
<filetype>WordPerfect 4.2</filetype> | descMetadata |
| |||
<Duplicate_File> </Duplicate_File> |
| * blank, null value or empty string - original file , not a duplicate | |||
<Export_path>files\BU3A5.wp</Export_path> |
| The file as saved by FTK for further processing. | |||
(implied) | RELS-EXT | A link to the Media object |
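Several rows in the table above describe derived values: the disk image token taken from the head of <filepath>, the file size in bytes, a human-friendly size, and MD5 checksum validation. A rough sketch of those derivations follows. It assumes the fields are direct children of each <file> element, and the unit table beyond "B" is a guess, since only "B" appears in the sample. All function names are illustrative.

```python
import hashlib
import xml.etree.ElementTree as ET

# Values copied from the table above; the flat <file> layout is an assumption.
FILE_XML = """<file>
  <filename>BU3A5</filename>
  <filepath>CM006.001/NONAME [FAT12]/[root]/BU3A5</filepath>
  <filesize>35654</filesize>
  <filesize_unit>B</filesize_unit>
  <MD5_Hash>976EDB782AE48FE0A84761BB608B1880</MD5_Hash>
</file>"""

# Assumption: only "B" is confirmed by the sample; the others are guesses.
UNIT_FACTORS = {"B": 1, "KB": 1024, "MB": 1024 ** 2}


def disk_image_id(filepath: str) -> str:
    """Take the token at the head of the path and drop its segment suffix."""
    head = filepath.split("/", 1)[0]      # e.g. "CM006.001"
    return head.split(".", 1)[0]          # e.g. "CM006"


def size_in_bytes(size: str, unit: str) -> int:
    """Machine-friendly size, suitable for contentMetadata."""
    return int(float(size) * UNIT_FACTORS[unit])


def human_size(num_bytes: int) -> str:
    """Human-friendly size, suitable for descMetadata."""
    if num_bytes < 1024:
        return f"{num_bytes} B"
    value = float(num_bytes)
    for unit in ("KB", "MB", "GB"):
        value /= 1024
        if value < 1024 or unit == "GB":
            return f"{value:.1f} {unit}"


def md5_matches(data: bytes, expected_hex: str) -> bool:
    """Compare a file's MD5 digest with the value FTK reported."""
    return hashlib.md5(data).hexdigest().upper() == expected_hex.strip().upper()


record = ET.fromstring(FILE_XML)
media_id = disk_image_id(record.findtext("filepath"))
n_bytes = size_in_bytes(record.findtext("filesize"), record.findtext("filesize_unit"))
print(media_id, n_bytes, human_size(n_bytes))
```

The derived `media_id` is what would be used both in descriptive metadata and to look up the Media object for the RELS-EXT link. `md5_matches` would be called with the bytes of the exported file and the <MD5_Hash> value.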
(1) Location/container information -- for every file object created, create a <mods:location> description that places the resource in the context of the collection by combining collection name, intermediate series/group/etc name(s), and the ID+description of the media on which the file resides:
<location>collection-title - series-title - media-title (media-description)</location>
e.g.,
<location>Stephen J. Gould Papers - Series 6: Born Digital Materials - CM006 (5.25 inch Floppy Disks)</location>
Use this concept for objectLabel?
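The location pattern in note (1) can be sketched as a small helper. The function name and parameters are illustrative, and the element is emitted without a namespace prefix to match the sample; a real implementation would presumably place it in the MODS namespace.

```python
import xml.etree.ElementTree as ET


def location_label(collection_title, group_titles, media_id, media_description):
    """Join collection, intermediate series/group names, and media into one label."""
    parts = [collection_title, *group_titles, f"{media_id} ({media_description})"]
    return " - ".join(parts)


def location_element(label: str) -> str:
    """Wrap the label in a <location> element, escaping any special characters."""
    element = ET.Element("location")
    element.text = label
    return ET.tostring(element, encoding="unicode")


label = location_label(
    "Stephen J. Gould Papers",
    ["Series 6: Born Digital Materials"],
    "CM006",
    "5.25 inch Floppy Disks",
)
print(location_element(label))
```

Taking the intermediate names as a list lets the same helper handle collections with more than one level between the collection and the media.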