Background

Born-digital collections challenge traditional ways of describing content and making it discoverable. Disks, drives, and directories may contain many thousands of files, which would be difficult to describe in detail in an EAD. The approach described here starts with an EAD that contains only a single reference to the born-digital portion of a collection, expressed as a specific Series within the larger collection. The Forensic Toolkit (FTK) software is used to produce disk images and related analysis files, plus detailed technical information about individual files on the media. The output of this process is augmented, transformed, and eventually converted into a set of digital objects in DOR (Stanford's Fedora-based Digital Object Registry) and shared objects for Hypatia.

A separate process is used for converting the Collection EAD into metadata objects representing the full context of the born-digital and other materials, with links made between the containers and the detailed objects.

The Stanford directory output for the Gould collection contains the EAD plus the content and metadata files for both media and file objects (irrelevant files not shown):

  • M1437 Gould
    • Computer Media Photo
      • CM001.jpg
      • (etc)
    • Disk Image
      • CM001.001[.dd]

      • CM001.001.csv
      • CM001.001.txt
      • (etc)
    • Display Derivatives
      • {filename}.htm
    • EAD
    • FTK xml
      • files
        • {filename}
      • Report_transformed.xml
      • Report.fo
      • Report.xml

Note that the first two directories (Computer Media Photo and Disk Image) map to objects describing the physical media and will be the source for creating the "unprocessed" collection, while Display Derivatives and FTK xml map to individual file content and description and will be used to create the "processed" collection.

The Import/conversion process will produce this arrangement of objects in DOR:

  • Collection object
    • Series set -- "Series 1 ..."
    •    :
    • Series set -- "Series 6: Born Digital Materials"
      • Media object 1
      • Media object 2
      •    :
      • File object 1.1
      • File object 1.2
      •    :
      • File object 2.1
      • File object 2.2
      •    :

Note that even though the files originate on specific media, the "Media" objects are not sets in the DOR/Hydra sense of simple object aggregation. Instead, the file/media relationship is considered just one of many possible arrangements that can be expressed in metadata. A RELS-EXT relationship (hydra:isLocatedOn, onSourceMedia???) and a MODS <location> (for humans) express the file and media relationship, allowing this logical view:

  • Series set -- "Series 6: Born Digital Materials"
    • Media object 1
      • File object 1
      • File object 2
      •    :
    • Media object 2
    •    :

... while simple descriptive elements such as <type> allow other logical arrangements of the files based on the nature of their content.
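The idea that media membership is just one arrangement among several can be sketched in a few lines. This is only an illustration, not part of the conversion code: the `arrange` helper is hypothetical, the record keys mirror element names from the FTK report, and two of the three records are invented placeholders.

```python
from collections import defaultdict

# Illustrative records only. BU3A5 on CM006 comes from the sample report;
# EXAMPLE1 and EXAMPLE2 are invented placeholders, as are the type values.
records = [
    {"filename": "BU3A5", "disk_image_no": "CM006", "type": "Books"},
    {"filename": "EXAMPLE1", "disk_image_no": "CM006", "type": "Correspondence"},
    {"filename": "EXAMPLE2", "disk_image_no": "CM004", "type": "Books"},
]


def arrange(records, field):
    """Group filenames by the value of one metadata field (hypothetical helper)."""
    groups = defaultdict(list)
    for record in records:
        groups[record[field]].append(record["filename"])
    return dict(groups)


# The same flat list of file objects yields either logical view.
by_media = arrange(records, "disk_image_no")
by_type = arrange(records, "type")
print(by_media)
print(by_type)
```

Because the grouping is computed from metadata rather than stored as object aggregation, adding another arrangement only requires another descriptive field.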

Sample of the starting lines of the .txt file describing the media object:

Created By AccessData® FTK® Imager 3.0.1.1467 110406

Case Information:
Acquired using: ADI3.0.1.1467
Case Number: M1437
Evidence Number: CM004
Unique Description:
Examiner: Peter Chan
Notes: 5.25 inch Floppy Disk
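
A minimal sketch of how the "Case Information" block could be read into key/value pairs for the media object. The function name is hypothetical, and the sketch assumes that a blank line ends the block; the rest of the .txt file is not shown here, so that assumption may need adjusting.

```python
def parse_case_information(text):
    """Return the 'Case Information' key/value pairs from an FTK Imager .txt summary.

    Assumption: the block ends at the first blank line after its heading.
    """
    info = {}
    in_section = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Case Information:":
            in_section = True
            continue
        if not in_section:
            continue
        if not stripped:
            break  # assumed end of the block
        # Split on the first colon only, so values may contain colons.
        key, _, value = stripped.partition(":")
        info[key.strip()] = value.strip()
    return info


sample = """Created By AccessData® FTK® Imager 3.0.1.1467 110406

Case Information:
Acquired using: ADI3.0.1.1467
Case Number: M1437
Evidence Number: CM004
Unique Description:
Examiner: Peter Chan
Notes: 5.25 inch Floppy Disk
"""

info = parse_case_information(sample)
print(info["Evidence Number"], "/", info["Notes"])
```

Empty fields such as "Unique Description" come back as empty strings rather than being dropped, so a caller can tell a blank value from a missing key.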

Sample of transformed FTK file available as input:

<ftk_report xmlns:fo="http://www.w3.org/1999/XSL/Format">
    <series>Series 6: Born Digital Materials</series>
    <collection_title>Stephen Jay Gould papers</collection_title>
    <callnumber>M1437</callnumber>
    <file>
        <filename>BU3A5</filename>
        <item_number>1004</item_number>
        <filepath>CM006.001/NONAME [FAT12]/[root]/BU3A5</filepath>
        <disk_image_no>CM006</disk_image_no>
        <filesize>35654</filesize>
        <filesize_unit>B</filesize_unit>
        <file_creation_date>n/a</file_creation_date>
        <file_accessed_date>n/a</file_accessed_date>
        <file_modified_date>12/8/1988 6:48:48 AM (1988-12-08 14:48:48 UTC)</file_modified_date>
        <md5_hash>976EDB782AE48FE0A84761BB608B1880</md5_hash>
        <restricted>False</restricted>
        <access_rights>Public</access_rights>
        <title>The Burgess Shale and the Nature of History </title>
        <filetype>WordPerfect 4.2</filetype>
        <duplicate_File> </duplicate_File>
        <export_path>files\BU3A5</export_path>
    </file>
    <!-- further <file> elements omitted -->
</ftk_report>

Each <file> segment will be the basis of a separate Hypatia Digital Object. This is an example where the atomistic model adds overhead (separate metadata and content objects), and an integrated object combining commonMetadata and genericContent could be considered.
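
As a rough sketch of the first conversion step, the transformed report can be read with Python's standard `xml.etree.ElementTree`, producing one record per <file> segment. The function name `read_report` and the dictionary layout are assumptions made for illustration; the embedded XML is an abridged copy of the sample above.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the sample report, closed so that it parses.
SAMPLE = """<ftk_report xmlns:fo="http://www.w3.org/1999/XSL/Format">
    <series>Series 6: Born Digital Materials</series>
    <collection_title>Stephen Jay Gould papers</collection_title>
    <callnumber>M1437</callnumber>
    <file>
        <filename>BU3A5</filename>
        <disk_image_no>CM006</disk_image_no>
        <md5_hash>976EDB782AE48FE0A84761BB608B1880</md5_hash>
        <title>The Burgess Shale and the Nature of History </title>
        <filetype>WordPerfect 4.2</filetype>
    </file>
</ftk_report>"""


def read_report(xml_text):
    """Return (collection-level context, list of per-file records)."""
    root = ET.fromstring(xml_text)
    context = {
        tag: (root.findtext(tag) or "").strip()
        for tag in ("collection_title", "series", "callnumber")
    }
    files = []
    for node in root.findall("file"):
        # One dict per <file>, keyed by child element name, values trimmed.
        files.append({child.tag: (child.text or "").strip() for child in node})
    return context, files


context, files = read_report(SAMPLE)
print(context["callnumber"], len(files), files[0]["title"])
```

Each record would then become the input for one digital object, with the collection-level context shared across all of them.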

Information from: Disk Image // CMnnn.001.txt

maps to

notes

<collection_title>Stephen J. Gould Papers</collection_title>

Collection object
   descMetadata
      <mods:title>
Series & item objects
   descMetadata
      <mods:location> (1)

Same as EAD <archdesc><title>

Note that both the Collection and Series object may originate with the EAD rather than from the FTK output described here.

<series>Series 6: Born Digital Materials</series>

Series set object
   descMetadata
      <mods:title>
Item objects
   descMetadata
      <mods:location> (1)

Same as EAD series <c><unittitle>

<note>5.25 inch Floppy Disks</note>

Series object
   descMetadata
      <mods:physicalDescription>
          <mods:extent>

Would correspond to EAD <physdesc> in a node describing the media.

<callnumber>M1437</callnumber>

Collection object
   descMetadata
      <mods:identifier type="unitid" displayLabel="Call Number:">

Same as EAD <archdesc><did><unitid>


Information from: FTK xml // Report_transformed.xml

maps to (within item objects)

notes

<filename>BU3A5</filename>

n/a

The original file name as it appeared on the original media.

<item_number>1004</item_number>

n/a

internal FTK reference only, to disambiguate references in the FTK report

<filepath>CM006.001/NONAME [FAT12]/[root]/BU3A5</filepath>

 

Location of the file on the original media. Everything after [root] can be taken as the fully qualified filename.

<disk_image_no>CM006</disk_image_no>

descMetadata
   <mods:location> (1)

This token, taken from the head of the <filepath>, is the only data link between the FTK output for a file object and the corresponding media object. We want a data link in descriptive metadata as well as an RDF link to the corresponding object.

<filesize>35654</filesize>

 

Could the conversion use this to compare against the locally computed file size, as a quick check before checksum validation?

<filesize_unit>B</filesize_unit>

 

Needed to correctly interpret <filesize>, if used

<file_creation_date>n/a</file_creation_date>

note?

 

<file_accessed_date>n/a</file_accessed_date>

note?

 

<file_modified_date>12/8/1988 6:48:48 AM (1988-12-08 14:48:48 UTC)</file_modified_date>

note?

 

<md5_hash>976EDB782AE48FE0A84761BB608B1880</md5_hash>

 

Used for checksum validation of a file during processing. This value will eventually be part of contentMetadata, but probably not as a value transferred from here.

<restricted>False</restricted>

 

True = visible to staff only, not discoverable (Hypatia only).

<type>Books</type>

descMetadata
   <mods:subject>
      <mods:topic>

<topic> or <genre>? Which authority?

<title>The Burgess Shale and the Nature of History</title>

descMetadata
   <mods:title>

 

<filetype>WordPerfect 4.2</filetype>

descMetadata
   <mods:note displayLabel="File type">

 

<duplicate_File> </duplicate_File>

 

  • Blank, null value, or empty string: the file is unique in the collection, with no duplicates.
  • "M": the main file in a duplicate relationship. It is neither better nor worse than the duplicate file; it is simply the file that was examined first.
  • "D": a duplicate file.

Note that this is content duplication, determined by matching checksums; name conflicts are a separate issue and are handled another way. The two files may or may not have the same name. It is desirable to have a note and/or relationship in each record indicating the presence of a duplicate file in the collection. Details TBD.

<export_path>files\BU3A5.wp</export_path>

 

The file as saved by FTK for further processing.
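
The size and checksum checks mentioned for <filesize> and the MD5 hash could look roughly like the sketch below. It is an assumption-laden illustration rather than the actual conversion code: `validate_export` is a hypothetical name, only the "B" (bytes) unit seen in the sample is handled, and the demo uses a throwaway temporary file in place of a real FTK export.

```python
import hashlib
import os
import tempfile


def validate_export(path, expected_size, size_unit, expected_md5):
    """Return a list of problems; an empty list means the file passed both checks."""
    if size_unit != "B":
        # Only bytes appear in the sample report; other units are not guessed at.
        raise ValueError(f"unhandled filesize_unit: {size_unit!r}")
    actual_size = os.path.getsize(path)
    if actual_size != int(expected_size):
        # Cheap check first: skip the checksum when the size already disagrees.
        return [f"size mismatch: expected {expected_size}, found {actual_size}"]
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    # FTK reports the hash in upper case; compare case-insensitively.
    if digest.hexdigest().upper() != expected_md5.strip().upper():
        return ["MD5 mismatch"]
    return []


# Demo with a throwaway file standing in for an exported file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_md5 = hashlib.md5(b"hello").hexdigest().upper()
print(validate_export(tmp.name, "5", "B", demo_md5))  # prints []
os.remove(tmp.name)
```

Running the size comparison first keeps the common failure case (a truncated or missing export) cheap, since the file is only read in full when the sizes agree.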

(1) Location/container information -- for every file object created, create a <mods:location> description that places the resource in the context of the collection by combining collection name, intermediate series/group/etc name(s), and the ID+description of the media on which the file resides, e.g.,

      <location>Stephen J. Gould Papers - Series 6: Born Digital Materials - CM006 (5.25 inch Floppy Disks)</location>
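
A small sketch of how this location text might be assembled, together with the <filepath> handling described in the mapping table. Both function names are hypothetical, and the " - " separator is simply copied from the example above rather than taken from any specification.

```python
def split_filepath(filepath):
    """Split an FTK <filepath> into (media id, name below [root]).

    The media id is the head token up to the first dot, e.g. "CM006.001" -> "CM006".
    """
    head, _, _ = filepath.partition("/")
    media_id = head.split(".", 1)[0]
    marker = "[root]/"
    index = filepath.find(marker)
    # Everything after [root] is treated as the fully qualified filename.
    qualified_name = filepath[index + len(marker):] if index != -1 else ""
    return media_id, qualified_name


def location_text(collection, groups, media_id, media_note):
    """Join collection, intermediate series/group names, and media id + description."""
    return " - ".join([collection, *groups, f"{media_id} ({media_note})"])


media_id, name = split_filepath("CM006.001/NONAME [FAT12]/[root]/BU3A5")
print(media_id, name)
print(location_text(
    "Stephen J. Gould Papers",
    ["Series 6: Born Digital Materials"],
    media_id,
    "5.25 inch Floppy Disks",
))
```

Passing the intermediate names as a list leaves room for collections with more than one level between the collection and the media.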
