
Background

Born-digital collections present a challenge to traditional ways of describing content and making it discoverable. Disks, drives, directories, etc. may contain many thousands of files which would be difficult to describe in detail in an EAD. The approach described here starts with an EAD that contains only a single reference to the born-digital portion of a collection, expressed as a specific Series within the larger collection. The Forensic Toolkit (FTK) software is used to produce both disk images and related analysis files, plus detailed technical information about individual files on the media. The output of this process is augmented, transformed, and eventually translated into a set of digital objects in DOR (Stanford's Fedora-based Digital Object Registry) and shared objects for Hypatia.

A separate process is used for converting the Collection EAD into metadata objects representing the full context of the born-digital and other materials, with links made between the containers and the detailed objects.

The Stanford Stephen J. Gould collection will be used to describe the initial implementation of this process. It is believed to represent a template for similar work with other born-digital collections going forward.

The directory output for the Gould collection contains a locally transformed version of the FTK output, the EAD, and the content and metadata files for both Media and File objects (irrelevant files not shown):

  • M1437 Gould
    • Computer Media Photo
      • CM001.jpg
      • (etc)
    • Disk Image
      • CM001.001[.dd]
      • CM001.001.csv
      • CM001.001.txt
      • (etc)
    • Display Derivatives
      • {filename}.htm
    • EAD
    • FTK xml
      • files
        • {filename}
      • Report.fo
      • Report.xml
      • Report_transformed.xml

    Note that the first 2 directories map to objects describing the physical media and will be the source for creating the "unprocessed" collection, while Display Derivatives and FTK files map to individual file content and description and will be used to create the "processed" collection.

    The import/conversion process will produce this hierarchy of media and file objects in DOR. The Media and File objects are the focus here; we are not concerned with how the Collection and Series nodes themselves get created, nor with how much of the rest of the EAD is also represented as objects:

    • Collection object
      • Series set -- "Series 1 ..."
      •    :
      • Series set -- "Series 6: Born Digital Materials"
        • Media object 1
          • File object 1
          • File object 2
          •    :
        • Media object 2
        •    :

    Any intellectual arrangement of this information -- categorization into genres (correspondence, novels, etc) for instance, or tracing iterations of a work across devices, etc -- will be a separate process of augmenting the descriptive metadata for these objects.

    Each <file> segment will be the basis of a separate DOR or Hypatia Digital Object. The differences are:

    • Stanford DOR (Digital Object Registry) objects are metadata-only, with content externally managed. Objects at Stanford will not have "content" datastreams.
    • Stanford objects have an identityMetadata datastream that may or may not be present in Hypatia demo objects. Regardless, it is not a standard part of Hydra-compliant objects.

    Collection and Series objects

    The Collection and Born-Digital Series objects themselves are created first, ahead of FTK processing. All FTK-processed materials for a collection are processed together and are members of the Born-Digital Series set. Media objects must be linked to the appropriate series via an isMemberOf relationship.

    Media (e.g. Disk Image) objects

    The FTK processing must first create a set of media objects representing the physical media (hard drive, diskette, etc) on which the files were found. This has been described as a view of the "unprocessed" collection, meaning it has not been processed down to the individual units of content, the separate files. 

    Note that a Media object has characteristics of

    • an "item" -- it represents a unit of meaning and has content "parts" as separate objects
    • a "set" -- it has objects related to it as members ... should we consider a specialized relationship for this?

    Sample of the starting lines of the .txt file describing the media object.

    Created By AccessData® FTK® Imager 3.0.1.1467 110406

    Case Information:
    Acquired using: ADI3.0.1.1467
    Case Number: M1437
    Evidence Number: CM004
    Unique Description:
    Examiner: Peter Chan
    Notes: 5.25 inch Floppy Disk

    Note that some of this information is repeated as header information in the file-level FTK report below.
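As a rough illustration of how the header lines in the sample above could be read before mapping them to datastreams, here is a minimal sketch. It assumes the .txt file keeps the simple "Key: value" layout shown in the sample; the function name parse_ftk_imager_header is hypothetical and not part of any existing conversion code.

```python
def parse_ftk_imager_header(text):
    """Collect 'Key: value' lines from an FTK Imager .txt summary into a dict.

    Assumption: every field of interest sits on one line as 'Key: value'.
    Lines without a colon (such as the 'Created By' banner) are skipped.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


sample = """Created By AccessData FTK Imager 3.0.1.1467 110406

Case Information:
Acquired using: ADI3.0.1.1467
Case Number: M1437
Evidence Number: CM004
Unique Description:
Examiner: Peter Chan
Notes: 5.25 inch Floppy Disk
"""

header = parse_ftk_imager_header(sample)
print(header["Case Number"], header["Evidence Number"], header["Notes"])
```

Fields left empty in the source, such as "Unique Description", come back as empty strings, so a caller can tell "present but blank" apart from "missing".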

    From: Disk Image // CMnnn.001.txt

    maps to

    notes

    Evidence Number: CM004

    descMetadata
       <mods:title>

    Would correspond to EAD <c><unittitle>

    Evidence Number: CM004

    descMetadata
       <mods:identifier type="???">

    Would correspond to EAD <c><unitid>

    Case Number: M1437

    DC
      <dc:identifier>hypatia:M1437</dc:identifier>
    descMetadata
    <mods:identifier type="local">M1437</mods:identifier>

     

    Notes: 5.25 inch Floppy Disks

    descMetadata
       <mods:physicalDescription>
           <mods:extent>

    Would correspond to EAD <physdesc> in a node describing the media.

    (implied)

    RELS-EXT
       isMemberOfCollection

    A link to the Collection object

    (implied)

    RELS-EXT
       isMemberOf

    A link to the Series object

    identityMetadata – label = collection context...
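The two implied RELS-EXT links for a Media object could be generated along the following lines. This is only a sketch: the object identifiers passed in (other than the hypatia:M1437 value that appears in the mapping above) are hypothetical placeholders, and build_rels_ext is an illustrative name rather than an existing function.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
REL_NS = "info:fedora/fedora-system:def/relations-external#"


def build_rels_ext(pid, collection_pid, series_pid):
    """Return RELS-EXT RDF/XML linking a Media object to its Collection and Series."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("rel", REL_NS)
    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        rdf, f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": f"info:fedora/{pid}"},
    )
    # Link to the Collection object
    ET.SubElement(
        desc, f"{{{REL_NS}}}isMemberOfCollection",
        {f"{{{RDF_NS}}}resource": f"info:fedora/{collection_pid}"},
    )
    # Link to the Series object
    ET.SubElement(
        desc, f"{{{REL_NS}}}isMemberOf",
        {f"{{{RDF_NS}}}resource": f"info:fedora/{series_pid}"},
    )
    return ET.tostring(rdf, encoding="unicode")


# The media and series identifiers below are made up for illustration only.
rels_ext = build_rels_ext(
    "hypatia:example_media_CM004", "hypatia:M1437", "hypatia:example_series_6"
)
print(rels_ext)
```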

    File objects

    File objects are the node objects representing individual files. The atomistic model would have these objects constructed as a parent (metadata) object and a child (content) object.  For simplicity, we will create these File objects as a single object, combining the Hydra commonMetadata and genericContent models.

    Sample of the transformed FTK file available as input (though the conversion modules will not use this as input, but rather work from the same mapping rules that express this output):

    <ftk_report xmlns:fo="http://www.w3.org/1999/XSL/Format">
        <series>Series 6: Born Digital Materials</series>
        <collection_title>Stephen Jay Gould papers</collection_title>
        <callnumber>M1437</callnumber>
        <file>
            <filename>BU3A5</filename>
            <item_number>1004</item_number>
            <filepath>CM006.001/NONAME [FAT12]/[root]/BU3A5</filepath>
            <disk_image_no>CM006</disk_image_no>
            <filesize>35654</filesize>
            <filesize_unit>B</filesize_unit>
            <file_creation_date>n/a</file_creation_date>
            <file_accessed_date>n/a</file_accessed_date>
            <file_modified_date>12/8/1988 6:48:48 AM (1988-12-08 14:48:48 UTC)</file_modified_date>
            <md5_hash>976EDB782AE48FE0A84761BB608B1880</md5_hash>
            <restricted>False</restricted>
            <access_rights>Public</access_rights>
            <medium>5.25 inch Floppy Disks</medium>
            <title>The Burgess Shale and the Nature of History </title>
            <filetype>WordPerfect 4.2</filetype>
            <duplicate_File> </duplicate_File>
            <export_path>files\BU3A5</export_path>

    ...

        </file>
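For orientation, a report of the shape shown above could be split into collection-level context and one record per <file> segment roughly as follows. This is a sketch using Python's standard xml.etree.ElementTree; the embedded sample is abbreviated from the one above, and read_report is a hypothetical name, not the actual conversion module.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the transformed FTK report shown above.
SAMPLE = """<ftk_report xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <series>Series 6: Born Digital Materials</series>
  <collection_title>Stephen Jay Gould papers</collection_title>
  <callnumber>M1437</callnumber>
  <file>
    <filename>BU3A5</filename>
    <item_number>1004</item_number>
    <disk_image_no>CM006</disk_image_no>
    <filesize>35654</filesize>
    <md5_hash>976EDB782AE48FE0A84761BB608B1880</md5_hash>
  </file>
</ftk_report>"""


def read_report(xml_text):
    """Return (context, files): report-level fields and one dict per <file>."""
    root = ET.fromstring(xml_text)
    context = {
        tag: (root.findtext(tag) or "").strip()
        for tag in ("series", "collection_title", "callnumber")
    }
    files = []
    for file_el in root.findall("file"):
        # Each child element of <file> becomes one key/value pair.
        files.append({child.tag: (child.text or "").strip() for child in file_el})
    return context, files


context, files = read_report(SAMPLE)
print(context["callnumber"], len(files), files[0]["filename"])
```

Each dictionary in files would then be the starting point for one File object, with the context dictionary supplying the collection and series information shared by all of them.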

    Information from: Disk Image FTK xml // CMnnn.001.txt

    maps to

    notes

    <collection_title>Stephen J. Gould Papers

    Collection object
       descMetadata
          <mods:title>
    Series & item objects
       descMetadata
          <mods:location> (1)

    Equivalent to EAD <archdesc><title>

    <series>Series 6: Born Digital Materials

    Series set object
       descMetadata
          <mods:title>
    Item objects
       descMetadata
          <mods:location> (1)

    Equivalent to EAD series <c><unittitle>

    <note>5.25 inch Floppy Disks</note>

    Series object
       descMetadata
          <mods:physicalDescription>
              <mods:extent>

    Corresponds to EAD <physdesc>
    Value will be the same as "Medium" at the file level.

    <callnumber>M1437</callnumber>

    Collection object
       descMetadata
          <mods:identifier type="unitid" displayLabel="Call Number:">

    Corresponds to EAD <archdesc><did><unitid>

    Information from: FTK xml // Report_transformed.xml

    maps to (all within item objects)

    notes

    <filename>BU3A5</filename>

    descMetadata
       <mods:identifier type="filename">BU3A5</mods:identifier>

    This is the original file name as it appeared on the original media.

    <Item_Number>1004</Item_Number>

    descMetadata
      <mods:identifier type="ftk_id">1004</mods:identifier>

    Internal FTK reference only, used to disambiguate references in the FTK report.

    <filepath>CM006.001/NONAME [FAT12]/[root]/BU3A5</filepath>

    descMetadata
      <mods:location>
         <mods:physicalLocation type="filepath">CM006.001/NONAME [FAT12]/[root]/BU3A5</mods:physicalLocation>
      </mods:location>
    original file in FTK xml // files
    display derivatives in Display Derivatives named using <item_number>

    Location of the file on the original media. Everything after [root] can be taken as the fully qualified filename; the object filepath is the portion after [root], up to but not including the final filename token.

    <disk_image_no>CM006</disk_image_no>

    Naomi thinks this should be handled with a link to the media (disk image) object

    RELS-EXT
      <isMemberOf xmlns="info:fedora/fedora-system:def/relations-external#" rdf:resource="info:fedora/hypatia:(id for media object CM006)"/>

    descMetadata
       <mods:location> (1)

    This token, taken from the head of the <filepath>, is the only data link between the FTK output for a file object and the corresponding media object. We want a data link in descriptive metadata as well as an RDF link to the corresponding object.

    <filesize>35654</filesize>

    descMetadata - human friendly
      <mods:physicalDescription>
         <mods:extent>35654</mods:extent>
       </mods:physicalDescription>

    contentMetadata - for machine
      <contentMetadata>
              <resource>
                <file size="35654"

    This should be a human friendly version of the file size.  The machine friendly version is in contentMetadata.

    Could be used by conversion to compare against the file size as computed locally, a quick check prior to checksum validation?

    <filesize_unit>B</filesize_unit>  

    use to determine filesize in bytes (convert to bytes if necessary)

    Needed to correctly interpret <filesize>, if used

    <file_creation_date>n/a</file_creation_date>
    note?

    descMetadata
      <mods:originInfo>
         <mods:dateCreated>n/a</mods:dateCreated>
      </mods:originInfo>

     

    <file_accessed_date>n/a</file_accessed_date>
    note?

    descMetadata
      <mods:originInfo>
         <mods:dateOther>n/a</mods:dateOther>
      </mods:originInfo>

     

    <file_modified_date>12/8/1988 6:48:48 AM (1988-12-08 14:48:48 UTC)</file_modified_date>
    note?

    descMetadata
      <mods:originInfo>
        <mods:dateModified>12/8/1988 6:48:48 AM (1988-12-08 14:48:48 UTC)</mods:dateModified>
      </mods:originInfo>

     

    <MD5_Hash>976EDB782AE48FE0A84761BB608B1880</MD5_Hash>

    contentMetadata
      <file>
          <checksum type="md5">976EDB782AE48FE0A84761BB608B1880</checksum>
      </file>

    Used for checksum validation of a file during processing. This value will eventually be part of contentMetadata, but probably not as a value transferred from here.

    <restricted>False</restricted>

     

    True = visible to staff only, not discoverable ... Hypatia only

    <label name="Medium">5.25 inch Floppy Disks</label>

     

    Part of <location> (1)

    <label name="Type">Books</label>

     

    tag


    <type>Books</type>

    Naomi sez: MODS doc says this is controlled vocab, so this won't work ... http://www.loc.gov/standards/mods/mods-outline.html#typeOfResource
    descMetadata
      <mods:typeOfResource>Books</mods:typeOfResource>


    <title>The Burgess Shale and the Nature of History</title>

    descMetadata
       <mods:relatedItem displayLabel="Appears in" type="host">
          <mods:titleInfo>
             <mods:title>The Burgess Shale and the Nature of History</mods:title>
          </mods:titleInfo>
       </mods:relatedItem>

    This is not the title of the file or the file content directly, but the author's title to which the file relates.  

    <filetype>WordPerfect 4.2</filetype>

    note?

    descMetadata
       <mods:note displayLabel="filetype">

     

    <Duplicate_File> </Duplicate_File>

     

    * blank, null value or empty string - original file is unique in collection, no duplicates
    * "Primary" - possibly indicates Primary file to keep/store
    * "SecondaryM" - The main file in a duplicate relationship. Neither better nor worse than the duplicate file, but simply the file examined first.
    * "D" - indicates a duplicate file to be ignored
    --> ignore for now .

    Note that this is content duplication based on having the same checksum (name conflicts are different and handled another way). The two files may or may not have the same name.  It is desirable to have a note and/or relationship in each record indicating the presence of a duplicate file in the collection. Details tbd.

    <export_path>files\BU3A5.wp</export_path>

     

    The file as saved by FTK for further processing and made available for the DOR object. Note that it may have a file extension added by FTK.

    (implied)

    RELS-EXT
       isMemberOf

    (see also  <disk_image_no>CM006</disk_image_no> )

    A link to the Media object
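Several of the file-level rules in the table above (taking the media ID from the head of <filepath>, interpreting <filesize> via <filesize_unit>, and the quick size check before checksum validation) can be sketched together. This is an illustration under stated assumptions, not the conversion code itself: the function names are hypothetical, and only the unit "B" is attested in the sample report, so the other entries in the unit table are guesses.

```python
import hashlib

# Hypothetical unit table: only "B" appears in the sample report; the other
# entries are assumptions about what FTK might emit.
UNIT_FACTORS = {"B": 1, "KB": 1024, "MB": 1024 ** 2}


def split_filepath(filepath):
    """Split an FTK <filepath> into (media id, path after the [root] marker)."""
    head, _, rest = filepath.partition("/")
    media_id = head.split(".", 1)[0]  # "CM006.001" -> "CM006"
    marker = "[root]/"
    idx = rest.find(marker)
    after_root = rest[idx + len(marker):] if idx != -1 else None
    return media_id, after_root


def size_in_bytes(filesize, unit):
    """Interpret <filesize> using <filesize_unit>; unknown units raise."""
    factor = UNIT_FACTORS.get(unit.strip().upper())
    if factor is None:
        raise ValueError(f"unrecognized filesize unit: {unit!r}")
    return int(filesize) * factor


def quick_check(data, expected_size, expected_md5):
    """Cheap size comparison first, then MD5 comparison (case-insensitive)."""
    if len(data) != expected_size:
        return False
    return hashlib.md5(data).hexdigest() == expected_md5.strip().lower()


media_id, after_root = split_filepath("CM006.001/NONAME [FAT12]/[root]/BU3A5")
print(media_id, after_root)
```

The media ID returned here is the same token carried in <disk_image_no>, so either source could drive the isMemberOf link to the Media object.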

    (1) Location/container information -- for every file object created, create a <mods:location> description that places the resource in the context of the collection by combining collection name, intermediate series/group/etc name(s), and the ID+description of the media on which the file resides:

         <location>collection-title - series-title - media-title (media-description)</location>

    e.g.,

               <location>Stephen J. Gould Papers - Series 6: Born Digital Materials - CM006 (5.25 inch Floppy Disks)</location>

    Use this concept for objectLabel?
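The location string described in note (1) is a simple join, which might be sketched as below. The function name location_label and its parameters are illustrative; the only thing taken from the note is the "collection - series - media (description)" pattern.

```python
def location_label(collection_title, intermediate_titles, media_id, media_description):
    """Join collection, intermediate series/group names, and media into one label."""
    parts = [collection_title, *intermediate_titles, f"{media_id} ({media_description})"]
    return " - ".join(parts)


label = location_label(
    "Stephen J. Gould Papers",
    ["Series 6: Born Digital Materials"],
    "CM006",
    "5.25 inch Floppy Disks",
)
print(label)
```

Taking the intermediate names as a list leaves room for collections with more than one level between the collection and the media.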

    September 17 Update