Comment: Migrated to Confluence 4.0

...

CM001.001[.dd]
  • CM001.001.csv
  • CM001.001.txt
  • (etc)
  • Display Derivatives
    • {filename}.htm
  • EAD
  • FTK xml
    • files
      • {filename}
    • Report_transformed.xml
    • Report.fo
    • Report.xml
  • Note that the first two directories map to objects describing the physical media and will be the source for creating the "unprocessed" collection, while Display Derivatives and FTK files map to individual file content and description and will be used to create the "processed" collection.


    The Stanford Stephen J. Gould collection will be used to describe the initial implementation of this process.  It is believed to represent a template for similar work with other born digital collections going forward.

    The directory output for the Gould collection contains the EAD and the content and metadata files for both Media and file objects (irrelevant files not shown):

    • M1437 Gould
      • Computer Media Photo
        • CM001.jpg
        • (etc)
      • Disk Image

    The Import/conversion process will produce this arrangement of media and file objects in DOR. These are shown in bold; we are not concerned here how the Collection and Series nodes themselves get created, nor how much of the rest of the EAD is also represented as objects:

    • Collection object
      • Series set -- "Series 1 ..."
      •    :
      • Series set -- "Series 6: Born Digital Materials"
        • Media object 1
          • File object 1
          • File object 2
               :
        • Media object 2
        •    :

    Any intellectual arrangement of this information -- categorization into genres (correspondence, novels, etc) for instance, or tracing iterations of a work across devices, etc -- will be a separate process of augmenting the descriptive metadata for these objects.

    ...

    • an "item" -- it represents a unit of meaning and has content "parts" as separate objects
    • a "set" -- it has objects related to it as members ... should we consider a specialized relationship for this?

    Sample of the starting lines of the .txt file describing the media object.

    ...

    Sample of the transformed FTK file (the conversion modules will not use this file as input; they will work from the same mapping rules that produce this output).

    <ftk_report xmlns:fo="http://www.w3.org/1999/XSL/Format">
        <series>Series 6: Born Digital Materials</series>
        <collection_title>Stephen Jay Gould papers</collection_title>
        <callnumber>M1437</callnumber>
        <file>
            <filename>BU3A5</filename>
            <item_number>1004</item_number>
            <filepath>CM006.001/NONAME [FAT12]/[root]/BU3A5</filepath>
            <disk_image_no>CM006</disk_image_no>
            <filesize>35654</filesize>
            <filesize_unit>B</filesize_unit>
            <file_creation_date>n/a</file_creation_date>
            <file_accessed_date>n/a</file_accessed_date>
            <file_modified_date>12/8/1988 6:48:48 AM (1988-12-08 14:48:48 UTC)</file_modified_date>
            <md5_hash>976EDB782AE48FE0A84761BB608B1880</md5_hash>
            <restricted>False</restricted>
            <access_rights>Public</access_rights>
            <title>The Burgess Shale and the Nature of History </title>
            <filetype>WordPerfect 4.2</filetype>
            <duplicate_File> </duplicate_File>
            <export_path>files\BU3A5</export_path>
        </file>
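
A conversion module has to read each `<file>` record before any mapping can be applied. The following is a minimal sketch of that step, assuming the report parses as well-formed XML. The function name `read_file_records` and the abbreviated `SAMPLE` string are illustrative only and not part of the Hypatia code; the closing `</ftk_report>` tag is added here just so the snippet parses on its own.

```python
import xml.etree.ElementTree as ET

# Abbreviated fragment modeled on the sample report; the closing
# </ftk_report> is an assumption added so this snippet is parseable.
SAMPLE = """<ftk_report xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <series>Series 6: Born Digital Materials</series>
  <collection_title>Stephen Jay Gould papers</collection_title>
  <callnumber>M1437</callnumber>
  <file>
    <filename>BU3A5</filename>
    <item_number>1004</item_number>
    <disk_image_no>CM006</disk_image_no>
    <filesize>35654</filesize>
    <filesize_unit>B</filesize_unit>
  </file>
</ftk_report>"""


def read_file_records(xml_text):
    """Return one dict per <file> element, keyed by child tag name."""
    root = ET.fromstring(xml_text)
    records = []
    for file_el in root.findall("file"):
        # Missing text (empty elements) is normalized to an empty string.
        records.append({child.tag: (child.text or "").strip() for child in file_el})
    return records


records = read_file_records(SAMPLE)
print(records[0]["filename"], records[0]["disk_image_no"])
```

Each dict can then be handed to the per-element mappings described in the table that follows.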


    Information from: FTK xml // Report_transformed.xml

    maps to (within item objects)                                  

    notes

    <filename>BU3A5</filename>

    descMetadata
       <mods:identifier type="filename">BU3A5</mods:identifier>

    this is the original file name as it appeared on the original media.

    <item_number>1004</item_number>

    descMetadata
      <mods:identifier type="ftk_id">1004</mods:identifier>

    internal FTK reference only, to disambiguate references in the FTK report
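
The two identifier mappings above could be generated along these lines. This is a sketch only: `build_identifiers` is a hypothetical helper, and wrapping the identifiers in a `<mods:mods>` root is an assumption about how the descMetadata datastream is assembled. The namespace URI is the standard MODS v3 namespace.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
# Serialize with the "mods" prefix used throughout this page.
ET.register_namespace("mods", MODS_NS)


def build_identifiers(filename, ftk_id):
    """Return a <mods:mods> element holding the two identifier mappings."""
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    for id_type, value in (("filename", filename), ("ftk_id", ftk_id)):
        ident = ET.SubElement(mods, f"{{{MODS_NS}}}identifier", {"type": id_type})
        ident.text = value
    return mods


mods = build_identifiers("BU3A5", "1004")
print(ET.tostring(mods, encoding="unicode"))
```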

    <filepath>CM006.001/NONAME [FAT12]/[root]/BU3A5</filepath>

    descMetadata
      <mods:location>
          <mods:physicalLocation type="filepath">CM006.001/NONAME [FAT12]/[root]/BU3A5</mods:physicalLocation>
      </mods:location>

    location of file on original media
    everything after [root] can be taken as the fully qualified filename

    <disk_image_no>CM006</disk_image_no>

    Naomi thinks this should be handled with a link to the media (disk image) object

    RELS-EXT
      <isMemberOf xmlns="info:fedora/fedora-system:def/relations-external#" rdf:resource="info:fedora/hypatia:(id for media object CM006)"/>

    descMetadata
       <mods:location> (1)

    This token, taken from the head of the <filepath>, is the only data link between the FTK output for a file object and the corresponding media object. We want a data link in descriptive metadata as well as an RDF link to the corresponding object.
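
Both pieces of information carried by `<filepath>` (the media token at its head and the qualified filename after `[root]`) can be pulled out with plain string handling. The sketch below assumes the media token is the part of the first path segment before the first `.` (so `CM006.001` yields `CM006`); that rule is inferred from the single sample above and `split_filepath` is a hypothetical name.

```python
def split_filepath(filepath):
    """Split an FTK <filepath> into (media_id, path_after_root).

    media_id is the leading token up to the first '.', e.g. 'CM006'.
    path_after_root is everything following the '[root]/' marker.
    """
    head = filepath.split("/", 1)[0]      # first segment, e.g. 'CM006.001'
    media_id = head.split(".", 1)[0]      # assumed rule: text before first '.'
    marker = "[root]/"
    idx = filepath.find(marker)
    if idx == -1:
        raise ValueError(f"no {marker!r} marker in {filepath!r}")
    return media_id, filepath[idx + len(marker):]


media_id, qualified_name = split_filepath("CM006.001/NONAME [FAT12]/[root]/BU3A5")
print(media_id, qualified_name)
```

The derived `media_id` can be cross-checked against `<disk_image_no>` before the `isMemberOf` relationship is written.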

    <filesize>35654</filesize>

    descMetadata - human friendly
      <mods:physicalDescription>
         <mods:extent>35654</mods:extent>
       </mods:physicalDescription>

    contentMetadata - for machine
      <contentMetadata>
              <resource>
                <file size="35654"

    This should be a human friendly version of the file size.  The machine friendly version is in contentMetadata.

    Could be used by conversion to compare against the file size as computed locally, a quick check prior to checksum validation?

    <filesize_unit>B</filesize_unit>

    Use to determine the file size in bytes (convert to bytes if necessary)

    Needed to correctly interpret <filesize>, if used
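
The size normalization and the quick local-size check suggested above might look like the following. Only the unit `B` appears in the sample report, so the other entries in `UNIT_MULTIPLIERS` (and the choice of 1024-based multiples) are assumptions, as are the function names.

```python
import os

# Assumption: only "B" is attested in the sample report; the remaining
# units and their 1024-based multipliers are guesses for illustration.
UNIT_MULTIPLIERS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def filesize_in_bytes(filesize, unit):
    """Convert an FTK <filesize>/<filesize_unit> pair to an integer byte count."""
    key = unit.strip().upper()
    if key not in UNIT_MULTIPLIERS:
        raise ValueError(f"unknown filesize unit: {unit!r}")
    return int(float(filesize) * UNIT_MULTIPLIERS[key])


def size_matches(path, filesize, unit):
    """Cheap pre-check before checksum validation: compare sizes only."""
    return os.path.getsize(path) == filesize_in_bytes(filesize, unit)


print(filesize_in_bytes("35654", "B"))
```

A size mismatch is enough to reject a file without computing its checksum; a match still needs the checksum comparison to confirm the content.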

    <file_creation_date>n/a</file_creation_date>

    descMetadata
      <mods:originInfo>
         <mods:dateCreated>n/a</mods:dateCreated>
      </mods:originInfo>

     

    <file_accessed_date>n/a</file_accessed_date>

    descMetadata
      <mods:originInfo>
         <mods:dateOther>n/a</mods:dateOther>
      </mods:originInfo>

     

    <file_modified_date>12/8/1988 6:48:48 AM (1988-12-08 14:48:48 UTC)</file_modified_date>

    descMetadata
      <mods:originInfo>
        <mods:dateModified>12/8/1988 6:48:48 AM (1988-12-08 14:48:48 UTC)</mods:dateModified>
      </mods:originInfo>
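
The FTK date values combine a local timestamp with a parenthesized UTC timestamp, and may be `n/a`. If the conversion ever needs a machine-readable date, the UTC part can be extracted as sketched below. The mapping above copies the string verbatim, so this normalization is an optional extra rather than part of the documented mapping, and `parse_ftk_date` is a hypothetical name.

```python
import re
from datetime import datetime, timezone

# Matches the parenthesized UTC portion, e.g. "(1988-12-08 14:48:48 UTC)".
UTC_PART = re.compile(r"\((\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC\)")


def parse_ftk_date(value):
    """Return a timezone-aware UTC datetime, or None for 'n/a' or unmatched text."""
    match = UTC_PART.search(value)
    if match is None:
        return None
    parsed = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


modified = parse_ftk_date("12/8/1988 6:48:48 AM (1988-12-08 14:48:48 UTC)")
print(modified.isoformat())
```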

     

    <md5_hash>976EDB782AE48FE0A84761BB608B1880</md5_hash>

    contentMetadata
      <file>
          <checksum type="md5">976EDB782AE48FE0A84761BB608B1880</checksum>
      </file>

    Used for checksum validation of a file during processing. This value will eventually be part of contentMetadata, but probably not as a value transferred from here.
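
Checksum validation during processing can be sketched with the standard library as follows. The function names are illustrative. FTK reports the hash in upper case, so the comparison is made case-insensitive.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_md5):
    """Compare a locally computed MD5 with the value from the FTK report."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For the sample record, `checksum_matches(exported_file, "976EDB782AE48FE0A84761BB608B1880")` would be called with `exported_file` pointing at the file named in `<export_path>`.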

    <restricted>False</restricted>

     

    True = visible to staff only, not discoverable ... Hypatia only

    <type>Books</type>

    Naomi says: the MODS documentation says this is a controlled vocabulary, so this won't work ... http://www.loc.gov/standards/mods/mods-outline.html#typeOfResource
    descMetadata
      <mods:typeOfResource>Books</mods:typeOfResource>


    <title>The Burgess Shale and the Nature of History</title>

    descMetadata
       <mods:relatedItem displayLabel="Appears in" type="host">
          <mods:titleInfo>
             <mods:title>The Burgess Shale and the Nature of History</mods:title>
          </mods:titleInfo>
       </mods:relatedItem>

    This is not the title of the file or the file content directly, but the author's title to which the file relates.

    <filetype>WordPerfect 4.2</filetype>

    descMetadata
       <mods:note displayLabel="filetype">WordPerfect 4.2</mods:note>

     

    <duplicate_File> </duplicate_File>

     

    * blank, null value or empty string - file is unique in collection, no duplicates
    * "M" - The main file in a duplicate relationship. Neither better nor worse than the duplicate file, but simply the file examined first.
    * "D" - indicates a duplicate file.

    Note that this is content duplication based on having the same checksum (name conflicts are different and handled another way). The two files may or may not have the same name.  It is desirable to have a note and/or relationship in each record indicating the presence of a duplicate file in the collection. Details tbd.
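
FTK supplies the `M`/`D` flags itself, but the same checksum-based rule can be reproduced locally, for example to cross-check the report or to find the counterpart of a duplicate when building the note or relationship. The sketch below assumes each record is a dict with an `md5_hash` key and treats the first record encountered with a given checksum as the main file; `flag_duplicates` is a hypothetical helper.

```python
from collections import Counter


def flag_duplicates(records):
    """Return '', 'M' or 'D' for each record, based only on its checksum."""
    counts = Counter(rec["md5_hash"].upper() for rec in records)
    seen = set()
    flags = []
    for rec in records:
        key = rec["md5_hash"].upper()
        if counts[key] == 1:
            flags.append("")       # unique in the collection
        elif key in seen:
            flags.append("D")      # a later copy of an already seen file
        else:
            seen.add(key)
            flags.append("M")      # first file examined with this checksum
    return flags


sample = [{"md5_hash": "AA"}, {"md5_hash": "BB"}, {"md5_hash": "aa"}]
print(flag_duplicates(sample))
```

File names play no part in the comparison, which matches the note above that duplicates may or may not share a name.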

    <export_path>files\BU3A5.wp</export_path>

     

    The file as saved by FTK for further processing.

    (implied)

    RELS-EXT
       isMemberOf

    (see also  <disk_image_no>CM006</disk_image_no> )

    A link to the Media object

    ...