Background

Born-digital collections present a challenge to traditional ways of describing content and making it discoverable. Disks, drives, directories, etc. may contain many thousands of files, which would be difficult to describe in detail in an EAD. The approach described here starts with an EAD that contains only a single reference to the born-digital portion of a collection, expressed as a specific Series within the larger collection. The Forensic Toolkit (FTK) software is used to produce disk images and related analysis files, plus detailed technical information about individual files on the media. The output of this process is transformed, enhanced, and eventually converted into a set of digital objects in DOR (Stanford's Fedora-based Digital Object Registry), as well as shared objects for Hypatia.

A separate process is used for converting the Collection EAD into metadata objects representing the full context of the born-digital and other materials, with links made between the containers and the detailed objects.

The Stanford directory output for the Gould collection contains the EAD and the content and metadata files for both media and file objects:

  • M1437 Gould
    • Computer Media Photo
      • CM001.jpg
      • (etc)
    • Disk Image
      • CM001.001
      • CM001.001.csv
      • CM001.001.txt
      • (etc)
    • Display Derivatives
      • {filename}.htm
    • EAD
    • FTK xml
      • files
        • {filename}
      • Report.fo
      • Report.xml
      • Report_transformed.xml
      • Disk Image

Note that the first two directories (Computer Media Photo and Disk Image) map to objects describing the physical media and will be the source for creating the "unprocessed" collection, while Display Derivatives and the FTK files map to individual file content and description and will be used to create the "processed" collection.

The import/conversion process will produce this arrangement of objects in DOR:

  • Collection object
    • Series set -- "Series 1 ..."
    •    :
    • Series set -- "Series 6: Born Digital Materials"
      • Media object 1
      • Media object 2
      •    :
      • File object 1.1
      • File object 1.2
      •    :
      • File object 2.1
      • File object 2.2
      •    :

Note that even though the files originate on specific media, the "Media" objects are not sets in the DOR/Hydra sense of simple object aggregation. Instead, the file/media relationship is considered just one of many possible intellectual arrangements that can be expressed in metadata. A RELS-EXT relationship (hydra:isLocatedOn, onSourceMedia???) and a MODS <location> (for humans) express the file and media relationship, allowing this logical view:

  • Series set -- "Series 6: Born Digital Materials"
    • Media object 1
      • File object 1
      • File object 2
      •    :
    • Media object 2
    •    :

... while descriptive elements like <type> allow other logical arrangements of the files based on the nature of their content.
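As a rough illustration of how one flat set of file objects can yield several logical views, the sketch below groups hypothetical file records first by their media link and then by <type>. The record layout, the "Articles" value, and the `arrange` helper are assumptions made for illustration; in DOR the media link would come from the RELS-EXT relationship rather than a dictionary key.

```python
from collections import defaultdict

# Hypothetical flat records: every file object is a direct member of the
# series. "media" stands in for the RELS-EXT link, "type" for <type>.
FILE_OBJECTS = [
    {"id": "File object 1.1", "media": "Media object 1", "type": "Books"},
    {"id": "File object 1.2", "media": "Media object 1", "type": "Articles"},
    {"id": "File object 2.1", "media": "Media object 2", "type": "Books"},
]


def arrange(file_objects, key):
    """Group file object ids by the value of one metadata field."""
    view = defaultdict(list)
    for record in file_objects:
        view[record[key]].append(record["id"])
    return dict(view)


# Two logical arrangements derived from the same flat membership.
by_media = arrange(FILE_OBJECTS, "media")
by_type = arrange(FILE_OBJECTS, "type")
```

Because the grouping is computed from metadata, adding another arrangement only needs another field, not another set object.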

...

| Information from: FTK xml // Report_transformed.xml | maps to (all within item objects) | notes |
|---|---|---|
| <filename>BU3A5</filename> | n/a | This is the original file name as it appeared on the original media. |
| <Item_Number>1004</Item_Number> | n/a | Internal reference only, to disambiguate references in the FTK report. |
| <filepath>CM006.001/NONAME [FAT12]/[root]/BU3A5</filepath> | | Original file in FTK xml // files. Display derivatives in Display Derivatives are named using <Item_Number>. Take the object filepath for the fully qualified object filename from the portion after [root], up to but not including the final filename token. |
| <disk_image_no>CM006</disk_image_no> | descMetadata <mods:location> (1) | This token, taken from the head of the <filepath>, is the only data link between the FTK output for a file object and the corresponding media object. We want a data link in descriptive metadata as well as an RDF link to the corresponding object. |
| <filesize>35654</filesize> | | Could be used by conversion to compare against the file size as computed locally, a quick check prior to checksum validation? |
| <filesize_unit>B</filesize_unit> | | Needed to correctly interpret <filesize>, if used. |
| <file_creation_date>n/a</file_creation_date> | note? | |
| <file_accessed_date>n/a</file_accessed_date> | note? | |
| <file_modified_date>12/8/1988 6:48:48 AM (1988-12-08 14:48:48 UTC)</file_modified_date> | note? | |
| <MD5_Hash>976EDB782AE48FE0A84761BB608B1880</MD5_Hash> | | Used for checksum validation of a file during processing. This value will eventually be part of contentMetadata. |
| <restricted>False</restricted> | | true = visible to staff only, not discoverable ... Hypatia only |
| <medium>5.25 inch Floppy Disks</medium> | | Part of <location> (1) |
| <type>Books</type> | descMetadata <mods:subject> <mods:topic> | <topic> or <genre>? authority? |
| <title>The Burgess Shale and the Nature of History</title> | descMetadata <mods:title> | |
| <filetype>WordPerfect 4.2</filetype> | note? | |
| <Duplicate_File> </Duplicate_File> | | Blank, null value, or empty string: original file, not a duplicate. "Primary": possibly indicates the primary file to keep/store. "Secondary": indicates a duplicate file to be ignored. Ignore for now. |
| <export_path>files\BU3A5.wp</export_path> | | The file as available for the DOR object. Note it may have a file extension added by FTK. |
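The path handling described for <filepath> and <export_path> could be sketched roughly as follows. The function names are hypothetical, and two points are assumptions rather than documented rules: that the disk image number is the head token with its ".001" suffix removed, and that a path lacking a [root] token carries no folder information. The second sample path, with its DRAFTS/CH1 folders, is invented for illustration.

```python
from pathlib import PureWindowsPath


def split_ftk_filepath(filepath):
    """Split an FTK <filepath> into (disk_image_no, folder, filename)."""
    tokens = filepath.split("/")
    # Assumption: "CM006.001" -> "CM006" (drop the image segment suffix).
    disk_image_no = tokens[0].split(".")[0]
    filename = tokens[-1]
    if "[root]" in tokens:
        start = tokens.index("[root]") + 1
    else:
        # Assumption: without [root] there is no folder portion to keep.
        start = len(tokens) - 1
    # Portion after [root], up to but not including the final filename token.
    folder = "/".join(tokens[start:-1])
    return disk_image_no, folder, filename


def export_info(filename, export_path):
    """Return (POSIX-style export path, extension FTK appears to have added)."""
    path = PureWindowsPath(export_path)
    if path.name.startswith(filename + "."):
        added = path.name[len(filename):]
    else:
        added = ""
    return path.as_posix(), added


sample = split_ftk_filepath("CM006.001/NONAME [FAT12]/[root]/BU3A5")
nested = split_ftk_filepath("CM006.001/NONAME [FAT12]/[root]/DRAFTS/CH1/BU3A5")
exported = export_info("BU3A5", r"files\BU3A5.wp")
```

For the sample record the folder portion is empty because the file sits directly under [root]; the disk image number is what ties the file object back to its media object.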

...
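The size check followed by checksum validation suggested in the notes for <filesize> and <MD5_Hash> might look like the sketch below. The `verify_file` and `md5_hex` names are hypothetical, and treating any unit other than "B" as an error is an assumption, since only "B" appears in the sample record.

```python
import hashlib
import os


def md5_hex(path, chunk_size=65536):
    """Compute the MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, filesize, filesize_unit, md5_hash):
    """Cheap size comparison first, then the MD5 comparison."""
    if filesize_unit != "B":
        # Assumption: only byte counts are handled in this sketch.
        raise ValueError("unsupported filesize_unit: %r" % filesize_unit)
    if os.path.getsize(path) != int(filesize):
        return False  # size mismatch, no need to hash the file
    # FTK reports the hash in upper case; compare case-insensitively.
    return md5_hex(path).upper() == md5_hash.strip().upper()
```

A call such as `verify_file(exported_file, "35654", "B", "976EDB782AE48FE0A84761BB608B1880")` would then apply both checks to the sample record, where `exported_file` is the local path resolved from <export_path>.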