Overview
Hypatia aims to manage the born-digital components of archival collections effectively. Such collections are traditionally described by Finding Aids in the form of EADs (Encoded Archival Descriptions). There is an established body of practice for describing the physical components of these collections at many levels of detail, including the physical artifacts (e.g., diskettes) associated with born-digital materials. Describing and delivering the digital content itself poses many challenges. The phrase "unprocessed collection", when applied to born-digital materials, means that description extends only down to the physical media, with a requirement to associate digital artifacts, such as a raw binary disk image and photos, with that description. The phrase "processed collection" means there is more detailed description, down to the file level, supporting an intellectual arrangement of the content as well as access to the individual files.
This analysis addresses both processed and unprocessed collections, assuming the Hypatia solution of mapping an EAD into Hydra/Fedora objects, then managing born-digital content as Fedora objects through the Hypatia interface.
Processing born-digital archival materials into Hypatia involves several discrete considerations.
- Translating an EAD that describes an archival collection as a whole into well-formed Hydra/Fedora objects
  - Creating the set object that represents the entire collection
  - Creating intermediate set objects that represent the EAD hierarchy
  - Creating "item" objects that represent nodes in the hierarchy that describe individual objects
- Translating item-level information into well-formed (per Hydra) Fedora objects
  - Required datastreams and child objects
  - EAD to MODS mapping
- Addressing the born-digital materials
  - Recognizing what in the EAD describes "Born digital" content
  - Enriching the digital objects with file-level information
EAD hierarchical structure mapping and content mapping
This section addresses how EADs map into Hypatia set and item objects. See Hypatia content and layout for Sets and Items for more details.
Skeletal structure of sample EADs:
Site | Collection | EAD structure / location of born digital materials [type of hydra object] | notes |
---|---|---|---|
Hull | Gallagher | collection [set] | |
Hull | Socialist Health Assoc. | | |
Stanford | Xanadu | collection [set] | 1. Target "born digital" sub-level identified by <unittitle> |
Stanford | Gould | collection -- unittitle "Stephen Jay Gould papers" unitid: M1437 [set] | EAD only goes down to the single "Born digital" series description, with no details expressed at lower levels. A rationalized directory structure and FTK output are intended to support a direct translation into Hypatia objects for both unprocessed and processed views without an intermediary EAD. |
Virginia | Cheuse | | |
Yale | Conn. Oral Histories | | |
Yale | Love Makes a Family | | |
Yale | Pelli | | |
Yale | Tobin | collection [set] | 1. Target sub-level identified by <unittitle> |
Yale | Turner | | |
Yale | Welch | | |
Assumptions:
- The entire EAD will be mapped into Hypatia digital objects representing both digital content and their physical and intellectual arrangement as expressed in the EAD hierarchy.
- Non-born-digital content
  - The non-born-digital portions of the EAD are converted to Hypatia objects to provide context to the born-digital objects.
  - The ongoing relationship of those containing parts of the EAD as managed in Archivist Toolkit or elsewhere -- if or when Hypatia assumes management of that information as well -- is outside the scope of this document.
- Born-digital content
  - Hypatia is intended to be the source of this information and the place where it is managed.
  - Full (re)construction of an EAD from the Hypatia information is not an explicit goal.
More assumptions:
- The EAD header and front matter (publication statement, EAD profile, etc.) will not be converted, nor will the slim <dsc> and optional <head> information that typically separates the Collection description from levels below.
  - The entire EAD will be copied as an artifact associated with the Collection.
- The rest of the EAD, from <archdesc> down, will be consumed in its entirety and converted to a set of Fedora objects corresponding to each level, down to the lowest node.
  - Terminal nodes are treated as digital objects, an "item" in Hydra and Stanford terms.
- The vocabulary of "levels" is remembered but otherwise irrelevant to the interpretation of collection, sets and items unless they explicitly aid in the mapping for a specific collection or site.
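The assumptions above can be sketched as a small tree walk. This is a minimal illustration, not the Hypatia conversion code: the sample EAD is invented, it assumes an EAD without XML namespaces, and the names `is_component`, `child_components`, and `to_objects` are hypothetical. It skips the header, looks through the thin <dsc> wrapper, remembers each node's level, and treats terminal nodes as items.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed EAD used only for illustration.
SAMPLE_EAD = """
<ead>
  <eadheader><eadid>sample</eadid></eadheader>
  <archdesc level="collection">
    <did><unittitle>Sample papers</unittitle></did>
    <dsc>
      <head>Collection Contents</head>
      <c01 level="series">
        <did><unittitle>Series 1: Born Digital Materials</unittitle></did>
        <c02 level="item">
          <did><unittitle>Floppy disk 1</unittitle></did>
        </c02>
      </c01>
    </dsc>
  </archdesc>
</ead>
"""


def is_component(elem):
    """True for <c> and for numbered component tags such as <c01>."""
    tag = elem.tag
    return tag == "c" or (len(tag) == 3 and tag[0] == "c" and tag[1:].isdigit())


def child_components(node):
    """Direct sub-components, looking through the <dsc> wrapper (which is not converted)."""
    found = []
    for child in node:
        if is_component(child):
            found.append(child)
        elif child.tag == "dsc":
            found.extend(c for c in child if is_component(c))
    return found


def to_objects(node, is_collection=False):
    """Nested dicts standing in for Fedora objects: sets contain children, terminal nodes are items."""
    children = [to_objects(c) for c in child_components(node)]
    return {
        "kind": "set" if (children or is_collection) else "item",
        "level": node.get("level"),  # remembered, but not used for interpretation here
        "title": node.findtext("did/unittitle", default="").strip(),
        "children": children,
    }


root = ET.fromstring(SAMPLE_EAD)
# Start at <archdesc>; the <eadheader> is never visited.
collection = to_objects(root.find("archdesc"), is_collection=True)
```

A real conversion would create Fedora objects instead of dictionaries; the dictionaries only make the resulting hierarchy easy to inspect.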
Sets vs items
A "collection" in Hypatia is the primary set established by the information at the initial <archdesc> level of the EAD, regardless of its "level" designation, e.g.,
- collection -- Stanford, Virginia, Yale
- fonds -- used by Hull
Intermediate levels defined by the hierarchical arrangement of <c> or <c0n> tags form a structural hierarchy of Hypatia sets. The "level" vocabulary, while remembered, is not itself significant except as an aid in determining the boundaries between sets and individual Hypatia objects. In practice, the level might be any one of
- class
- recordgrp
- series
- subfonds
- subgrp
- subseries
- file
- otherlevel
In general, "item" level entries in an EAD will map to individual Hypatia objects. The "file" level may also be considered an item node for a specific EAD. Unless otherwise indicated for a specific conversion, other levels will translate to set objects in Hypatia, even if they are empty sets because the EAD had no item-level description.
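As a rough sketch of this rule, the function below decides whether a component becomes a set or an item from its "level" attribute alone. The function name `classify_level` and the `file_is_item` switch are hypothetical; the switch stands in for the per-EAD decision about "file" nodes described above.

```python
def classify_level(level, file_is_item=False):
    """Return "item" or "set" for an EAD component's level attribute.

    file_is_item models the per-conversion choice to treat "file" nodes as items.
    Every other level (series, subseries, otherlevel, ...) becomes a set, even if
    the set ends up empty.
    """
    normalized = (level or "").strip().lower()
    if normalized == "item":
        return "item"
    if normalized == "file" and file_is_item:
        return "item"
    return "set"


kinds = [classify_level(lvl) for lvl in ("series", "subseries", "file", "item")]
```

Keeping the "file" decision as an explicit parameter makes the per-collection exception visible at the call site rather than burying it in the mapping.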
Stanford FTK-backed Digital Object creation
Stanford uses the Forensic Toolkit (FTK) software to analyze and characterize the contents of computer media. Starting with the Gould collection, we will only provide a single series node in the EAD to represent "Born Digital Materials". Conversion routines will be able to auto-generate objects representing the unprocessed collection (the media artifacts themselves, e.g., hard drives and floppy discs) as well as detailed file content objects from a modified form of the FTK output. See Stanford FTK to Hypatia object mapping.
EAD-to-MODS - general information
There is a scarcity, dare I say dearth, of tools that perform this mapping or offer an existing implemented conversion. The mappings here are therefore based on data encountered in the Hypatia sample EADs and can be augmented as we go along. They are informed by two sources:
- DLF Aquifer Guidelines for Shareable MODS Records: EAD to Aquifer MODS Crosswalk
- Bountouri, L., & Gergatsoulis, M. Interoperability Between Archival and Bibliographic Metadata : An EAD to MODS Crosswalk, 2009.
Assumption: The working assumption is that all descriptive metadata for collections, intermediate sets (levels), and digital objects will be MODS.
Issue: the mapping from EAD to MODS is not perfect and is not fully reversible:
- The original EAD header, while preserved along with the original EAD, is not part of the makeup of the resulting Digital Objects and would not change.
- While no metadata information should be lost, it may not be mappable back to its original form of expression if multiple styles or patterns are reconciled into one, e.g., implicit vs explicit labels based on <head> elements.
- There are artifacts of EAD markup, such as qualifying element attributes, that have no corresponding place in MODS and will not be brought over unless further accommodation is made.
Issue: The EAD schema makes extensive use of complex XML types with mixed content. This is a pattern where an XML element contains a free mix of text and other sub-elements.
- They can be used as a form of entity markup, strongly typing references within a longer block of text:
Example | As rendered in browser | From | Issue | Action |
---|---|---|---|---|
<titleproper>Stephen J. Gould papers <num>M1437</num> </titleproper> | Stephen J. Gould papers M1437 | Stanford/Gould | entity markup disappears for display; would be visible and viable for editing? | Strip embedded markup |
<langmaterial label="Language(s):">Chiefly in <language langcode="eng" scriptcode="Latn">English</language>; some materials in <language langcode="fre" scriptcode="Latn">French</language>.</langmaterial> | Chiefly in English; some materials in French. | Yale/Welch | ibid | Strip embedded markup |
<unittitle> <title render="italic">The Panda's Thumb</title>, galley proof, Chapters 22-31 </unittitle> | , galley proof, Chapters 22-31 | Stanford/Gould | <title> tag sets browser window title; is ignored as part of overall text | Strip out embedded <title> markup |
- Complex elements in EADs can also be used for display markup:
Tag | Example | As rendered in browser | Found in | Issue | Action |
---|---|---|---|---|---|
<p> | <scopecontent><p>Original series of 4 episodes ...</p> <p>SG was series creator and writer ...</p> <p>Feature-length pilot and series opener ...</p></scopecontent> | Original series of 4 episodes ... SG was series creator and writer ... Feature-length pilot and series opener ... | everywhere | Works great, but embedded markup is not desirable | Drop initial <p> and trailing </p>; otherwise retain <p> markup for short term convenience? It would have to be encoded (e.g., &lt;) and reinterpreted on output. |
<head> | <bioghist id="ref141"> <head>Biography</head> <p>When five-year-old Stephen Jay Gould ....</p> | Biography When five-year-old Stephen Jay Gould ... | everywhere | Heading displayed with text; treating them as labels is preferred | Turn <head> into displayLabel attribute in corresponding MODS fields where possible. |
<blockquote> | none so far | | | | |
<emph> | <unittitle>Yale University <emph render="smcaps">(restricted until January 1, 2024)</emph> </unittitle> | Yale University (restricted until January 1, 2024) | Yale (numerous) | non-html markup, ignored/lost | strip out? |
<list> | <arrangement id="ref7"> ... <list type="ordered"> <item> <ref target="ref11" ns2:type="simple" ns2:actuate="onRequest" ns2:show="replace">Inventory</ref> </item> <item> <ref target="ref92" ns2:type="simple" ns2:actuate="onRequest" ns2:show="replace">Accession 2003-M-005</ref> </item> <item> <ref target="ref123" ns2:type="simple" ns2:actuate="onRequest" ns2:show="replace">Accession 2004-M-088</ref> </item> </list> </arrangement> | Inventory Accession 2003-M-005 Accession 2004-M-088 | Virginia:Cheuse <frontmatter>; Yale:Tobin <archdesc> | non-html markup, ignored/lost | Convert data to comma separated list |
<table> | <table frame="none"> <tgroup cols="3"> <colspec colnum="1" colname="1" align="left" colwidth="50pt"/> <colspec colnum="2" colname="2" align="left" colwidth="50pt"/> <thead> <row> <entry colname="1">Family Member</entry> <entry colname="2">Spouse</entry> </row> </thead> <tbody> <row> <entry colname="1">John Albee</entry> <entry colname="2">Mary Delaney</entry> </row> </tbody> </tgroup> </table> | Family Member Spouse John Albee Mary Delaney | none (example from EAD site) | non-html markup, ignored/lost | convert to html <table>? (defer until encountered?) See EAD specs for tabular display |
<address> | <repository label="Repository:"> <corpname>Manuscripts and Archives</corpname> <address> <addressline>Sterling Memorial Library</addressline> <addressline>128 Wall Street</addressline> <addressline>P.O. Box 208240</addressline> <addressline>New Haven, CT 06520</addressline> <addressline altrender="email">Email: mssa.faq@yale.edu</addressline> <addressline altrender="phone">Phone: (203) 432-1735</addressline> <addressline altrender="fax">Fax: (203) 432-7441</addressline> </address> </repository> | Manuscripts and Archives Sterling Memorial Library 128 Wall Street P.O. Box 208240 New Haven, CT 06520 Email: mssa.faq@yale.edu Phone: (203) 432-1735 Fax: (203) 432-7441 | Stanford (frontmatter); Yale (archdesc) | | ignore <address> in initial conversion |
<bibref> | <bibliography encodinganalog="3.5.4"> <bibref>HH Eckstein, The English health service (Harvard, 1959) JE Pater, The making of the National Health Service (London, 1981) John Stewart (1878-1967), Oxford Dictionary of Biography, Oxford, 2004</bibref> </bibliography> | HH Eckstein, The English health service (Harvard, 1959) JE Pater, The making of the National Health Service (London, 1981) John Stewart (1878-1967), Oxford Dictionary of Biography, Oxford, 2004 | Hull:Socialist <frontmatter> | Implied line breaks are ignored/lost | Defer; not in converted data |
<title> | <unittitle> <title render="italic">The Panda's Thumb</title>, galley proof, Chapters 22-31 </unittitle> | , galley proof, Chapters 22-31 | Stanford:Gould (numerous); Virginia:Cheuse (numerous); Yale:(several) (numerous) | <title> tag sets browser window title; is ignored as part of overall text | Strip out embedded <title> markup |
We will refer below to this set of embedded-element translations as "embedded element conversion".
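The simplest form of embedded element conversion, stripping the embedded markup while keeping its text, might look like the sketch below. It assumes the EAD fragment has already been parsed with Python's standard `xml.etree.ElementTree`; the helper name `flatten_text` is hypothetical, and collapsing runs of whitespace is a choice made for this sketch rather than a stated rule.

```python
import xml.etree.ElementTree as ET


def flatten_text(elem):
    """Concatenate all text inside elem, dropping the tags and collapsing whitespace."""
    return " ".join("".join(elem.itertext()).split())


# Fragments taken from the sample EADs discussed above.
unittitle = ET.fromstring(
    "<unittitle>\n"
    '<title render="italic">The Panda\'s Thumb</title>, galley proof, Chapters 22-31\n'
    "</unittitle>"
)
langmaterial = ET.fromstring(
    '<langmaterial label="Language(s):">Chiefly in '
    '<language langcode="eng" scriptcode="Latn">English</language>; some materials in\n'
    '<language langcode="fre" scriptcode="Latn">French</language>.</langmaterial>'
)

title_text = flatten_text(unittitle)
language_text = flatten_text(langmaterial)
```

Unlike the browser rendering, which drops the content of the embedded <title>, this approach keeps "The Panda's Thumb" as part of the resulting text.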
Note that for Stanford, a lossy translation is not an issue as long as affected parts of the EAD are still sourced and maintained externally, e.g., in Archivist Toolkit. Eventually a transition away from the EAD support for markup will have to be addressed.
Issue: Tags that have no mapping into MODS
With one exception, we will map these into Notes, using displayLabel to let them appear with specific labels in the Hypatia display.
- <scopecontent> -- map to MODS <abstract> per DLF Guidelines.
- <bioghist> -- map to MODS <note>
- <custodhist> -- map to MODS <note>
- <relatedmaterial> -- map to MODS <note>
- <otherfindaid> -- map to MODS <note>
- <bibliography> -- map to MODS <note>
- <processinfo> -- map to MODS <note>
Conversion rule (Stanford): Use of <head> at the beginning of text fields as a labeling convention ...
Code Block |
---|
<scopecontent id="ref13"> <head>Collection Scope and Content Summary</head> <p>The collection includes files from XOC, VHS tapes, and Drexler drafts and galley proofs.</p> </scopecontent> <bioghist id="ref11"> <head>Biography</head> <p>Keith Henson and his wife Arel Lucas founded XOC (Xanadu Operating Company).</p> </bioghist> |
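A minimal sketch of how such a field could be converted, assuming the leading <head> becomes the MODS displayLabel and the remaining paragraphs become the element text. The function names are hypothetical, joining paragraphs with a blank line is an assumption of this sketch, and only the <scopecontent> exception to the <note> mapping is handled.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def plain_text(elem):
    """Text content of elem with embedded markup stripped and whitespace collapsed."""
    return " ".join("".join(elem.itertext()).split())


def text_field_to_mods(ead_elem):
    """Map an EAD text field to mods:abstract (<scopecontent>) or mods:note (everything else)."""
    target = "abstract" if ead_elem.tag == "scopecontent" else "note"
    mods_elem = ET.Element(f"{{{MODS_NS}}}{target}")
    head = ead_elem.find("head")
    if head is not None:
        # The <head> is treated as a label rather than as part of the text.
        mods_elem.set("displayLabel", plain_text(head))
    # Assumption: paragraphs are joined with a blank line; <p> markup is not retained.
    mods_elem.text = "\n\n".join(plain_text(p) for p in ead_elem.findall("p"))
    return mods_elem


scope = ET.fromstring(
    '<scopecontent id="ref13"><head>Collection Scope and Content Summary</head>'
    "<p>The collection includes files from XOC, VHS tapes, and Drexler drafts and galley proofs.</p>"
    "</scopecontent>"
)
bio = ET.fromstring(
    '<bioghist id="ref11"><head>Biography</head>'
    "<p>Keith Henson and his wife Arel Lucas founded XOC (Xanadu Operating Company).</p>"
    "</bioghist>"
)
abstract = text_field_to_mods(scope)
note = text_field_to_mods(bio)
```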
Issue: Tag attributes that have no mapping into MODS
These are numerous and will not be enumerated in full. Some examples:
Panel |
---|
<repository encodinganalog="3.1.2">Hull University Archives</repository> <unitid encodinganalog="3.1.1" label="Reference" countrycode="GB" repositorycode="50">U DGA/1/2/5/a</unitid> <unitid label="Call Number:" countrycode="US" repositorycode="US-CtY">MS 1746</unitid> <physdesc encodinganalog="3.1.5" label="Extent"> <accessrestrict id="ref5"> ... <origination label="creator"> <persname rules="aacr" source="naf">Shearer, Rhonda Roland, 1954- </persname> <unitdate normal="1951/1996" type="inclusive" calendar="gregorian" era="ce">1951-1996</unitdate> <note type="bpg"> |
Conversion rule: Attributes not specifically targeted for conversion will be ignored/lost.
Issue: retaining ref and level information, do these map to appropriate container descriptions?
Issue: "otherlevel" levels -- <c level="otherlevel" otherlevel="SubSeries"> (Hull)
Issue: Stanford <container> conventions and mapping into a MODS "Location" note (revised 10/24/11 to split out Collection title in item record and nest this information in a relatedItem):
We will create a concise representation of the physical/logical location (as appropriate) of the materials in the context of the collection and its hierarchy. It will be a MODS <relatedItem><physicalLocation type="location">. It will be a concatenation of the following information:
- Series and subseries names etc if present -- e.g., Series 6: Born Digital Materials
- The container type (box, map case, etc) and ID -- e.g., Box 11
- A sub-container type + value, down to the level of the item -- e.g., Folder 3
Assembles as "Series 6: Born Digital Materials - Box 11 - Folder 3"
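The concatenation can be sketched as follows. The helper names are hypothetical, the ancestor series titles are passed in explicitly because they live higher up in the EAD hierarchy, and the <did> fragment is an invented example rather than data from a sample EAD.

```python
import xml.etree.ElementTree as ET


def containers_from_did(did):
    """Return (type, value) pairs for each <container> in a <did>, in document order."""
    return [
        (c.get("type", "container"), (c.text or "").strip())
        for c in did.findall("container")
    ]


def location_label(ancestor_titles, containers):
    """Join series/subseries titles and container designations with " - "."""
    parts = list(ancestor_titles)
    parts += [f"{ctype.capitalize()} {value}" for ctype, value in containers]
    return " - ".join(parts)


# Hypothetical <did> for an item held in box 11, folder 3.
did = ET.fromstring(
    '<did><container type="box">11</container><container type="folder">3</container></did>'
)
label = location_label(["Series 6: Born Digital Materials"], containers_from_did(did))
```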
Is this generalizable, across Stanford collections? across institutions?
Examples:
Collection | EAD | MODS |
---|---|---|
Gould | <c id="ref432" level="file"> | <mods:relatedItem type="host"> |
Henson | <c id="ref50" level="item"> | <mods:relatedItem type="host"> |
Issue: Derived <mods:location> information
Where all item objects are derived from FTK information about files in a directory, how is this logical/physical location information assembled and presented?
Collection | FTK | MODS |
---|---|---|
Gould | | <mods:relatedItem type="host"> |
Issue: Recursively nested <descgrp>
Panel |
---|
Virginia, Yale: <descgrp id="ai" type="admininfo"> |
So far, other, more complex examples have not been found in the samples, e.g., nested <bioghist> to partition a biography with separate headings.
Conversion: Ignore the wrapping <descgrp> of type="admininfo"
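One way to ignore the wrapper while still reaching its contents is a small recursive generator, sketched below. The function name `unwrap_descgrp` is hypothetical, the sample fragment is invented, and dropping the wrapper's own <head> along with the wrapper is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET


def unwrap_descgrp(parent, _inside_wrapper=False):
    """Yield the children of parent, looking through <descgrp type="admininfo"> at any depth."""
    for child in parent:
        if child.tag == "descgrp" and child.get("type") == "admininfo":
            yield from unwrap_descgrp(child, True)
        elif _inside_wrapper and child.tag == "head":
            continue  # assumption: the wrapper's heading is discarded with the wrapper
        else:
            yield child


# Hypothetical fragment with a recursively nested wrapper.
archdesc = ET.fromstring(
    "<archdesc>"
    '<descgrp id="ai" type="admininfo">'
    "<head>Administrative Information</head>"
    '<prefercite id="ref6"><p>Preferred citation text</p></prefercite>'
    '<descgrp type="admininfo">'
    "<accessrestrict><p>Access restriction text</p></accessrestrict>"
    "</descgrp>"
    "</descgrp>"
    "</archdesc>"
)
unwrapped_tags = [elem.tag for elem in unwrap_descgrp(archdesc)]
```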
EAD-to-MODS mapping for an individual item
Note: this specifies the conversion from EAD metadata to MODS for an individual item. The conversion should use a "mods" namespace declaration and qualified tags, e.g.,
Code Block |
---|
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="3.3" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-3.xsd"> <mods:titleInfo> <mods:title>Keith Henson. Papers relating to Project Xanadu, XOC and Eric Drexler</mods:title> </mods:titleInfo> |
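For illustration, a record with the "mods" prefix and qualified tags could be produced with Python's standard library as sketched below. The function name `new_mods_record` is hypothetical; the namespace URIs, version, and schema location are copied from the example above.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# Ask ElementTree to serialize with the "mods" and "xsi" prefixes instead of ns0, ns1, ...
ET.register_namespace("mods", MODS_NS)
ET.register_namespace("xsi", XSI_NS)


def new_mods_record(title):
    """Create a mods:mods element holding a single mods:titleInfo/mods:title."""
    mods = ET.Element(
        f"{{{MODS_NS}}}mods",
        {
            "version": "3.3",
            f"{{{XSI_NS}}}schemaLocation": (
                "http://www.loc.gov/mods/v3 "
                "http://www.loc.gov/standards/mods/v3/mods-3-3.xsd"
            ),
        },
    )
    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = title
    return mods


record = new_mods_record(
    "Keith Henson. Papers relating to Project Xanadu, XOC and Eric Drexler"
)
xml_text = ET.tostring(record, encoding="unicode")
```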
EAD element | MODS element | Notes | Example |
---|---|---|---|
<unittitle> | <titleInfo> | • Requires embedded element conversion | |
<origination> | <name type="..."> | • EAD/persname maps to MODS <name type="personal"> | <origination label="creator"> |
<repository> | <name> | Map <repository><corpname> to | <repository> |
No corresponding EAD element | <typeOfResource> | For any Hypatia set created, create an entry indicating a collection. | <mods:typeOfResource collection="yes"/> |
<controlaccess> | <genre> | • EAD origination source attribute maps to MODS/genre authority attribute | <controlaccess> |
<unitdate> | <originInfo> | If only one <unitdate> is present for a <did>, add attribute keyDate="yes". If there is more than one <unitdate>, add keyDate="yes" only if EAD type="inclusive". | <mods:originInfo> |
<langmaterial> | <language> | For <langmaterial> | <langmaterial label="Language(s):">The materials are in <language langcode="eng" scriptcode="Latn">English</language>.</langmaterial> |
No corresponding EAD element | <physicalDescription> | Add a "born digital" indication only for the born digital items in the collection, else omit. | <mods:physicalDescription> |
<physdesc> | <physicalDescription> | • Each EAD <extent> subelement will become a MODS/extent element | <physdesc> |
<abstract> or <scopecontent> | <abstract> | Map EAD label attribute to MODS displayLabel attribute | <abstract label="Summary:">The papers consist of correspondence, subject files, and writings, primarily documenting the professional career and personal life of James Tobin as an economist and educator.</abstract> |
<descgrp> | <note> | • Requires embedded element conversion | <prefercite id="ref6"> |
<arrangement> | <tableOfContents> | Mapping per DLF guidelines, with default displayLabel of "Arrangement". | <arrangement id="ref206"> |
No corresponding EAD element | <targetAudience> | mapping not applied to sample EADs | |
<odd> | <note> | not found in sample EADs | |
<controlaccess> with | <subject> with | Mappings of EAD <controlaccess> subelements to MODS's <subject> subelements: | <controlaccess> |
No corresponding EAD element | <classification> | No mapping in samples | |
No corresponding EAD element | <relatedItem> | No mapping in samples | |
<unitid> | <identifier> | • All mapped to identifier of type=unitid | <unitid>M1437</unitid> |
No corresponding EAD element | <location><url> | No candidate sample data, though conversions could provide useful additions for born digital materials | |
<accessrestrict> | <accessCondition> | • Requires embedded element conversion | <accessrestrict id="ref5713"> |
<userestrict> | <accessCondition> | • Requires embedded element conversion | <userestrict id="ref5"> |
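As one worked example from the table, the <unitdate> key-date rule could be sketched as below. Two points are assumptions: the choice of mods:dateCreated as the date element inside mods:originInfo (the table's example is truncated), and the function name `origin_info_from_did`. The MODS schema spells the attribute keyDate.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def origin_info_from_did(did):
    """Build mods:originInfo from the <unitdate> elements of one <did>."""
    unitdates = did.findall("unitdate")
    origin = ET.Element(f"{{{MODS_NS}}}originInfo")
    for unitdate in unitdates:
        # Assumption: every <unitdate> becomes a mods:dateCreated.
        date = ET.SubElement(origin, f"{{{MODS_NS}}}dateCreated")
        date.text = (unitdate.text or "").strip()
        # A lone date is always the key date; otherwise only inclusive dates are.
        if len(unitdates) == 1 or unitdate.get("type") == "inclusive":
            date.set("keyDate", "yes")
    return origin


# Hypothetical <did> with an inclusive and a bulk date.
did = ET.fromstring(
    "<did>"
    '<unitdate normal="1951/1996" type="inclusive">1951-1996</unitdate>'
    '<unitdate type="bulk">1970-1990</unitdate>'
    "</did>"
)
dates = list(origin_info_from_did(did))
```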