Story: An object in Fedora-Thing represents an absorption spectrometer measurement. Fedora-Thing manages an image and a numerical dataset for the object. The object also has something analogous to an external datastream representing the current state of metadata description of the object; this is maintained in an external system.

In this external system, a researcher goes through the collection by date and timestamp, selects the objects generated by the same experiment, and assigns to all of them an investigation identifier, a project identifier, and her name as creator, thus updating the pre-existing metadata records. Each experiment typically comprises several hundred to several thousand objects. When these data are requested through Fedora, Fedora retrieves them from the external system, constructing the URL as needed, much like the existing Fedora service URL substitutions.
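The URL construction described above could look roughly like the following. This is a minimal sketch, not Fedora's actual substitution mechanism: the template, the host `metadata.example.org`, and the placeholder names `pid` and `dsid` are all assumptions made for illustration.

```python
from urllib.parse import quote

# Hypothetical URL template for the external metadata system.
# The host, path layout, and placeholder names are illustrative assumptions.
METADATA_URL_TEMPLATE = "https://metadata.example.org/records/{pid}/{dsid}"


def build_external_url(pid: str, dsid: str,
                       template: str = METADATA_URL_TEMPLATE) -> str:
    """Substitute percent-encoded identifiers into the URL template."""
    return template.format(pid=quote(pid, safe=""), dsid=quote(dsid, safe=""))


# The colon in the object identifier is percent-encoded as %3A.
print(build_external_url("spectro:1234", "DESCRIPTION"))
```

Percent-encoding the identifiers keeps characters such as the colon in an object identifier from being misread as part of the URL structure.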

After enriching the whole collection in that way, the researcher searches for absorption spectra with certain characteristics (wavelength and curvature of peaks), and tags all objects in the result set as “good measurement”. A result set may contain dozens to thousands of objects. The external system can have these changes archived as a version of the data by submitting a PUT to the relevant object-datastream resources, or it can POST a list of object-datastream resources to the repository resource to achieve the same in batch.
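The two archiving paths described above (one PUT per object-datastream resource, or one POST listing many of them) can be sketched as plain request descriptions. The repository base URL, the resource path layout, and the JSON body with a `resources` key are assumptions for illustration only; the story does not define a payload format.

```python
import json

# Hypothetical repository base URL; not taken from the use case.
REPOSITORY_BASE = "https://repo.example.org/fedora"


def datastream_uri(pid: str, dsid: str, base: str = REPOSITORY_BASE) -> str:
    """Address of one object-datastream resource (path layout is assumed)."""
    return f"{base}/objects/{pid}/datastreams/{dsid}"


def single_version_request(pid: str, dsid: str, base: str = REPOSITORY_BASE) -> dict:
    """Describe a PUT asking the repository to version one datastream."""
    return {"method": "PUT", "url": datastream_uri(pid, dsid, base), "body": None}


def batch_version_request(pids: list, dsid: str, base: str = REPOSITORY_BASE) -> dict:
    """Describe one POST to the repository resource covering many datastreams."""
    resources = [datastream_uri(pid, dsid, base) for pid in pids]
    return {"method": "POST", "url": base, "body": json.dumps({"resources": resources})}


# A result set of three objects tagged "good measurement".
tagged = ["spectro:1", "spectro:2", "spectro:3"]
request = batch_version_request(tagged, "DESCRIPTION")
print(request["method"], request["url"])
print(request["body"])
```

For result sets in the thousands, the batch form replaces thousands of round trips with a single request, which is the main reason to offer it alongside the per-resource PUT.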

This arrangement may imply use of externally unique identifiers (e.g. UUIDs) and conventions for addressing services (e.g. ARK, https://wiki.ucop.edu/display/Curation/ARK).
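As one illustration of externally unique identifiers, UUIDs can be minted with Python's standard `uuid` module. This is only a sketch: the function names are hypothetical, and deriving a stable identifier from the external record's URL is an assumption, not something the story prescribes.

```python
import uuid


def mint_identifier() -> str:
    """Mint a fresh random (version 4) UUID in URN form."""
    return uuid.uuid4().urn


def stable_identifier(external_url: str) -> str:
    """Derive a repeatable (version 5) UUID from an external record URL."""
    return uuid.uuid5(uuid.NAMESPACE_URL, external_url).urn


record_url = "https://metadata.example.org/records/spectro%3A1234/DESCRIPTION"
print(mint_identifier())              # differs on every call
print(stable_identifier(record_url))  # same value for the same URL
```

A name-based UUID lets Fedora and the external system arrive at the same identifier independently, whereas a random UUID has to be minted by one side and passed to the other.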

 

This use case compiles stories from:

4 Comments

  1. Interesting. Storing the object in Fedora, but not the metadata. I would always do the exact opposite: make Fedora a linked-data-compatible metadata store and persist the data in appropriate data stores (e.g. a Hadoop cluster, Amazon Glacier, a data warehouse). Storing a terabyte of research data per day in Fedora 3.6.x is an exercise I would rather avoid. In my eyes, Fedora should provide services that add an information layer on top of a distributed set of highly specialized, yet dumb (from the standpoint of information management) data stores. Such services might include

    • metadata enrichment (augmenting "raw" objects that sit somewhere else with metadata)
    • provenance management
    • (distributed) authentication and authorization 
    • data directory (hide the complexity of the underlying distributed data stores) 
    • keeping track of changes to the data for a centralized audit trail/provenance information
    • expose object metadata as linked data

    I definitely see the need to uniquely identify resources in Fedora, and I very much embrace the idea of using resources of the semantic web to describe repository contents. I see the first as a prerequisite for the latter. Both form the basis for the services sketched above.

    I am not sure that I need to serialize the object data and the augmented metadata in a single Fedora-thing. Even today, this is just an option, not a necessity in Fedora.

    1. Yes, there's no requirement that any of this be serialized in Fedora-thing. Even in current Fedora, we would use external datastreams for this object data.

      1. I agree. The logical separation is between simple objects (like files in a file system, or S3 objects), each consisting of a bytestream and a container of simple metadata, and the rich objects above them (compound objects, relationships) that make sense of the simple objects. In turn, the rich objects can be serialized into simple objects. Both simple objects and rich objects can be run through feature extraction (data and metadata) to populate indices, which removes latency and helps in other use cases. There are many implementations of, and a lot of existing software support for, such a fundamental architecture.

  2. This is a complex use case that will likely need further discussion. We need someone to "own" this use case so we can work on it over time.