Overview

This section is for discussion and description of a proposed "high-level" storage interface. The notion of a new storage interface originated with use cases that were hard to satisfy using the current implementation in Fedora. Subsequent discussions led to the realization that "storage" is a much larger issue, with great potential both within Fedora and within the wider infrastructure. In this section you will find references to the earliest work with respect to Fedora and draft proposals aimed at concrete implementations for the Fedora Repository. You can also find, and participate in, the wider discussion about "storage" that grew out of our initial attempts to design the Fedora-specific implementation, which led us to understand that this subject extends far beyond Fedora. In a nutshell, this forum aims to define an architecture for storage (a.k.a. persistence) that is of general use and is suitable to support short- and long-term access and persistence of digital assets regardless of the underlying physical mechanism. At the same time, new persistence components need to be implemented immediately to meet current needs, so this forum will also be used to design and facilitate the implementation of usable components. To keep this page short, it will mostly consist of organizing links to other pages. Please note that this work will cross-link to other participants, notably Fedora Create, the Data Conservancy, Policy Driven Repository Interoperability (PoDRI) in conjunction with iRODS, and others who will be named shortly.

Use Cases

This section will lead to use case pages which inform the discussion, both Fedora-directed use cases and use cases beyond Fedora's scope.

Glossary

A common nomenclature is needed to facilitate understanding. It is best to use common terms, but consistency within this discussion is more important when a term has multiple definitions in common use.

Documents
Issues for Discussion

In a nutshell, this proposal aims to remove certain hard-coded storage assumptions in Fedora and present a storage layer/interface that provides a place for extensions to Fedora that implement multiplexing, non-blob storage, lock-free updates, cloud storage, etc.
Implementation Plan

Since the storage layer resides beneath the object management (DOManager) layer in Fedora, adopting HighLevelStorage implies creating an alternate DOManager instance that interacts with HighLevelStorage rather than ILowLevelStorage. Ideally, this alternate DOManager would be a simple drop-in replacement for the existing DefaultDOManager. Initial development of HighLevelStorage could then be largely independent of the core Fedora code, and deployment would be enabled by a simple configuration change. Unfortunately, this is not easily possible today due to unnecessary coupling between certain Fedora components and an abundance of unrelated functionality in DOManager that can and should exist elsewhere. These issues would need to be addressed in order to create a truly pluggable DOManager. While HighLevelStorage is not scheduled to be a feature of Fedora 3.4, it may be developed concurrently with, or slightly after, the 3.4 release. As many of the prerequisites for drop-in replacement of DOManager are general improvements to the Fedora code base that are not storage-specific, there is distinct appeal in incorporating these basic improvements into the core in time for Fedora 3.4. With these prerequisites in place, work on HighLevelStorage could proceed entirely as an add-on/replacement module, hopefully without further changes to the core. Combined with Fedora's enhanced modular architecture, this would potentially allow HighLevelStorage to be distributed as an add-on bundle to Fedora 3.4 for evaluation or testing before it becomes a core feature.

Relevant Tracker Items

Need to figure out how to link to the issues using the new Jira version!
9 Comments
Matt Zumwalt
This has some overlap with the work at the Library of Congress on an "Inventory Service". See their paper "A Set of Transfer-Related Services" in D-Lib Magazine, Jan/Feb 2009.
Also - keep an eye out for ways in which this might allow us to support transactionality -- at least for the content that we want to store in a transactional system.
Asger Askov Blekinge
This is my suggestion for the Java interface of the Fedora Object Model.
Outstanding questions:
Relations: should they be properties or still live as datastreams? I feel they should be in a datastream, but RELS-INT and RELS-EXT should be combined. Still, they are used so often that there should be support functions.
DataSource: is this the correct way to represent the contents of a datastream?
Are lists of properties the correct way to go, or should we name the specific properties, like SIZE, FORMAT_URI, STATE, LABEL and so on?
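To make the questions above concrete, here is one possible shape for the object model being discussed. This is an illustrative sketch only: the names (DataSource, Datastream, DigitalObject) and the choice of open-ended property maps are assumptions, not the actual proposed interface.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.List;
import java.util.Map;

// A DataSource represents the contents of a datastream.
interface DataSource {
    InputStream newInputStream();   // fresh stream over the contents
    long length();                  // -1 if unknown
}

// Minimal in-memory DataSource so the sketch is runnable.
final class ByteDataSource implements DataSource {
    private final byte[] bytes;
    ByteDataSource(byte[] bytes) { this.bytes = bytes; }
    public InputStream newInputStream() { return new ByteArrayInputStream(bytes); }
    public long length() { return bytes.length; }
}

interface Datastream {
    String id();                        // e.g. "DC", or a combined "RELS" stream
    Map<String, String> properties();   // open-ended list, vs. named getters
    DataSource content();
}

interface DigitalObject {
    String pid();
    Map<String, String> properties();   // STATE, LABEL, ... as plain properties
    List<Datastream> datastreams();
}
```

The open property map keeps the interface small but gives up compile-time checking; named getters for SIZE, FORMAT_URI, STATE, and LABEL would be the opposite trade-off.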
Mark Diggory
I like the idea of the RELS-EXT / RELS-INT... but wonder whether these are specific to just "RELS", or apply to any RDF in general.
Per DigitalObject <-> DataStream <-> DataSource
We ended up with the DSpace StorageService as:
StorageEntity = an identifier
StorageProperty = a triple of <StorageEntity> <StoragePropertyName> <Object>
The PropertyStorageService is property-centric; the BinaryStorageService is content-centric.
http://scm.dspace.org/svn/repo/modules/dspace-storage/trunk/api/src/main/java/org/dspace/services/PropertyStorageService.java
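The triple-style model described above might look roughly like the following. These types and field names are assumptions reconstructed from the description, not copied from the linked PropertyStorageService.

```java
import java.util.Objects;

// A StorageEntity is just an identifier.
final class StorageEntity {
    final String id;
    StorageEntity(String id) { this.id = Objects.requireNonNull(id); }
}

// A StorageProperty is a <entity, propertyName, value> triple,
// much like an RDF statement.
final class StorageProperty {
    final StorageEntity subject;   // the entity the property is about
    final String name;             // the StoragePropertyName
    final Object value;            // a literal, or another entity
    StorageProperty(StorageEntity subject, String name, Object value) {
        this.subject = subject;
        this.name = name;
        this.value = value;
    }
}
```

A property-centric service would query and store StorageProperty triples; a content-centric one would attach binary streams to StorageEntity identifiers.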
Asger Askov Blekinge
This is my proposal for the storage interfaces.
Nothing new here, just restating the ReadableStore interface
Here the fun begins
Issues to note:
Result: Information about what has happened and where the digital object is stored.
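A hedged sketch of the read/write split with a Result object carrying "what has happened and where the digital object is stored". The method names, the Result fields, and the in-memory implementation are assumptions for illustration, not the actual proposed interfaces.

```java
import java.util.HashMap;
import java.util.Map;

// Information about what happened and where the object is stored.
final class Result {
    final boolean succeeded;
    final String storedAt;   // where the digital object ended up
    Result(boolean succeeded, String storedAt) {
        this.succeeded = succeeded;
        this.storedAt = storedAt;
    }
}

interface ReadableStore {
    byte[] read(String pid);
}

// "Here the fun begins": the writable side returns a Result per operation.
interface WritableStore extends ReadableStore {
    Result add(String pid, byte[] object);
    Result delete(String pid);
}

// Trivial in-memory implementation, just to make the sketch concrete.
final class MemoryStore implements WritableStore {
    private final Map<String, byte[]> objects = new HashMap<>();
    public byte[] read(String pid) { return objects.get(pid); }
    public Result add(String pid, byte[] object) {
        objects.put(pid, object);
        return new Result(true, "memory:" + pid);
    }
    public Result delete(String pid) {
        return new Result(objects.remove(pid) != null, null);
    }
}
```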
Asger Askov Blekinge
Transactions seemed to be a big thing to the community. These are my current thoughts on the subject.
The purpose of this interface is to address stores that can undo changes. Undo differs from update in that undo should leave the repository in the same state as if the change had never been made.
One interface or two, I have not decided yet. The point is that one can begin a transaction, make a number of changes as part of that transaction, and commit or roll back the changes.
An asynch store should implement this interface. The three modifying operations will each return a Result indicating that the writes are postponed. The status method lets you use this Result object to look up the current state.
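The two ideas above could be combined roughly as follows: begin/commit/rollback on one interface, plus a status() lookup keyed by the Result of a postponed (asynchronous) write. All names here are illustrative assumptions, and the in-memory model is synchronous, just to show the state transitions.

```java
import java.util.HashMap;
import java.util.Map;

enum Status { POSTPONED, COMPLETED, FAILED, ROLLED_BACK }

// The Result of a postponed write doubles as a ticket for status lookup.
final class Result {
    final long ticket;
    Result(long ticket) { this.ticket = ticket; }
}

interface TransactionalStore {
    void begin();
    Result add(String pid, byte[] object);   // returns immediately: POSTPONED
    void commit();                           // postponed writes become COMPLETED
    void rollback();                         // as if the changes were never made
    Status status(Result r);
}

// Minimal in-memory model of the behaviour, not a real asynch store.
final class MemoryTxStore implements TransactionalStore {
    private final Map<Long, Status> statuses = new HashMap<>();
    private long next = 0;
    public void begin() {}
    public Result add(String pid, byte[] object) {
        Result r = new Result(next++);
        statuses.put(r.ticket, Status.POSTPONED);
        return r;
    }
    public void commit() {
        statuses.replaceAll((t, s) -> s == Status.POSTPONED ? Status.COMPLETED : s);
    }
    public void rollback() {
        statuses.replaceAll((t, s) -> s == Status.POSTPONED ? Status.ROLLED_BACK : s);
    }
    public Status status(Result r) { return statuses.get(r.ticket); }
}
```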
Mark Diggory
In the DSpace service model, a certain degree of transactionality is captured in the system. In DSpace, request = transaction window; services can bind a listener to the request, and the completion of tasks completes or rolls back the transaction. All "services" in DSpace operate within this transactional window.
DSpace 2.0 Core Services
RequestService
In DS2 a request is the concept of a request (HTTP) or an atomic transaction in the system. It is likely to be an HTTP request in many cases, but it does not have to be. This service gives the core services a way to manage atomic transactions, so that when a request comes in which requires multiple things to happen, they can either all succeed or all fail without each service attempting to manage this independently. In a nutshell, this simply allows identification of the current request and the ability to discover whether it succeeded or failed when it ends. Nothing in the system will enforce usage of the service, but we encourage developers who are interacting with the system to make use of it, so they know whether the request they are participating in has succeeded or failed and can take appropriate action.
http://scm.dspace.org/svn/repo/dspace2/core/trunk/api/src/main/java/org/dspace/services/RequestService.java
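A minimal sketch of the request-as-transaction-window idea: services register a listener on the current request and are told at the end whether the whole request succeeded or failed. The names here are illustrative and are not the real DSpace 2.0 RequestService API (see the link above for that).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// A service binds one of these to the current request.
interface RequestListener {
    void onEnd(String requestId, boolean succeeded);
}

final class SimpleRequestService {
    private final List<RequestListener> listeners = new ArrayList<>();
    private String currentId;

    String startRequest() {
        currentId = UUID.randomUUID().toString();
        return currentId;
    }
    String getCurrentRequestId() { return currentId; }
    void registerListener(RequestListener l) { listeners.add(l); }

    // All listeners learn the outcome together: commit or roll back.
    void endRequest(boolean succeeded) {
        for (RequestListener l : listeners) l.onEnd(currentId, succeeded);
        listeners.clear();
        currentId = null;
    }
}
```

Nothing forces a service to use the listener, which mirrors the "nothing in the system will enforce usage" point above.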
Asger Askov Blekinge
Context
If the store is to perform messaging, authorization, or audit-trail recording, it needs to know the context of the changes. For this reason, EVERY method should take an additional parameter, Context, probably equivalent to the current Fedora class of the same name.
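Threading a Context through every modifying method might look like this. The fields shown are assumptions loosely modeled on the idea of Fedora's Context class, not its actual contents.

```java
// Who is making the change, and under which request - enough for
// authorization checks, audit-trail entries, and message correlation.
final class Context {
    final String user;        // for authorization and the audit trail
    final String requestId;   // for messaging / correlation
    Context(String user, String requestId) {
        this.user = user;
        this.requestId = requestId;
    }
}

// Every method takes the Context as its first parameter.
interface ContextualStore {
    void add(Context ctx, String pid, byte[] object);
    void update(Context ctx, String pid, byte[] object);
    void delete(Context ctx, String pid);
}
```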
Mark Diggory
I'm always going to step in and talk about how far we got with the StorageService in DSpace 2.0 and the backporting... just to get some perspective out there on similar work on that side...
Original DSpace 2.0 Storage API by Aaron Zeckoski
Read/Write interfaces, versioning, searching, etc.
http://scm.dspace.org/svn/repo/dspace2/core/trunk/api/src/main/java/org/dspace/services/mixins/
DSpace 2.0 Modelling Services
DSpace 2.0 Expressing DSpace Domain Model In RDF
A Google Summer of Code (GSoC) project that led to a simpler API that is more triplestore-like. Services focus on Metadata vs. Binary Data, not Read vs. Write.
http://scm.dspace.org/svn/repo/modules/dspace-storage/trunk/api/src/main/java/org/dspace/services/
GSOC10 - Backport of DSpace 2 Storage Services API for DSpace 1.x
GSOC10 - Storage Service Implementations Based on Semantic Content Repository
These represent the storage backend we would eventually want to see for DSpace 2.x to wire onto a storage system. It moves DSpace away from the rigid, hard-coded DSpaceObject data model and allows DSpace applications to define any graph of objects with properties that represent content; it could apply to JCR, Fedora, triplestores, etc.
How do we learn from this body of work and bring it into the Fedora HLS work so it can be informed by what is happening in the community at large?
Greg Jansen
I'd like to be able to provide alternate strategies for copying datastreams from external locations to managed locations, or just between locations more generally. It would be nice if we could achieve something like that without modifications to the high-level storage module, only by configuring some sort of custom transfer strategy class.
Perhaps let DataSources be POJOs, then allow various copy strategies to intercept the copy function for the DataSources of their choice. This would allow more efficient transfers in the many cases where streaming through Fedora is not optimal.
In my particular case I'm thinking of a special strategy for staged files: a rename, followed by grid-based replication. We ingest most of our data through a staging server within a grid, so the effect for us can be ingest without having to physically move data at ingest time. We would intercept the copy function whenever both DataSources were locations within the same grid; our strategy would perform a logical move to the managed location, set permissions, then trigger post-runtime replication.
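One possible shape for this pluggable transfer idea: each strategy declares which pairs of locations it can handle (for example, both inside the same grid), and the first match performs the copy, which may really be a logical move. All names here are hypothetical, and the grid strategy's body is elided.

```java
import java.util.List;

interface TransferStrategy {
    boolean accepts(String fromLocation, String toLocation);
    void transfer(String fromLocation, String toLocation);
}

final class TransferDispatcher {
    private final List<TransferStrategy> strategies;
    TransferDispatcher(List<TransferStrategy> strategies) {
        this.strategies = strategies;
    }

    // Pick the first strategy willing to handle this pair of locations.
    TransferStrategy select(String from, String to) {
        for (TransferStrategy s : strategies) {
            if (s.accepts(from, to)) return s;
        }
        throw new IllegalArgumentException(
            "no transfer strategy for " + from + " -> " + to);
    }
}

// Example: a grid-local strategy that would do a rename / logical move
// instead of streaming bytes through Fedora. The "grid:" scheme is an
// assumption for illustration.
final class GridMoveStrategy implements TransferStrategy {
    public boolean accepts(String from, String to) {
        return from.startsWith("grid:") && to.startsWith("grid:");
    }
    public void transfer(String from, String to) {
        // logical move to the managed location, set permissions,
        // then trigger post-runtime replication (elided)
    }
}
```

A default streaming strategy registered last would keep current behavior for every pair no other strategy claims.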