DSpace 2.0 Storage Service Implementations Based on Semantic Content Repository - Yigang Zhou
Develop DSpace storage service implementations based on semantic content repositories (TripleStore). - Yigang Zhou
Abstract
On the one hand, DSpace 2.0 has a generalized storage service API which allows a DSpace 2.0 repository to use many possible systems to store digital repository data. On the other hand, semantic content repositories (triplestores) such as Mulgara, Sesame and Tupelo are available for semantic data storage and are well suited to storing DSpace blobs and metadata represented in the form of RDF triples. In this project, I will develop DSpace storage service implementations based on semantic content repositories. Finally, I will cooperate with Andrius Blažinskas, who is working on another GSoC 2010 project to back-port the DSpace 2.0 storage interfaces to 1.x, so that the triplestore storage service is ready to use in DSpace 1.x.
| Project Title: | DSpace 2.0 Storage Service Implementations Based on Semantic Content Repository |
|---|---|
| Student: | Yigang Zhou, Wuhan University, P.R. China |
| Mentors: | Mark Diggory |
| Contacting author: | egang DOT zhou AT gmail DOT com |
| SCM Location for Project: | http://scm.dspace.org/svn/repo/sandbox/gsoc/2010/triplestore/ |
Architecture
The design principles of the architecture are:
- The triplestore StorageService should be compatible with all kinds of semantic data storages (e.g. Sesame, Jena, etc.) through different configuration settings.
- New semantic data storages (e.g. Mulgara) can be plugged into the architecture easily, without much effort and without modifying the API code.
- The triplestore StorageService/BinaryStorageService should be able to accommodate both StorageEntity/StorageProperty for entity metadata (in the form of RDF triples) and StorageBinary for blobs (binary/textual data).
This architecture is quite similar to that of JackrabbitStorageService, which sits in front of all kinds of PersistenceManagers for different databases.
As shown in Figure 1, a TupeloService holds a reference to a Context (i.e. a triplestore instance in Tupelo) object to support low-level triple/blob operations. The Context is actually a UnionContext, which combines a sub-Context A (e.g. SesameContext or MulgaraContext) for RDF triples with a blob-related sub-Context B (e.g. FileContext). Additionally, high-level functions such as read/write transactions and object-triple mapping are also provided by TupeloService through ThingSession and BeanSession, powered by Tupelo.
Based on TupeloService, TupeloStorageService supports both StorageService and BinaryStorageService. All functions related to StorageService are dispatched to Context A, while those of BinaryStorageService are delivered to Context B. The choice of Context A and Context B is flexible, and there are no restrictions on how they are combined. For example, we can use SesameContext, Sesame2Context or PersistenceJenaContext for Context A, with HashFileContext or DatabaseContext as Context B. We can also use Spring configuration to inject Context A and B into the UnionContext (a Java sketch of this composition follows at the end of this section). Currently there is no Mulgara implementation of a Tupelo Context, but I can develop a new MulgaraContext in this GSoC project. The new Context will not affect the source code of TupeloService or TupeloStorageService at all; it can easily be plugged into the architecture through Spring configuration.
We separate TupeloService from TupeloStorageService so that the StorageService API is decoupled from the different semantic triplestore implementations.
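The composition described above might look roughly as follows when assembled directly in Java rather than through Spring. This is a minimal sketch: the Tupelo package locations, the UnionContext.addChild() call, the Context setup and the TupeloService/TupeloServiceImpl packages are assumptions; only the overall wiring (triples in Context A, blobs in Context B, both behind one UnionContext handed to TupeloService) reflects the design above.

```java
import org.dspace.services.TupeloService;        // assumed package for the project's TupeloService API
import org.dspace.tupelo.TupeloServiceImpl;      // implementation class named in the Spring example later
import org.tupeloproject.kernel.Context;
import org.tupeloproject.kernel.FileContext;     // assumed location; blob-backed Context B
import org.tupeloproject.kernel.UnionContext;    // assumed location; combines the two sub-Contexts
import org.tupeloproject.sesame.SesameContext;   // assumed location; triple-backed Context A

public class TripleStoreWiringSketch {
    public static TupeloService buildService() throws Exception {
        Context triples = new SesameContext();   // Context A: RDF triples (metadata)
        Context blobs = new FileContext();       // Context B: blobs (binary/textual data)
        // (configuration of the individual Contexts, e.g. repository/directory, is omitted here)

        // Combine both sub-Contexts; TupeloStorageService later dispatches
        // StorageService calls to A and BinaryStorageService calls to B.
        UnionContext union = new UnionContext();
        union.addChild(triples);                 // assumed API for registering a child Context
        union.addChild(blobs);

        // Constructor arguments follow the Spring configuration shown later:
        // the combined Context plus an access policy string.
        return new TupeloServiceImpl(union, "access");
    }
}
```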
TupeloService
TupeloService Functions
As shown in the following table, TupeloService provides both triple- and blob-related functions at two levels. Low-level functions are process-oriented: different Operators (e.g. TripleCounter, BlobFetcher) can be performed via the TupeloService.perform() method. High-level functions are object-oriented: users can create a ThingSession or BeanSession to manipulate Thing objects (i.e. wrappers around RDF Resources) and make batch updates within a session. With BeanSession in particular, users can define an object-triple mapping and Tupelo will automatically manage the transformation from triples to objects/beans and vice versa. A short usage sketch follows the table.
| | Low Level Functions | High Level Functions |
|---|---|---|
| | void TupeloService.perform(Operator operator) | ThingSession TupeloService.createThingSession(), … |
| Triple Functions | TripleCounter, TripleMatcher, TripleFetcher, TripleWriter | Set<Thing> ThingSession.getThings(Resource predicate, Object value), … |
| Blob Functions | BlobFetcher, BlobIterater, BlobRemover, BlobWriter | ThingSession.removeBlob(Resource subject), … |
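The sketch below illustrates the two levels using the method signatures from the table. It is only a sketch: the Operator result accessor, the Resource.uriRef() factory, the session close() call and all package locations are assumptions, while perform(), createThingSession() and getThings() are taken from the table above.

```java
import java.util.Set;

import org.dspace.services.TupeloService;      // assumed package for the project's TupeloService
import org.tupeloproject.kernel.Thing;         // assumed package for Tupelo's Thing wrapper
import org.tupeloproject.kernel.ThingSession;  // assumed package for Tupelo's ThingSession
import org.tupeloproject.kernel.TripleCounter; // assumed package for the low-level Operator
import org.tupeloproject.rdf.Resource;         // assumed package for Tupelo's Resource

public class TupeloServiceUsageSketch {

    // Low level (process-oriented): hand an Operator to perform().
    public static long countTriples(TupeloService service) throws Exception {
        TripleCounter counter = new TripleCounter();  // Operator named in the table above
        service.perform(counter);                     // void TupeloService.perform(Operator operator)
        return counter.getResult();                   // assumed accessor for the operator's result
    }

    // High level (object-oriented): manipulate Things through a session.
    public static Set<Thing> findByTitle(TupeloService service, String title) throws Exception {
        ThingSession session = service.createThingSession();                        // from the table above
        Resource dcTitle = Resource.uriRef("http://purl.org/dc/elements/1.1/title"); // assumed factory
        Set<Thing> matches = session.getThings(dcTitle, title);                     // from the table above
        session.close();                              // assumed: end the session (batch updates flushed)
        return matches;
    }
}
```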
TupeloService Access Policy
These policies define how a TupeloService accesses its underlying Tupelo Context, which may or may not already exist. The policies are: OPEN, ACCESS and RENEW.
| Policy | Description |
|---|---|
| OPEN | Create a TupeloService instance that accesses an existing Tupelo Context; the Context must already exist. |
| ACCESS | Connect a TupeloService instance to a Tupelo Context, whether or not the Context already exists. |
| RENEW | Create a TupeloService instance that accesses a freshly created (renewed) Tupelo Context. |
The policy can be configured in Spring as an argument of the TupeloService constructor:

    <bean id="org.dspace.services.TupeloService" class="org.dspace.tupelo.TupeloServiceImpl">
      <constructor-arg ref="org.tupeloproject.kernel.Context" />
      <constructor-arg value="renew" />
    </bean>
Life Cycle Control of Tupelo Context
The life cycle methods of Tupelo Context are as follows:
| Method | Description |
|---|---|
| void Context.initialize() | Initialize the Context; some Context implementations require persistent resources (for instance, storage space) that are set up here. |
| boolean Context.open() | Acquire resources necessary for performing operations (e.g., a database connection). |
| boolean Context.close() | Dispose of any resources held by the Context. |
| void Context.destroy() | Release any persistent resources associated with this Context. |
The access policy of TupeloService requires the Tupelo Context to strictly control its life cycle; Tupelo itself actually recommends working this way. However, not all Context implementations follow this principle. For example, UnionContext lacks life cycle control methods; we can use "org.dspace.tupelo.context.LifeCycleControlledUnionContext" instead to solve this problem. In short, every Tupelo Context used by TupeloService MUST control its life cycle appropriately, or unexpected behavior will occur when starting up, accessing or shutting down the TupeloService.
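A minimal sketch of the expected life-cycle discipline, using only the methods listed in the table above (how the Context instance is obtained, and error handling, are left out):

```java
import org.tupeloproject.kernel.Context;

public class ContextLifeCycleSketch {
    // The order TupeloService relies on: initialize once, open before use,
    // close after use, destroy only when the storage is retired for good.
    public static void runWithContext(Context context) throws Exception {
        context.initialize();                 // set up persistent resources (e.g. storage space)
        if (!context.open()) {                // acquire runtime resources (e.g. a database connection)
            throw new IllegalStateException("Tupelo Context could not be opened");
        }
        try {
            // ... perform triple/blob operations through TupeloService here ...
        } finally {
            context.close();                  // dispose of any resources held by the Context
        }
        // Only when the repository is decommissioned:
        // context.destroy();                 // release persistent resources
    }
}
```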
Discussions
TripleStore RDF API Battles
Many triplestore implementations are "battling" to be the definitive API that one implements against, with the others configured as underlying storage. The state of the art of mainstream Java-based triplestores is summarised as follows:
- Tupelo defines its own RDF API and uses Jena and Sesame as underlying storage in the form of Contexts.
- AllegroGraph defines its own RDF API and provides Jena and Sesame wrapper classes so that users can access AllegroGraph through the Jena or Sesame APIs.
- Mulgara uses JRDF's RDF API and provides a bridge to the Jena API.
- Jena and Sesame each define their own standalone RDF APIs.
In a word, there is no standard Java-based RDF API, and Tupelo's is no exception. Nevertheless, we choose Tupelo as the facade/gateway for the DSpace triplestore because:
- Tupelo is designed from the ground up to be a pluggable RDF storage solution. Other triplestores can easily be added as back-ends by extending Context, BaseContext, BasicLocalContext, etc.
- Other triplestores cannot store blobs, since blobs (e.g. binary files) cannot be encoded as RDF triples. Tupelo, however, supports blob storage natively and integrates it well with triple-based metadata storage (using UnionContext).
- Some triplestores support other RDF APIs through bridges or wrappers, but do not allow other stores to be plugged in as underlying storage.
Mappings Between DSpace and Tupelo
DSpace 2.0 defines StorageEntity, StorageProperty, StorageRelation and StorageBinary. How should they be mapped onto the Tupelo RDF storage API?
Map StorageEntity to a Tupelo Resource.
The entity path(s) are stored as metadata triples, and the entityId is encoded into a Tupelo Resource URI (a sketch follows the table).
| DSpace | Tupelo |
|---|---|
| StorageEntity | Tupelo Resource |
| entityId | URI: "http://www.dspace.org/rdf/entity#" + escape(entityId) |
| entity path(s) | ds:hasPath (xsd:string) |
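A hedged sketch of this mapping follows. The base URI and the ds:hasPath predicate come from the table above; the Resource.uriRef()/Resource.literal() factories, the Triple constructor, the full ds:hasPath URI and the escape() helper are illustrative assumptions.

```java
import java.net.URLEncoder;

import org.tupeloproject.rdf.Resource;  // assumed package for Tupelo's Resource
import org.tupeloproject.rdf.Triple;    // assumed package for Tupelo's Triple

public class EntityMappingSketch {
    static final String ENTITY_BASE = "http://www.dspace.org/rdf/entity#";
    static final Resource DS_HAS_PATH =
            Resource.uriRef("http://www.dspace.org/rdf/property#hasPath");  // hypothetical URI for ds:hasPath

    // entityId -> Tupelo Resource URI, as in the table above.
    public static Resource entityResource(String entityId) throws Exception {
        return Resource.uriRef(ENTITY_BASE + escape(entityId));
    }

    // One ds:hasPath triple per entity path, with an xsd:string literal object.
    public static Triple pathTriple(String entityId, String path) throws Exception {
        return new Triple(entityResource(entityId), DS_HAS_PATH, Resource.literal(path)); // assumed constructor/factories
    }

    // Illustrative escaping: percent-encode so the id is safe inside a URI fragment.
    static String escape(String value) throws Exception {
        return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
    }
}
```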
Map StorageProperty to a Tupelo metadata triple.
The datatype of the triple's object depends on the type of the StorageProperty value, and the StorageProperty name is encoded into a Tupelo Resource URI (a sketch of the datatype selection follows the table).
| DSpace | Tupelo |
|---|---|
| StorageProperty (different types) | Tupelo Triple |
| name | URI: "http://www.dspace.org/rdf/property#" + escape(name) |
| StorageEntity.class | Tupelo Resource |
| Boolean.class | xsd:boolean |
| Double.class | xsd:double |
| ... | ... |
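As a rough illustration of the value-type column above, a mapping routine might pick the RDF datatype like this. This is a sketch: the StorageEntity import location is an assumption, the type list is not exhaustive, and the fallback case is a guess.

```java
import org.dspace.storage.StorageEntity;  // assumed package for DSpace 2.0's StorageEntity

public class PropertyDatatypeSketch {
    static final String XSD = "http://www.w3.org/2001/XMLSchema#";

    // Returns the xsd datatype URI for a literal-valued property,
    // or null when the value is a StorageEntity and the object should be a Tupelo Resource.
    public static String datatypeFor(Class<?> valueType) {
        if (StorageEntity.class.isAssignableFrom(valueType)) {
            return null;                     // object is a Resource reference, not a typed literal
        }
        if (Boolean.class.equals(valueType)) {
            return XSD + "boolean";
        }
        if (Double.class.equals(valueType)) {
            return XSD + "double";
        }
        // ... further value types map to the corresponding xsd datatypes ...
        return XSD + "string";               // assumed fallback for plain string values
    }
}
```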
The property name is mapped to an RDF Resource. There are three cases:
(a) the namespace prefix is recognized by the NamespaceMapping, e.g.:
dc:title -> "http://purl.org/dc/elements/1.1/title"
(b) the namespace prefix is not recognized by the NamespaceMapping, e.g.:
eg:title -> "http://www.dspace.org/rdf/property#eg:title"
(c) there is no namespace prefix, e.g.:
book title -> "http://www.dspace.org/rdf/property#book%20title"
Cases (b) and (c) are handled with the same policy.
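The three cases might be implemented roughly as follows. This is a sketch: the NamespaceMapping is modeled as a plain Map, and the escape() helper only handles the space character so that the output matches the examples above (a real implementation would escape more characters).

```java
import java.util.Map;

public class PropertyNameMappingSketch {
    static final String PROPERTY_BASE = "http://www.dspace.org/rdf/property#";

    // namespaces: prefix -> namespace URI, e.g. "dc" -> "http://purl.org/dc/elements/1.1/"
    public static String propertyUri(String name, Map<String, String> namespaces) {
        int colon = name.indexOf(':');
        if (colon > 0) {
            String prefix = name.substring(0, colon);
            String localName = name.substring(colon + 1);
            String namespace = namespaces.get(prefix);
            if (namespace != null) {
                // case (a): recognized prefix, e.g. dc:title -> http://purl.org/dc/elements/1.1/title
                return namespace + localName;
            }
        }
        // cases (b) and (c): unrecognized or missing prefix; fall back to the
        // DSpace property namespace, e.g. "book title" -> ...#book%20title
        return PROPERTY_BASE + escape(name);
    }

    // Minimal escaping that matches the examples above; ':' is kept as-is in case (b).
    static String escape(String name) {
        return name.replace(" ", "%20");
    }
}
```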
Map StorageBinary to a Tupelo Blob Resource (a sketch follows the table).
| DSpace | Tupelo |
|---|---|
| StorageBinary | Tupelo Blob Resource (e.g. in the file system) |
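A hedged sketch of the blob side, built on the low-level BlobWriter operator from the TupeloService function table. Only perform(Operator) comes from that table; the BlobWriter setters and the package locations are assumptions.

```java
import java.io.InputStream;

import org.dspace.services.TupeloService;     // assumed package for the project's TupeloService
import org.tupeloproject.kernel.BlobWriter;   // assumed package for the BlobWriter operator
import org.tupeloproject.rdf.Resource;        // assumed package for Tupelo's Resource

public class BinaryMappingSketch {
    // Store the binary stream as a Tupelo blob attached to the owning entity's Resource URI.
    public static void writeBinary(TupeloService service, Resource entityUri, InputStream content)
            throws Exception {
        BlobWriter writer = new BlobWriter();  // Operator named in the TupeloService function table
        writer.setSubject(entityUri);          // assumed setter: the Resource the blob belongs to
        writer.setInputStream(content);        // assumed setter: the bytes to store
        service.perform(writer);               // void TupeloService.perform(Operator operator)
    }
}
```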
Project Plan
- Before mid evaluation (July 14th):
  - Design and develop TupeloService
  - Design and develop TupeloStorageService
  - Test TupeloStorageService with existing triplestore Contexts, e.g. SesameContext, JenaContext
- After mid evaluation:
  - Develop MulgaraContext
  - Work with Andrius on back-porting the DSpace 2.0 storage interfaces to 1.x.
Future Work
The DSpace Storage API has recently been modified in another GSoC 2010 project by Andrius. In the future, we should investigate how those changes affect the TupeloStorageService and make modifications if necessary.