
<?xml version="1.0" encoding="utf-8"?>
<html>
Title: DSpace & Fedora integration

...

DSpace and Fedora, two popular digital content repositories, are quite different in nature and have different data models, and each has its own advantages. Integrating them would open up wider possibilities for disseminating and managing digital content. When the repositories are used separately, digital content must be prepared and replicated for each of them. To avoid this replication, a driver must be implemented that allows one repository to access data through the other. Fully achieving this will take a lot of work, so my proposal is to create a working storage driver prototype for DSpace that allows storing, accessing and managing at least basic DSpace data in a Fedora repository, taking its relationships and associated policy into account.

...

Figure 1 shows the DSpace data model, a slightly extended version of the basic model (http://www.dspace.org/index.php?option=com_content&task=view&id=149). Several additional fields have been added to reflect database fields.
A possible mapping of the DSpace data model to Fedora's is shown in Figure 2. In the diagram, every Fedora object representing a DSpace entity has a RELS-EXT datastream. A fragment of its XML contents is included to show how relationships between objects will be implemented physically.
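
As a rough illustration of what such a RELS-EXT fragment might contain, the sketch below builds a minimal RDF description stating that one object is a member of another. The function name `build_rels_ext` is hypothetical, the PIDs are examples in the style used on this page, and the relations namespace is assumed to be Fedora's standard `relations-external` one; this is not the driver's actual code.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumed: Fedora's standard namespace for RELS-EXT relations.
RELS_EXT_NS = "info:fedora/fedora-system:def/relations-external#"


def build_rels_ext(pid, parent_pid, relation="isMemberOf"):
    """Return RELS-EXT RDF/XML relating `pid` to `parent_pid` (hypothetical helper)."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("rel", RELS_EXT_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    # The subject of the statement is the object that owns the datastream.
    desc = ET.SubElement(
        root,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": f"info:fedora/{pid}"},
    )
    # The relation element points at the parent object.
    ET.SubElement(
        desc,
        f"{{{RELS_EXT_NS}}}{relation}",
        {f"{{{RDF_NS}}}resource": f"info:fedora/{parent_pid}"},
    )
    return ET.tostring(root, encoding="unicode")


xml_text = build_rels_ext("demo:Item~1", "demo:Collection~1")
print(xml_text)
```

A custom relation could be produced the same way by passing a different `relation` name, although it would normally live in its own namespace rather than Fedora's.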

...

A DSpace Item entity has an associated XML file of qualified Dublin Core metadata. Fedora provides a default datastream with the identifier DC for Dublin Core metadata in every object, so it can be used to hold these fields.

[Figures 1 and 2: images not available]

<!--
It should also be noted that the Item entity has two types of relation with the Collection entity. In Fedora, simple relations between objects are expressed in RELS-EXT using the isMemberOf relation type. However, custom relations can easily be introduced, so an additional relation, isIncludedBy, is added here to emphasize inclusion rather than ownership. I am not sure whether using a custom relation is a good idea, but it works.
-->

Relations between mapped DSpace entities in Fedora can be found by searching the resource index with iTQL queries. An example of such a query:

Code Block
select $object from <#ri>
where  $object <fedora-rels-ext:isMemberOf> 
   <info:fedora/demo:Collection~123.456-789>
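
A query of this shape can be generated for any parent object. The following is a minimal sketch; `members_query` is a hypothetical helper name, and the only inputs are the parent PID and the relation name.

```python
def members_query(parent_pid, relation="isMemberOf"):
    """Build an iTQL query listing objects related to `parent_pid` (hypothetical helper)."""
    return (
        "select $object from <#ri>\n"
        f"where  $object <fedora-rels-ext:{relation}>\n"
        f"   <info:fedora/{parent_pid}>"
    )


query = members_query("demo:Collection~123.456-789")
print(query)
```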

...

Code Block
<?xml version="1.0" encoding="UTF-8" ?> 
<sparql xmlns="http://www.w3.org/2001/sw/DataAccess/rf1/result">
<head>
  <variable name="object" /> 
</head>
<results>
  <result>
    <object uri="info:fedora/demo:Item~213.456-789" /> 
  </result>
  <result>
    <object uri="info:fedora/demo:Item~223.456-789" /> 
  </result>
</results>
</sparql>
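
A client receiving such a result would typically reduce it to a list of PIDs. The sketch below does this with Python's standard XML parser, using the sample result above as input; the function name `member_pids` is hypothetical and no error handling is attempted.

```python
import xml.etree.ElementTree as ET

# Namespace of the result document, as it appears in the sample above.
NS = {"r": "http://www.w3.org/2001/sw/DataAccess/rf1/result"}
PREFIX = "info:fedora/"

SAMPLE = """<sparql xmlns="http://www.w3.org/2001/sw/DataAccess/rf1/result">
<head>
  <variable name="object" />
</head>
<results>
  <result>
    <object uri="info:fedora/demo:Item~213.456-789" />
  </result>
  <result>
    <object uri="info:fedora/demo:Item~223.456-789" />
  </result>
</results>
</sparql>"""


def member_pids(result_xml):
    """Extract the PIDs bound to $object from a resource index result document."""
    root = ET.fromstring(result_xml)
    uris = [obj.get("uri") for obj in root.findall("r:results/r:result/r:object", NS)]
    # Strip the info:fedora/ URI prefix to get the bare PID.
    return [uri[len(PREFIX):] for uri in uris if uri and uri.startswith(PREFIX)]


pids = member_pids(SAMPLE)
print(pids)
```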

<!--
A query for included Items can be formed in the same way:

Code Block
select $object from <#ri>
where  $object <fedora-rels-ext:isIncludedBy> 
   <info:fedora/demo:Collection~123.456-789>

...

DSpace Bitstreams are a trickier case. Basically, they are mapped to Fedora datastreams. When ingested, every Bitstream is put into a separate temporary Fedora object. Later, when the Bitstream is associated with an entity (Bundle, etc.), it is transferred to that entity's object as a Fedora datastream. In some special cases, when a Bitstream is linked to several entities, it is moved to and kept in a separate dedicated Fedora object, with relations in RELS-EXT to the parent entities.
This separate Bitstream object scenario also covers the case where Fedora is used only to store Bitstreams and a small associated metadata set, without preserving the full model structure (FedoraBitStore functionality only). In this case, the object is not temporary but permanent. However, the idea of one datastream (Bitstream) per object still isn't that attractive...

...

Code Block
<rdf:RDF xmlns:dspace="http://www.dspace.org/elements/" 
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         ...>
  <rdf:Description rdf:about="info:fedora/demo:Bundle~1">
    <dspace:bitstreamId>9</dspace:bitstreamId>
    ...
  </rdf:Description>
</rdf:RDF>

...

PIDs of Fedora objects that represent DSpace entities can be formed using the general pattern <Fedora namespace ID>:<DSpace entity type>~<DSpace entity ID>. At the moment, the DSpace entity IDs are internal DSpace identifiers (obtained with the getID() method). Example IDs are provided in Table 1.

<!--
A Fedora PID must satisfy the pattern:

Code Block
'([A-Za-z0-9]|-|\.)+:(([A-Za-z0-9])|-|\.|~|_|(%[0-9A-F]{2}))+'

...

So it does not allow some special characters, such as the slash ("/") used in DSpace handles. These characters must be escaped or replaced. Currently I have replaced "/" with "-".

The Bundle identifier is formed by combining the parent Item handle and the DSpace Bundle ID (possibly from the database), separated by an underscore ("_").
-->

The ID of a Fedora datastream representing a Bitstream can be formed in a similar way, using the pattern Bitstream.<Bitstream ID>, since the symbol "~" is not allowed in datastream IDs.

<!-- It is also possible to use Bitstream~313.456-789_7_24 as ID, but since part 313.456-789_7 will already be included in Fedora object (Bundle) ID, there is no need for replication.
-->

Table 1: Identifiers

Fedora entity representing DSpace entity | ID pattern                                       | ID example
-----------------------------------------|--------------------------------------------------|------------------
Fedora Object (Community)                | <Fedora namespace ID>:Community~<Community ID>   | demo:Community~1
Fedora Object (Collection)               | <Fedora namespace ID>:Collection~<Collection ID> | demo:Collection~1
Fedora Object (Item)                     | <Fedora namespace ID>:Item~<Item ID>             | demo:Item~1
Fedora Object (Bundle)                   | <Fedora namespace ID>:Bundle~<Bundle ID>         | demo:Bundle~1
Fedora Datastream (Bitstream)            | Bitstream.<Bitstream ID>                         | Bitstream.1
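
The identifier patterns in Table 1 could be generated and checked along the following lines. This is a minimal sketch, not the driver's code: `make_pid` and `make_datastream_id` are hypothetical names, and the regular expression is my understanding of the PID syntax from the Fedora documentation, so it should be verified against the Fedora version in use.

```python
import re

# Assumed Fedora PID syntax: namespace, a colon, then the object identifier.
PID_PATTERN = re.compile(
    r"([A-Za-z0-9]|-|\.)+:(([A-Za-z0-9])|-|\.|~|_|(%[0-9A-F]{2}))+"
)


def make_pid(namespace, entity_type, entity_id):
    """Form <namespace>:<entity type>~<entity ID> and reject invalid PIDs."""
    pid = f"{namespace}:{entity_type}~{entity_id}"
    if not PID_PATTERN.fullmatch(pid):
        raise ValueError(f"not a valid Fedora PID: {pid!r}")
    return pid


def make_datastream_id(bitstream_id):
    """Form a datastream ID for a Bitstream; '~' is avoided in datastream IDs."""
    return f"Bitstream.{bitstream_id}"


print(make_pid("demo", "Community", 1))
print(make_datastream_id(1))
```

An entity ID containing a character outside the pattern, such as "/", makes `make_pid` raise `ValueError`, so such IDs have to be escaped or replaced before the PID is formed.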

...

I propose to create a driver prototype that gives DSpace the ability to use a Fedora repository as primary storage for bitstreams and metadata. The driver classes will have the same method interfaces as the current DSpace "org.dspace.storage" package classes and will be accessed in the same manner. The driver will communicate directly with the Fedora repository using its SOAP APIs (API-A and API-M). <!--

To prevent software defects, all written code will be tested using JUnit. I will also provide code documentation.
-->

Comments (RLR):

The ways DSpace programmatically accesses bitstreams and metadata are very different. Bitstreams are treated as opaque simple objects (although a few additional properties are required, like a checksum). There is already some preliminary work on creating a clean abstraction to the underlying storage system (see http://wiki.dspace.org/index.php/PluggableStorage). I would recommend starting with this 'Bitstore' interface, since it will be incorporated into DSpace 1.6, and already supports several storage back-ends: filesystem, Storage Resource Broker, Amazon S3, and Sun's HoneyComb. The last two are essentially HTTP client calls, so they already resemble using the Fedora SOAP API.

But the metadata is another story: DSpace does very little to abstract away from direct JDBC/SQL calls into an RDBMS. I think here the question of a 'driver' is less obvious, and you might want to explore a few designs before committing a lot of work. For example: could the metadata be placed in a bitstream and stored through the other driver? This is not a functional mapping, but would satisfy e.g. a replication scenario. Should you attempt a high-level metadata abstraction that bypasses current DSpace (but could be retrofitted into it)? Etc. I am just throwing out thoughts to elicit additional discussion here.

<!--
After initial analysis, the decision was made to start with an interface library (driver) that allows managing basic DSpace model entities (Community, Collection, Item, etc.) in a Fedora repository. This library will be independent of DSpace itself. -->

<...>

The currently implemented driver is in fact a combined implementation of the DAO (http://wiki.dspace.org/index.php/DAO+Prototype) and BitStore (http://wiki.dspace.org/index.php/PluggableStorage) interfaces. It can be used either as a DAO implementation or as a much simpler standalone BitStore implementation. In fact, FedoraDAOs uses FedoraBitStore directly, bypassing BitstreamStorageManager.


The driver can store and retrieve Bitstreams, while metadata is only stored in Fedora. Relations between Fedora objects are also preserved using RELS-EXT.

...

  • Policy mapping implementation (user management service must be created to associate policy with users and groups?).

...