A first pass at a first sprint for Fedora policy-driven persistence.

Background (assumptions and a priori conclusions):

 

  1. Creating a new Fedora-specific DSL for discussing persistence policy would be expensive and time-consuming and there is no obvious candidate for such a DSL to hand.
    • We should therefore reuse extant Fedora machinery for the purpose as much as possible.

  2. Object relationships (RDF) are the most flexible means of description in Fedora that retain the advantage of simplicity.
    • We should therefore use RDF assertions on resources to indicate persistence policy, at least until we discover a reason to do otherwise.

  3. Creating new objects to represent persistence stores would clutter the relationship graph in Fedora with resources that are not part of intellectual arrangement.
    • Therefore the relationships that declare persistence policy should not require objects that represent storage.

  4. Akubra will remain the underlying storage abstraction for Fedora.

  5. We will want to make it as easy as possible to reuse deployed Akubra machinery.
    • Therefore efforts should include work on Akubra as well as Fedora.

There are some obvious limitations to this approach. For example, it is not possible to distinguish in-line XML datastreams from the objects in which they occur if we accept Akubra's construction of a persisted resource. I do not think that these kinds of limitations justify a move away from Akubra at this early stage.

Success:

We require a definition of success. Here is one:

The machinery created in the first sprint should enable a repository to fulfill the following simple use cases:

  1. Separating objects and datastreams into independent stores.
  2. Separating large resources and small resources, for some simple definition of "large" and "small", into independent stores.
  3. Separating objects and their datastreams into independent stores based on the owner of the object.

For this purpose, it will be necessary to equip Fedora with the means to differentiate resources to Akubra. Much of this work has been accomplished in FCREPO-954, which provided Fedora with the ability to supply storage hints to Akubra. I have since created a simple component that uses this functionality to populate the Akubra hints parameter with RDF relationships. See here. Additionally, I have created a simple Akubra implementation that uses the presence or absence of such hints to select from a list of Akubra BlobStores. See here. These components could be a useful starting place, if my initial assumptions and conclusions are found tenable. I will assume for the moment that such will be the case.
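To make the idea of selecting a store by the presence or absence of hints concrete, here is a minimal sketch. It is not the component referred to above and does not use Akubra's own types: the class name `HintPresenceSelector`, the store names, and the hint keys are all hypothetical, and stores are represented by plain strings so that the example stands alone.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: pick a named store from the first rule whose hint key is present. */
class HintPresenceSelector {

    // Hint key -> store name. Insertion order doubles as rule priority.
    private final Map<String, String> storeByHintKey = new LinkedHashMap<>();
    private final String defaultStore;

    HintPresenceSelector(String defaultStore) {
        this.defaultStore = defaultStore;
    }

    /** Registers a rule: if the given hint key is present, use the given store. */
    HintPresenceSelector rule(String hintKey, String storeName) {
        storeByHintKey.put(hintKey, storeName);
        return this;
    }

    /** Returns the store for the first matching rule, or the default when none match. */
    String select(Map<String, String> hints) {
        if (hints != null) {
            for (Map.Entry<String, String> rule : storeByHintKey.entrySet()) {
                if (hints.containsKey(rule.getKey())) {
                    return rule.getValue();
                }
            }
        }
        return defaultStore;
    }
}
```

A real implementation would presumably return an Akubra BlobStore rather than a name, and the rules would come from Spring configuration rather than from code.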

Work:

In order to fulfill the three use cases I've offered, the following work could then be accomplished:

  1. My code must be extended to consider the possibility that a resource's persistence-governing metadata will change over time.
    1. This implies work in Fedora (primarily in org.fcrepo.server.storage) and in Akubra. One architectural difficulty arises in a two-fold asymmetry:
      1. Fedora does not (by design) inform Akubra as to the purpose of a persistence action. Is it creating a new resource? Replacing an old resource? Creating a new storage unit that is semantically identical to an old one? Akubra doesn't know.
      2. Akubra does not inform Fedora about how it does what it does. If an object's persistence-governing metadata changes in such a way that Akubra will store it differently, Fedora will not know.
    2. This creates difficulties for efficient design. For example, a naive implementation from the side of Fedora would delete and recreate a resource each time its persistence-governing metadata changes. But that may not be necessary, and it may be very expensive.
    3. The naive way to avoid this problem is for the storage subsystem to be represented in Fedora in some way. This goes against my 3rd assumption and would make it more difficult to start from my work, but it may turn out to be the best way. This deserves careful thought, because it is a serious architectural question.
    4. It may be best to rearrange the contract between Akubra and Fedora to permit Akubra to notify Fedora about its activities.
  2. For the first use case, an example Akubra Spring configuration must be created. A sample is found here. Because RDF information alone is sufficient to distinguish Fedora objects from datastreams, nothing further would be required that is specific to this case.
  3. For the second use case:
    1. Some reasonable definition of "small" and "large" must be agreed on.
    2. Either fresh code must be written or my Fedora-side code must be extended to include not only relationship information in the Akubra hints but resource property information, most importantly datastream size.
    3. Either fresh code must be written or my Akubra-side code must be extended to differentiate stores not only based on the presence of hints, but on their comparable values. I think it possible that we may end up in the long run using reflection on the type of hints to accomplish this flexibly. See appendix below for a criticism of the Akubra hint mechanism in this regard and a solution.
    4. An example Akubra Spring configuration must be created to demonstrate the new technique.
  4. For the third use case:
    1. The changes in 3.2 and 3.3 will be required.
    2. In this case, the Akubra-side machinery must be made able to differentiate based on the lexical value of hints. One question that arises is whether we intend to provide machinery to enumerate these lexical values, or whether they will remain uncontrolled. The first option is more cumbersome but offers some performance advantage.
    3. An example Akubra Spring configuration must be created to demonstrate the new technique.
  5. Extensive testing must occur.
    1. At the very least, we should be able to demonstrate that the new machinery can provide performance comparable to current Fedora persistence, which simply differentiates objects and datastreams by declaring a separate store for each.
    2. Being able to demonstrate that Fedora could now differentiate storage based on resource-size or ownership would itself be a tout-able accomplishment.
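The work items above for the second and third use cases, along with the concern about needlessly deleting and recreating resources, can be sketched as follows. This is only an illustration under stated assumptions: `HintValueSelector`, the hint keys `size` and `owner`, and the store names are hypothetical, stores are again plain strings, and hints keep the current Map<String,String> type, so sizes must be parsed from strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical sketch: route a resource to a named store by the values of its hints. */
class HintValueSelector {

    /** One routing rule: a test over the hints and the store chosen when it passes. */
    private static final class Rule {
        final Predicate<Map<String, String>> test;
        final String store;

        Rule(Predicate<Map<String, String>> test, String store) {
            this.test = test;
            this.store = store;
        }
    }

    private final List<Rule> rules = new ArrayList<>();
    private final String defaultStore;

    HintValueSelector(String defaultStore) {
        this.defaultStore = defaultStore;
    }

    /** Rules are tried in the order they were added; the first match wins. */
    HintValueSelector rule(Predicate<Map<String, String>> test, String store) {
        rules.add(new Rule(test, store));
        return this;
    }

    String select(Map<String, String> hints) {
        for (Rule r : rules) {
            if (r.test.test(hints)) {
                return r.store;
            }
        }
        return defaultStore;
    }

    /** True only when a metadata change would land the resource in a different store. */
    boolean needsMove(Map<String, String> oldHints, Map<String, String> newHints) {
        return !select(oldHints).equals(select(newHints));
    }

    /** Comparable value: matches when the hint parses as a number at or above the threshold. */
    static Predicate<Map<String, String>> atLeast(String key, long threshold) {
        return hints -> {
            String raw = hints.get(key);
            if (raw == null) {
                return false;
            }
            try {
                return Long.parseLong(raw) >= threshold;
            } catch (NumberFormatException e) {
                return false; // a malformed size never matches
            }
        };
    }

    /** Lexical value: matches when the hint equals the expected string exactly. */
    static Predicate<Map<String, String>> is(String key, String expected) {
        return hints -> expected.equals(hints.get(key));
    }
}
```

The `needsMove` check shows one way the delete-and-recreate cost might be avoided: a resource is moved only when the old and new hints select different stores. Whether that comparison belongs on the Fedora side or the Akubra side is exactly the open architectural question raised in item 1.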

Appendix:

The problem of Akubra's hint type.

The current type of Akubra hints is Map<String,String>. I think this is unnecessarily restrictive. For example, in my above-mentioned work, I was forced to pack an RDF predicate and object into a single string, which isn't onerous but is inelegant. If we intend to allow distinctions between comparable values or enumerables via Akubra's current hints, we will find ourselves doing a lot of packing and unpacking. I suggest that we widen the type of Akubra's hints to Map<Object,Object>. Unfortunately, because Java is a bit thick about type variance (a Map<String,String> cannot be passed where a Map<Object,Object> is expected), this breaks client code, such as Fedora, which must then be altered to fit. I have done this in:

https://github.com/akubra/akubra/tree/widening-hints

and

https://github.com/fcrepo/fcrepo/tree/akubra-with-widened-hints
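The difference between the two hint types can be illustrated with a short sketch. It does not reproduce the code in the branches above: the class `WidenedHintsSketch`, the `Relationship` type, and the choice of a single space as the packing separator are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch contrasting string-packed hints with widened, typed hints. */
class WidenedHintsSketch {

    /** A typed hint value, usable only if hint values may be arbitrary objects. */
    static final class Relationship {
        final String predicate;
        final String object;

        Relationship(String predicate, String object) {
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** Narrow hints: predicate and object share one string (space separator is an assumption). */
    static String pack(String predicate, String object) {
        return predicate + " " + object;
    }

    /** Splits on the first space only, so an object containing spaces survives intact. */
    static Relationship unpack(String packed) {
        String[] parts = packed.split(" ", 2);
        return new Relationship(parts[0], parts.length > 1 ? parts[1] : "");
    }

    /**
     * Java generics are invariant, so a Map<String,String> cannot be passed
     * where a Map<Object,Object> is expected; callers must copy or be rewritten.
     */
    static Map<Object, Object> widen(Map<String, String> narrow) {
        return new HashMap<Object, Object>(narrow);
    }

    /** Widened hints: a size can be compared as a number, with no string parsing. */
    static boolean isLarge(Map<Object, Object> hints, Object sizeKey, long threshold) {
        Object value = hints.get(sizeKey);
        return value instanceof Long && ((Long) value).longValue() >= threshold;
    }
}
```

With widened hints, a consumer can dispatch on the runtime type of each value (as `isLarge` does with `instanceof`), which is the kind of reflection on hint types suggested in item 3.3.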

 
