Stakeholders

Sprints

Sprint 1

Sprint 3

Sprint 4

Use cases

  1. Transfer between Fedora and external preservation systems, such as APTrust, MetaArchive, LOCKSS, DPN, Archivematica, etc.

  2. Package [Export] the content of a single Fedora container and all its descendant resources

  3. Transfer between Fedora instances or (more generally) from Fedora to an LDP archive

  4. Load [Import] the contents of a package into a specified container

  5. Round-tripping resources in Fedora in support of backup/restore

    1. A start has been made on this in FCREPO-1990

    2. The implementation referenced in the above ticket is not dead, though not actively being worked on at the moment; pull requests welcomed (though others may well wish to take it in a different direction).

    3. A rebuilder that:

      1. Is not solely dependent on an intact backup of the repository index

      2. Works off shredded serializations that can be supported with file preservation techniques

      3. Can recover as much as possible of a repository in the face of integrity issues (supports partial recovery)

      4. Supports gathering copies of the shreds (serializations) from multiple sources to recover a repository

  6. Round-tripping resources in Fedora in support of Fedora repository version upgrades

  7. Batch loading arbitrary sets of resources from metadata spreadsheet and binaries (may well be difficult – or not worth it – to try to generalize such a feature).

  8. Import or export containers or binaries using add, overwrite, or delete operations. Configure the data model and the source and the target for each resource that will be updated. Allow target containers to be non-empty before import and source containers to be non-empty after export. Maintain ordering, etc. Support versioning. Examples: add issues to a publication; add fragments to a manuscript; add data sets to a longitudinal study; add time-series images from telescopes; remove resources determined to be under copyright; release resources after restrictions on access have expired.

    1. Perform multiple metadata-only exports, and then restore an earlier version from an export.

Use cases yet to be rolled into requirements

  1. Import objects from an external system (such as Figshare, where a research data object might be prepared) into a Fedora preservation repository with either Hydra or Islandora on top. (Implies compliance with Hydra and/or Islandora object models)

  2. To migrate from internal content to external content, export metadata only and then import it into another repository.  The links to the new external content locations would be added afterwards.

Requirements

External Systems

  1.   PHASE 2 Support import from and export to a TBD list of external systems.

    1. APTrust - University of Maryland (Joshua Westgard)

    2. Archivematica - Artefactual Systems (Justin Simpson)

    3. MetaArchive - Penn State (Ben Goldman)

    4. Perseids - Tufts (Bridget Almas)

General

  1. PHASE 1 Support transacting in RDF

  2. PHASE 1  Support allowing the option to include Binaries

  3. PHASE 1  Support references from exported resources to other exported resources

  4. PHASE 2 Support transacting in BagIt bags

  5. PHASE 1  Support import into a non-existing Fedora container

  6. PHASE 2 Support import into an existing, empty Fedora container

  7. PHASE 3 Support import into an existing, non-empty Fedora container with various policies: add, overwrite, delete, version, skip

  8. PHASE 3 Support export of resource versions

  9. PHASE 3 Support import of resource versions

  10. PHASE 1  Support export of resource and its "members" based on the ldp:contains predicate

  11. PHASE 2 Support export of resource and its "members" based on a user-provided membership predicate

  12. Support recursive RDF insert/updates with LDP Indirect Container specified POST (and PUT / PATCH?) (ref: FCREPO-2042)
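
Requirements 10 and 11 both amount to a recursive walk over a membership predicate. The following is a minimal sketch of that walk, assuming the relevant triples are already available in memory as (subject, predicate, object) tuples. The function name `collect_members` and the sample URIs are hypothetical; a real tool would retrieve each resource through the REST interface instead.

```python
LDP_CONTAINS = "http://www.w3.org/ns/ldp#contains"

def collect_members(triples, root, predicate=LDP_CONTAINS):
    """Return root plus every resource reachable via `predicate`, depth-first."""
    children = {}
    for s, p, o in triples:
        if p == predicate:
            children.setdefault(s, []).append(o)
    ordered, seen, stack = [], set(), [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # guards against membership cycles
        seen.add(node)
        ordered.append(node)
        # Reversed so that children are visited in their original order.
        stack.extend(reversed(children.get(node, [])))
    return ordered

# Hypothetical sample data.
triples = [
    ("/rest/col", LDP_CONTAINS, "/rest/col/a"),
    ("/rest/col", LDP_CONTAINS, "/rest/col/b"),
    ("/rest/col/a", LDP_CONTAINS, "/rest/col/a/file"),
    ("/rest/col", "http://example.org/hasMember", "/rest/other"),
]
export_set = collect_members(triples, "/rest/col")
custom_set = collect_members(triples, "/rest/col", "http://example.org/hasMember")
```

Passing a different `predicate` is the only change needed to move from the ldp:contains case to a user-provided membership predicate.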

Round-tripping

Defined as: exporting all or a subset of a Fedora repository and importing the export artifacts into a Fedora repository.

  1. PHASE 3 Support preservation of dates during round-tripping 

  2. PHASE 3 Support preservation of version snapshots during round-tripping 

  3. PHASE 1  The URIs of the round-tripped resources must be the same as the original URIs

  4. PHASE 3 Support lossless round-tripping (i.e., if you export a resource, delete that resource, and then import it, the result is no different from never having performed any of those operations).

BagIt

  1. PHASE 2 Single resource bags

  2. PHASE 2 The structure and scope of accepted and produced BagIt bags must be configurable (resource)

    1. Clarification: structure relates to required and optional tagfiles in the bag

    2. Clarification: scope relates to contents of the bag, e.g. single object or object and all members based on specific membership predicate

  3. PHASE 3 Multi-resource bags

  4. PHASE 3 Unambiguously support linking between resources within a bag, and from resources in the bag to resources outside the bag

    1. e.g. for bagged resources A and B, if A contains statement <A> myns:rel <B>, then it is unambiguous that B is a resource in the bag.  Suppose some archive ingests the bag and exposes its contents as web resources with URIs P and Q. If the archive preserves intra-bag links, resource P will have statement <P> myns:rel <Q>.  Likewise, if A contains external link <A> myns:rel2 <http://example.org/outside/the/bag>, then an archive that preserves links will have <P> myns:rel2 <http://example.org/outside/the/bag>
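
The link-preservation example above can be sketched as a simple URI substitution. This assumes the archive keeps a map from bag-internal identifiers to the URIs it assigns; `rewrite_links` and `uri_map` are illustrative names, not part of any existing tool.

```python
def rewrite_links(triples, uri_map):
    """Replace bag-internal URIs with archive URIs; leave all other terms untouched."""
    return [tuple(uri_map.get(term, term) for term in triple) for triple in triples]

# Resources A and B are in the bag; the archive exposes them as P and Q.
bag_triples = [
    ("A", "myns:rel", "B"),
    ("A", "myns:rel2", "http://example.org/outside/the/bag"),
]
uri_map = {"A": "P", "B": "Q"}
archived = rewrite_links(bag_triples, uri_map)
```

The intra-bag link becomes <P> myns:rel <Q>, while the external link keeps its original object because that URI is not in the map.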

Verification Tool

  1. PHASE 2 Verify same number of resources on disk as in fcrepo

  2. PHASE 2 Verify same number of resources in fcrepo as on disk

  3. PHASE 2 Verify same checksum for binaries

  4. PHASE 2 Verify same triples for containers

  5. PHASE 2 Record which resources have been verified (Include checksum for binary resources)

  6. PHASE 2 Verify subset of repository resources

  7. PHASE 3 Verify fcrepo to fcrepo

  8. PHASE 3 Verify disk to disk

  9. PHASE 3 Use generated config file as sole input
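
The count and checksum checks in items 1 to 3 could be sketched roughly as follows, assuming both sides have already been reduced to a map from resource URI to checksum. The function `verify`, the report keys, and the sample URIs are hypothetical, and SHA-1 is used only as an example digest.

```python
import hashlib

def sha1_of(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

def verify(exported, repository):
    """Compare two {uri: checksum} maps and report differences in both directions."""
    return {
        "missing_from_repo": sorted(set(exported) - set(repository)),
        "missing_from_disk": sorted(set(repository) - set(exported)),
        "checksum_mismatch": sorted(
            uri for uri in set(exported) & set(repository)
            if exported[uri] != repository[uri]
        ),
    }

# Hypothetical sample data: one altered binary and one resource never exported.
on_disk = {"/rest/a": sha1_of(b"alpha"), "/rest/b": sha1_of(b"beta")}
in_repo = {
    "/rest/a": sha1_of(b"alpha"),
    "/rest/b": sha1_of(b"BETA"),
    "/rest/c": sha1_of(b"gamma"),
}
report = verify(on_disk, in_repo)
```

Checking both directions separately mirrors items 1 and 2: a resource can be missing on either side, and equal counts alone would not reveal that.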

Considerations

  • Import/export performance should be as good as is possible under the assumption that this work is done via the REST interface

Resources

Meetings


28 Comments

  1. I just want to note that the "Requirements" section above contains features that have been requested in the past, but deemed unworkable or undesirable.  In particular, proposals to allow including triples about multiple subjects in a single request, and allowing users to alter system dates have both been rejected before.

  2. I am trying to track down the history of automatic skolemization of blank nodes which may be relevant here.  It was discussed in an email thread here and also more recently in the 2016-04-14 tech meeting here.  This is an important issue for my use case which involves creating JSON-LD lists from LDP containers that have been indexed into a triplestore. I have just found that RDF lists with skolemized ids where bnodes should be cannot be converted into JSON-LD lists with the jsonld.fromRDF method because this method identifies the first node (i.e. "the head") in the list with one criterion only: it does not have the index '_:', meaning it is not a bnode.  So, maintaining bnodes in FCREPO RDF serialization seems to be important for JSON-LD's understanding of how an RDF collection should exist.  

    Please excuse my ignorance of the history of RDF collection and bnode implementation in FCREPO.  I welcome any links to the current status of this discussion.  

    1. We have been trying to summarize the state of Fedora's relationship with bnodes, but so far have only gotten as far as:

    2. There is absolutely nothing in the notion of RDF collections that has any reference to blank nodes. That seems to be a bug in the JSON-LD library you are using.

  3. "Absolutely nothing" may be a bit strong since the link you provided says the RDF Collection vocabulary " is intended for use typically in a context where a container is described using blank nodes to connect a 'well-formed' sequence of items".  The method in the JSON-LD library is a brilliant piece of code, imho, and the way that it finds the beginning of a list is valid, though it is limited in scope.  The substantive problem seems to be in the RDF Collection vocabulary which while it provides an end terminator with rdf:nil, does not provide a decidable beginning terminator since rdf:first is applied recursively.  The algorithmic problem is how to find the head of a list with a reverse iteration if there is no terminator for it?

    1. No, "typically" is not language that enables a library to rely on that assumption. The parser should be able to understand concrete nodes in those positions correctly.
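
For reference, one way to locate the head of a list without relying on the node type is to look for a node that carries rdf:first but is never the object of rdf:rest. This is only a sketch of that idea over plain tuples; it does not describe how jsonld.fromRDF works, the example IRIs are made up, and it assumes a well-formed list terminated by rdf:nil.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FIRST, REST, NIL = RDF + "first", RDF + "rest", RDF + "nil"

def list_heads(triples):
    """Nodes that start a list: they have rdf:first but are no node's rdf:rest."""
    cells = {s for s, p, _ in triples if p == FIRST}
    tails = {o for _, p, o in triples if p == REST}
    return sorted(cells - tails)

def list_items(triples, head):
    """Walk forward from head to rdf:nil (assumes a well-formed, acyclic list)."""
    first = {s: o for s, p, o in triples if p == FIRST}
    rest = {s: o for s, p, o in triples if p == REST}
    items, node = [], head
    while node != NIL:
        items.append(first[node])
        node = rest[node]
    return items

# A two-element list whose cells are concrete (hash) IRIs rather than bnodes.
triples = [
    ("http://example.org/doc#c1", FIRST, "x"),
    ("http://example.org/doc#c1", REST, "http://example.org/doc#c2"),
    ("http://example.org/doc#c2", FIRST, "y"),
    ("http://example.org/doc#c2", REST, NIL),
]
heads = list_heads(triples)
items = list_items(triples, heads[0])
```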

  4. Christopher,

    I'm totally sympathetic and also love JSON-LD. I was in the same boat as you and very frustrated. Luckily A. Soroka talked me down from the ledge and I've gone on with life using hash-URIs for ordered lists instead.

    My understanding is that JSON-LD's lists, which translate down to RDF lists, are problematic because they rely on blank nodes, and those have no defined lifespan beyond the post in which they are made. I guess they'd need some complicated garbage collector to be implemented in fedora, and that's no fun for anybody.

    I've been really meaning to write down some different ways to represent ordered lists, using hash-URIs on the page Andrew pointed to. They're not quite as pretty as lists but not nearly as ugly as I'd feared. I'll make a point to get to that soon, and my apologies for taking a long time.

    1. I agree that bnodes should not be implemented directly in an LDP repository and representing them as skolem URIs is correct. In the RDF 1.1 spec on "Replacing Blank Nodes with IRIs", a suggestion is made regarding minting skolem IRIs with a reference back to their origin as blank nodes.  Making an (import restricted) bnode reference "list type" skolem could then be useful when indexing collections to a triplestore or exporting/serializing as RDF, where a "round-trip" would (somehow) map to an arbitrary blank node reference.  A bnode in a list functions just as a "meaningless" placeholder, and as such is not really metadata, and thus has no need for a lifecycle.  This seems one reasonable way forward, because I do not think that avoiding the relevance of blank nodes for major implementations such as RDF collections is rational.

      1. I disagree violently with the characterization of RDF collections as a "major implementation". Otherwise, I invite you to be a little more specific than the term "(somehow)" you use above in referring to how this idea would work. Ideally, as specific as a code contribution.

         

    2. Just to be very clear (this point seems to be getting lost) RDF lists do not rely on blank nodes. Blank nodes are certainly the most common implementation pattern for them, but there is no requirement to use blank nodes.

      Otherwise, Martin Haye is quite right to describe the implementation details of blank nodes as complicated. That is actually rather an understatement. The tech team discussed this extensively, investigative work (referred to on-list) was undertaken, and the effort was abandoned. I cannot see what has changed in any way to make any difference at this point in time.

      1. I'd like to second what A. Soroka said here.  I have long been a proponent of providing at least some support for blank nodes in F4, and have participated in the discussions and implementation over the last few years.  The current implementation skolemizes blank nodes on input, and also provides some support for hash URIs (which in the current implementation are stored as child nodes of the main resource).  I believe this functionality supports all of the documented use cases, including ingesting metadata (like rdf:List, or JSON-LD with lists, or MADS) that typically use blank nodes, and stores them in the repository in an intelligible way.  It's true that when the RDF is retrieved that it no longer has blank nodes, but I would point you in the direction of API-X if you wanted to change that behavior.

        There have been a few attempts to add additional functionality in this area, but none of them has been carried to completion.  IMHO, there is a pretty high bar for suggesting changes in this area, and a general skepticism that additional functionality can — or should — be implemented.

        That said, I'm happy to help anyone figure out how to use the current functionality, and to improve the documentation if it's not clear.

      2. I think I may be able to offer a code contribution towards a solution to the general problem of "Exporting" Skolem IRIs.   This still requires some research on my part into the specifics of your implementation.  I  see several general work areas in my cursory evaluation, and perhaps we can move this technical discussion to another forum?  What I would like to do is basic in principle.  I think that it is possible to do this in the context of fcrepo-exts rather than the kernel, which perhaps makes it a bit less difficult.  

        First, is it accurate to assume that all nodes with the type fedora:Skolem originated as blank nodes?  If so, then it is a question of identifying and transforming the subjects of this resource type for the indexers with some extension to fcrepo-transform, and using a "bnode generator" in a transform configuration.  The bnode generator could maybe extract/transform the genid to ensure some consistency with the skolem node.   Note: the triplestore indexer does not currently use a transform configuration, but the solr indexer does, so fcrepo-indexing-triplestore would need to be touched as well.   Just some preliminary thoughts, here.

        I have to assert again that, unfortunately, the current concrete skolem node functionality does not support my "Export" use case.  I have no problem with the way it works for importing.   This is not because it is not possible to parse RDF Collections constructed with concrete nodes, but because a list constructed with bnodes is JSON-LD's (justifiable?) current expectation.  I should also add that their normalization method "_listToRDF" also generates bnodes for all JSON-LD "@lists".   The choice seems to be to try to change JSON-LD expectations or to try to match them.  Since I do not have a proposal to counter their assumptions, the latter seems to be more doable.  Furthermore, the concept of the "@list" is a major part of JSON-LD as all arrays are equivalent to RDF collections when canonicalized.  This seems to elevate the importance of the rather arcane use of RDF collections now.    
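
The "bnode generator" idea could look roughly like the sketch below. It assumes skolem IRIs follow the `/.well-known/genid/` path pattern recommended by RDF 1.1, which may not match what a given repository actually mints; `to_bnode` and `deskolemize` are hypothetical names, and nothing here uses fcrepo-transform.

```python
GENID = "/.well-known/genid/"  # assumption: the RDF 1.1 recommended skolem path

def to_bnode(term):
    """Map a skolem IRI to a blank node label; return any other term unchanged."""
    if GENID in term:
        # The same suffix always yields the same label, so links between
        # list cells survive the transformation. The suffix is assumed to be
        # a valid blank node label.
        return "_:" + term.rsplit(GENID, 1)[1]
    return term

def deskolemize(triples):
    # Predicates pass through the same function; blank nodes cannot be
    # predicates, so a skolem IRI is not expected in that position.
    return [tuple(to_bnode(t) for t in triple) for triple in triples]

REST = "http://www.w3.org/1999/02/22-rdf-syntax-ns#rest"
skolemized = [
    ("http://localhost:8080/rest/.well-known/genid/ab12", REST,
     "http://localhost:8080/rest/.well-known/genid/cd34"),
    ("http://localhost:8080/rest/obj", "http://example.org/items",
     "http://localhost:8080/rest/.well-known/genid/ab12"),
]
exported = deskolemize(skolemized)
```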

        1. Thank you for being ready to step up with a code contribution. That is decidedly admirable. You are certainly right that this is not the right place to discuss such matters in detail. That place is the tech mailing list: fedora-tech@googlegroups.com. Please introduce this discussion there-- I know we'll get a much broader range of input and helpful advice.

  5. As promised, I have documented a variety of different ways to represent ordering using hash-URIs (and not blank nodes). I'm fairly new to this game so an extra pair of eyes wouldn't hurt, looking for rookie errors on my part: Ordering

    This documentation is totally unrelated to the effort to improve skolemization of blank nodes, which sounds good to me at least in theory.

  6. We've done some work with packaging linked data in bags at JHU, which could possibly be of interest to this effort.  The driving use cases are a little bit different (i.e. focusing on tools that generate bags of linked data that can be subsequently ingested into a Fedora instance), but it could be useful to look at.

    Here's a description of our particular approach, and a sample bag generated by a tool which packages up content on one's file system, and allows editing of metadata.  A service then ingests the custodial content of the bag as Fedora resources.

    Some points of interest of the bag content:

    • We ultimately ended up using a "bag://" URI scheme to refer to resources within a bag; this has the consequence of requiring the consumer of the bag to translate these URIs into something resolvable, if the resources are to have a life outside of the bag.
    • The sample bag has all LDP relationships in a manifest external to /data.  The domain model being used in this particular package (and the tool that produced the package) were not based on LDP.  The LDP relationships in this case are advisory/supplemental, and not part of the custodial content of the bag; they may be reasonably ignored by a non-LDP consumer of the bag.  Use cases for this pattern are probably out of scope for Fedora import/export.
    • The organization of the resources in the bag (e.g. /data/obj, /data/bin) is an implementation detail of the tool that created the bag;  the specification (and the service that ingests the content into Fedora) does not ascribe any specific meaning to the manner in which resources are organized within /data.

    Notes on ingesting the bags

    • The order of ingest turned out to be significant; before ingesting any content an "LDP dependency order" needs to be determined.  This is to prevent children from being ingested before parents, or descriptions of binaries ingested before binaries. 
    • Unlike Fedora 3, Fedora 4 (as a consequence of its JCR implementation) has a notion of referential integrity - so if one object links to another, the linked object must exist.  To get around this (and also, incidentally, to allow bidirectional relationships parent->child; child->parent), we leveraged the fact that bag URIs are opaque.  Our ingest service performs two passes.  The first pass deposits resources as-is in LDP dependency order (bag URIs and all) maintaining a map from bag URI to Fedora resource URI in the process.  The second pass updates the objects to replace bag URIs with Fedora resource URIs.  Both passes occur within a single transaction.
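
The two-pass ingest described above might be sketched as follows. This is not the JHU service's code: the in-memory `store` stands in for the repository, `mint_uri` stands in for whatever URI the repository assigns on deposit, the sample data is made up, and the single transaction and the handling of binary descriptions are omitted.

```python
def dependency_order(parent_of):
    """Order bag URIs so that every parent precedes its children (assumes a tree)."""
    ordered, placed = [], set()

    def place(uri):
        if uri in placed:
            return
        parent = parent_of.get(uri)
        if parent is not None:
            place(parent)
        placed.add(uri)
        ordered.append(uri)

    for uri in parent_of:
        place(uri)
    return ordered

def ingest(resources, parent_of, mint_uri):
    """resources: {bag_uri: [(s, p, o), ...]}. Returns {repository_uri: triples}."""
    uri_map, store = {}, {}
    # Pass 1: deposit as-is in dependency order, recording bag URI -> repository URI.
    for bag_uri in dependency_order(parent_of):
        uri_map[bag_uri] = mint_uri(bag_uri)
        store[uri_map[bag_uri]] = list(resources[bag_uri])
    # Pass 2: replace the opaque bag URIs with the repository URIs.
    for repo_uri in list(store):
        store[repo_uri] = [
            tuple(uri_map.get(term, term) for term in triple)
            for triple in store[repo_uri]
        ]
    return store

# Hypothetical bag with a bidirectional parent/child relationship.
resources = {
    "bag://b/child": [("bag://b/child", "ex:partOf", "bag://b/parent")],
    "bag://b/parent": [("bag://b/parent", "ex:hasPart", "bag://b/child")],
}
parent_of = {"bag://b/child": "bag://b/parent", "bag://b/parent": None}
base = "http://localhost:8080/rest/"
store = ingest(resources, parent_of, lambda u: base + u.rsplit("/", 1)[1])
```

Because the bag URIs are opaque during the first pass, the child can point at the parent and the parent at the child without either reference needing to resolve until the second pass.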


  7. A. Soroka mentioned the semantic content packages draft in IRC today and I thought it worth adding to the Resources list above: https://www.ietf.org/archive/id/draft-wilper-semantic-content-pkgs-00.txt

  8. for bagged resources A and B, if A contains statement <A> myns:rel <B>, then it is unambiguous that B is a resource in the bag.

    It's not clear to me how this situation arises if we have a one-to-one bag-to-resource match.

    1. Hi Adam, I'm not sure I understand what you mean by "one-to-one bag-to-resource match".

      In this case <A> and <B> are resources within the same bag.  So the uri for <B> would resolve a resource within that bag.  And if resources <A> and <B> were deposited into fedora, then (ideally) the relationship as it was inside the bag (<A> myns:rel <B>) would be maintained - (perhaps with new URIs as assigned by fedora) as <P> myns:rel <Q>.

       

      We opted to define a new uri scheme (bag://), and used it when referencing resources within a bag.

      1. Elliot Metsger, we must be missing a beat between us, because the condition "one-to-one bag-to-resource match" seems clear to me. It implies that for each resource (for our instant purposes, each thing addressed by a repository URI) there is written one unique bag, and in each bag, just one serialized resource. It follows that there is just one thing serialized in a bag that can be addressed by (one) URI and the idea of "bagged resources <A> and <B>, both in one bag", obtains no purchase in our designs. I have been devising with this assumption; have you not?

        1. A. Soroka, many previous conversations have considered "bagging" multiple repository resources in a single bag, such as in the case of "a collection and all of its members".

          1. That's definitely not how I understood that case. I think we need to clarify this in conversation, because allowing multiple-resource bags introduces a massive new front of complexity, as evidenced by that proposed requirement's wording.

            1. I am definitely interested in multiple-resource bags.  I agree that it adds complexity, but many, many repository use cases involve multiple linked Fedora resources that comprise a single conceptual resource.

              I think it will be very important to narrow the scope of the initial work as much as possible, so I'd be fine with multi-resource bags being deferred to a later phase.  But I think it will be an essential feature for many users, including Princeton.

              1. I'm not really even against the assumption-- I just want to get it clear. But now that you say it, I would be happier with a later phase to include multiresource bags. It's a significant complication and I would prefer that we move through this first phase of work to success quickly and smoothly. Then we can build on that.

                1. I think Benjamin Armintor and I might have a path forward with https://github.com/barmintor/bagit-ldp. Maybe this is something we can discuss further, time permitting, on Friday.

                  1. Nick Ruest - I'm not going to be at Friday's meeting.  Even if bags and/or multiple-resources-in-bags becomes out of scope, I'd love to share notes regarding bag profiles (see earlier comment) at some point. 

    2. Yes, this situation only arises if a bag contains more than one resource, and these resources link to one another (e.g. a container and its children).

  9. My apologies if this was already discussed, but are metadata only updates within scope? Thank you.

    1. For the initial round of this effort, we are scoping to import and export of Fedora Resources and optionally associated Binaries.