Stakeholders
- Esmé Cowles
- Benjamin Armintor
- Michael Durbin
- Joshua Westgard
- Youn Noh
- Nick Ruest
- Michael J. Giarlo
- Jon Stroop
- Karen Estlund
- Jim Tuttle
Sprints
Sprint 1
Sprint 3
Sprint 4
Use cases
Transfer between Fedora and external preservation systems, such as APTrust, MetaArchive, LOCKSS, DPN, Archivematica, etc
Package [Export] the content of a single Fedora container and all its descendant resources
Transfer between Fedora instances or (more generally) from Fedora to an LDP archive
Load [Import] the contents of a package into a specified container.
Round-tripping resources in Fedora in support of backup/restore
A start has been made on this in FCREPO-1990;
The implementation referenced in the above ticket is not dead, though not actively being worked on at the moment; pull requests welcomed (though others may well wish to take it in a different direction).
A rebuilder that:
Is not solely dependent on an intact backup of the repository index
Works off shredded serializations that can be supported with file preservation techniques
Can recover as much as possible of a repository in the face of integrity issues (supports partial recovery)
Supports gathering copies of the shreds (serializations) from multiple sources to recover a repository
Round-tripping resources in Fedora in support of Fedora repository version upgrades
Batch loading arbitrary sets of resources from metadata spreadsheet and binaries (may well be difficult – or not worth it – to try to generalize such a feature).
Import or export containers or binaries using add, overwrite, or delete operations. Configure the data model and the source and the target for each resource that will be updated. Allow target containers to be non-empty before import and source containers to be non-empty after export. Maintain ordering, etc. Support versioning.
Examples: add issues to a publication; add fragments to a manuscript; add data sets to a longitudinal study; add time-series images from telescopes; remove resources determined to be under copyright; release resources after restrictions on access have expired.
Perform multiple metadata-only exports, and then restore an earlier version from an export.
Use cases yet to be rolled into requirements
Import objects from an external system (such as Figshare, where a research data object might be prepared) into a Fedora preservation repository with either Hydra or Islandora on top. (Implies compliance with Hydra and/or Islandora object models)
To migrate from internal content to external content, export metadata only and then import it into another repository. The links to the new external content locations would be added afterwards.
Requirements
External Systems
PHASE 2 Support import from and export to a TBD list of external systems.
APTrust - University of Maryland (Joshua Westgard)
Archivematica - Artefactual Systems (Justin Simpson)
MetaArchive - Penn State (Ben Goldman)
Perseids - Tufts - Bridget Almas
General
PHASE 1 Support transacting in RDF
PHASE 1 Support allowing the option to include Binaries
PHASE 1 Support references from exported resources to other exported resources
PHASE 2 Support transacting in BagIt bags
PHASE 1 Support import into a non-existing Fedora container
PHASE 2 Support import into an existing, empty Fedora container
PHASE 3 Support import into an existing, non-empty Fedora container with various policies: add, overwrite, delete, version, skip
PHASE 3 Support export of resource versions
PHASE 3 Support import of resource versions
PHASE 1 Support export of resource and its "members" based on the ldp:contains predicate
PHASE 2 Support export of resource and its "members" based on a user-provided membership predicate
Support recursive RDF insert/updates with LDP Indirect Container specified POST (and PUT / PATCH?) (ref: FCREPO-2042)
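The two membership requirements above (export based on ldp:contains, then on a user-provided predicate) amount to a graph traversal from the resource being exported. The following is a minimal sketch only, assuming the triples have already been fetched and are held in memory as (subject, predicate, object) string tuples; the function name collect_members and the in-memory representation are hypothetical and not part of any Fedora tooling.

```python
# Sketch only: triples are plain (subject, predicate, object) string tuples.
# collect_members is a hypothetical helper, not an existing Fedora API.
LDP_CONTAINS = "http://www.w3.org/ns/ldp#contains"


def collect_members(triples, root, predicate=LDP_CONTAINS):
    """Return root plus every resource reachable from it via `predicate`.

    Resources are returned in depth-first discovery order, each exactly once,
    so cycles in the membership graph do not cause infinite loops.
    """
    children = {}
    for s, p, o in triples:
        if p == predicate:
            children.setdefault(s, []).append(o)

    ordered, seen, stack = [], set(), [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        ordered.append(node)
        # Reverse so that children are visited in the order they were listed.
        stack.extend(reversed(children.get(node, [])))
    return ordered


if __name__ == "__main__":
    example = [
        ("/rest/col", LDP_CONTAINS, "/rest/col/a"),
        ("/rest/col", LDP_CONTAINS, "/rest/col/b"),
        ("/rest/col/a", LDP_CONTAINS, "/rest/col/a/file"),
    ]
    for uri in collect_members(example, "/rest/col"):
        print(uri)
```

Passing a different predicate (for example a project-specific membership property) changes which resources count as "members" without changing the traversal itself.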
Round-tripping
Defined as: Exporting all or a subset of a Fedora repository and importing the export artifacts into a Fedora repository.
PHASE 3 Support preservation of dates during round-tripping
PHASE 3 Support preservation of version snapshots during round-tripping
PHASE 1 The URIs of the round-tripped resources must be the same as the original URIs
PHASE 3 Support lossless round-tripping (i.e., if you export a resource, delete that resource, and then import it, the result is no different from never having performed any of those operations).
BagIt
PHASE 2 Single resource bags
PHASE 2 The structure and scope of accepted and produced BagIt bags must be configurable (resource)
Clarification: structure relates to required and optional tagfiles in the bag
Clarification: scope relates to contents of the bag, e.g. single object or object and all members based on specific membership predicate
PHASE 3 Multi-resource bags
PHASE 3 Unambiguously support linking between resources within a bag, and from resources in the bag to resources outside the bag
e.g. for bagged resources A and B, if A contains statement <A> myns:rel <B>, then it is unambiguous that B is a resource in the bag. Suppose some archive ingests the bag and exposes its contents as web resources with URIs P and Q. If the archive preserves intra-bag links, resource P will have statement <P> myns:rel <Q>. Likewise, if A contains external link <A> myns:rel2 <http://example.org/outside/the/bag>, then an archive that preserves links will have <P> myns:rel2 <http://example.org/outside/the/bag>
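The link-preservation example above can be sketched as a URI rewrite applied at ingest time. This is an illustration under stated assumptions, not a proposed implementation: triples are plain string tuples, the mapping from bag-local URIs to archive-assigned URIs (here A to P and B to Q) is assumed to be known, and rewrite_links is a hypothetical name. The sketch does not distinguish URIs from literals, which a real tool would need to do.

```python
# Sketch only: rewrite_links is a hypothetical helper and the terms are
# untyped strings, so a literal equal to a mapped URI would also be rewritten.
def rewrite_links(triples, uri_map):
    """Replace bag-local URIs with archive-assigned URIs.

    Any term without an entry in uri_map (such as a link to a resource
    outside the bag) is left untouched.
    """
    return [(uri_map.get(s, s), p, uri_map.get(o, o)) for s, p, o in triples]


bag_triples = [
    ("A", "myns:rel", "B"),
    ("A", "myns:rel2", "http://example.org/outside/the/bag"),
]
assigned = {"A": "P", "B": "Q"}

if __name__ == "__main__":
    for triple in rewrite_links(bag_triples, assigned):
        print(triple)
```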
Verification Tool
PHASE 2 Verify same number of resources on disk as in fcrepo
PHASE 2 Verify same number of resources in fcrepo as on disk
PHASE 2 Verify same checksum for binaries
PHASE 2 Verify same triples for containers
PHASE 2 Record which resources have been verified (Include checksum for binary resources)
PHASE 2 Verify subset of repository resources
PHASE 3 Verify fcrepo to fcrepo
PHASE 3 Verify disk to disk
PHASE 3 Use generated config file as sole input
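The PHASE 2 checks in this list could be combined roughly as follows. This is a sketch under several assumptions: both sides (repository and disk) have already been loaded into dictionaries mapping a resource URI to either ("binary", bytes) or ("container", set of triples); SHA-256 is used purely for illustration, since the list does not name a checksum algorithm; and the verify function and its report layout are hypothetical.

```python
import hashlib


def digest(data):
    """Hex checksum of a binary body (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(data).hexdigest()


def verify(repo, disk, prefix=""):
    """Compare two snapshots and report differences.

    repo and disk map URI -> ("binary", bytes) or ("container", set_of_triples).
    prefix restricts the check to a subset of resources.
    """
    repo_uris = {u for u in repo if u.startswith(prefix)}
    disk_uris = {u for u in disk if u.startswith(prefix)}
    report = {
        "missing_on_disk": sorted(repo_uris - disk_uris),
        "missing_in_repo": sorted(disk_uris - repo_uris),
        "mismatched": [],
        "verified": {},  # URI -> checksum for binaries, None for containers
    }
    for uri in sorted(repo_uris & disk_uris):
        repo_kind, repo_body = repo[uri]
        disk_kind, disk_body = disk[uri]
        if repo_kind != disk_kind:
            report["mismatched"].append(uri)
        elif repo_kind == "binary":
            checksum = digest(repo_body)
            if checksum == digest(disk_body):
                report["verified"][uri] = checksum
            else:
                report["mismatched"].append(uri)
        elif set(repo_body) == set(disk_body):
            report["verified"][uri] = None
        else:
            report["mismatched"].append(uri)
    return report
```

Because both inputs share one shape, the same function would also cover the PHASE 3 "fcrepo to fcrepo" and "disk to disk" cases, provided each side can be loaded into that shape.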
Considerations
Import/export should be as performant as possible, under the assumption that this work is done via the REST interface
Resources
https://www.ietf.org/archive/id/draft-wilper-semantic-content-pkgs-00.txt
http://dataconservancy.github.io/dc-packaging-spec/dc-packaging-spec-1.0.html (explanation below)
https://github.com/acdha/restful-bag-server (a resource-oriented RESTful HTTP API for exchanging bags)
Meetings
28 Comments
Esmé Cowles
I just want to note that the "Requirements" section above contains features that have been requested in the past, but deemed unworkable or undesirable. In particular, proposals to allow including triples about multiple subjects in a single request, and allowing users to alter system dates have both been rejected before.
Christopher Johnson
I am trying to track down the history of automatic skolemization of blank nodes which may be relevant here. It was discussed in an email thread here and also more recently in the 2016-04-14 tech meeting here. This is an important issue for my use case which involves creating JSON-LD lists from LDP containers that have been indexed into a triplestore. I have just found that RDF lists with skolemized ids where bnodes should be cannot be converted into JSON-LD lists with the jsonld.fromRDF method because this method identifies the first node (i.e. "the head") in the list with one criterion only: it does not have the index '_:', meaning it is not a bnode. So, maintaining bnodes in FCREPO RDF serialization seems to be important for JSON-LD's understanding of how an RDF collection should exist.
Please excuse my ignorance of the history of RDF collection and bnode implementation in FCREPO. I welcome any links to the current status of this discussion.
Andrew Woods
We have been trying to summarize the state of Fedora's relationship with bnodes, but so far have only gotten as far as:
A. Soroka
There is absolutely nothing in the notion of RDF collections that has any reference to blank nodes. That seems to be a bug in the JSON-LD library you are using.
Christopher Johnson
"Absolutely nothing" may be a bit strong since the link you provided says the RDF Collection vocabulary " is intended for use typically in a context where a container is described using blank nodes to connect a 'well-formed' sequence of items". The method in the JSON-LD library is a brilliant piece of code, imho, and the way that it finds the beginning of a list is valid, though it is limited in scope. The substantive problem seems to be in the RDF Collection vocabulary which while it provides an end terminator with rdf:nil, does not provide a decidable beginning terminator since rdf:first is applied recursively. The algorithmic problem is how to find the head of a list with a reverse iteration if there is no terminator for it?
A. Soroka
No, "typically" is not language that enables a library to rely on that assumption. The parser should be able to understand concrete nodes in those positions correctly.
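For readers following this exchange, one way to locate the head of an RDF collection without assuming blank nodes is to look for a cell that has an rdf:first but is never the object of an rdf:rest. The sketch below is illustrative only and is not taken from the JSON-LD library or from Fedora; it assumes triples are plain string tuples, and the names list_heads and read_list are hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDF_FIRST = RDF + "first"
RDF_REST = RDF + "rest"
RDF_NIL = RDF + "nil"


def list_heads(triples):
    """Return cells that start a list: they have rdf:first and no incoming rdf:rest."""
    cells = {s for s, p, o in triples if p == RDF_FIRST}
    tails = {o for s, p, o in triples if p == RDF_REST}
    return sorted(cells - tails)


def read_list(triples, head):
    """Walk rdf:first/rdf:rest from head to rdf:nil and return the items in order."""
    first = {s: o for s, p, o in triples if p == RDF_FIRST}
    rest = {s: o for s, p, o in triples if p == RDF_REST}
    items, seen, node = [], set(), head
    while node != RDF_NIL:
        if node in seen or node not in first or node not in rest:
            raise ValueError("malformed list at %s" % node)
        seen.add(node)
        items.append(first[node])
        node = rest[node]
    return items
```

The cells can be blank node labels, hash URIs, or skolem IRIs; the traversal only depends on the rdf:first and rdf:rest links between them.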
Martin Haye
Christopher,
I'm totally sympathetic and also love JSON-LD. I was in the same boat as you and very frustrated. Luckily A. Soroka talked me down from the ledge and I've gone on with life using hash-URIs for ordered lists instead.
My understanding is that JSON-LD's lists, which translate down to RDF lists, are problematic because they rely on blank nodes, and those have no defined lifespan beyond the post in which they are made. I guess they'd need some complicated garbage collector to be implemented in fedora, and that's no fun for anybody.
I've been really meaning to write down some different ways to represent ordered lists, using hash-URIs on the page Andrew pointed to. They're not quite as pretty as lists but not nearly as ugly as I'd feared. I'll make a point to get to that soon, and my apologies for taking a long time.
Christopher Johnson
I agree that bnodes should not be implemented directly in an LDP repository and representing them as skolem URIs is correct. In the RDF 1.1 spec on "Replacing Blank Nodes with IRIs", a suggestion is made regarding minting skolem IRIs with a reference back to their origin as blank nodes. Making an (import restricted) bnode reference "list type" skolem could then be useful when indexing collections to a triplestore or exporting/serializing as RDF, where a "round-trip" would (somehow) map to an arbitrary blank node reference. A bnode in a list functions just as a "meaningless" placeholder, and as such is not really metadata, and thus has no need for a lifecycle. This seems one reasonable way forward, because I do not think that avoiding the relevance of blank nodes for major implementations such as RDF collections is rational.
A. Soroka
I disagree violently with the characterization of RDF collections as a "major implementation". Otherwise, I invite you to be a little more specific than the term "(somehow)" you use above in referring to how this idea would work. Ideally, as specific as a code contribution.
A. Soroka
Just to be very clear (this point seems to be getting lost) RDF lists do not rely on blank nodes. Blank nodes are certainly the most common implementation pattern for them, but there is no requirement to use blank nodes.
Otherwise, Martin Haye is quite right to describe the implementation details of blank nodes as complicated. That is actually rather an understatement. The tech team discussed this extensively, investigative work (referred to on-list) was undertaken, and the effort was abandoned. I cannot see what has changed in any way to make any difference at this point in time.
Esmé Cowles
I'd like to second what A. Soroka said here. I have long been a proponent of providing at least some support for blank nodes in F4, and have participated in the discussions and implementation over the last few years. The current implementation skolemizes blank nodes on input, and also provides some support for hash URIs (which in the current implementation are stored as child nodes of the main resource). I believe this functionality supports all of the documented use cases, including ingesting metadata (like rdf:List, or JSON-LD with lists, or MADS) that typically use blank nodes, and stores them in the repository in an intelligible way. It's true that when the RDF is retrieved it no longer has blank nodes, but I would point you in the direction of API-X if you wanted to change that behavior.
There have been a few attempts to add additional functionality in this area, but none of them has been carried to completion. IMHO, there is a pretty high bar for suggesting changes in this area, and a general skepticism that additional functionality can — or should — be implemented.
That said, I'm happy to help anyone figure out how to use the current functionality, and to improve the documentation if it's not clear.
Christopher Johnson
I think I may be able to offer a code contribution towards a solution to the general problem of "Exporting" Skolem IRIs. This still requires some research on my part into the specifics of your implementation. I see several general work areas in my cursory evaluation, and perhaps we can move this technical discussion to another forum? What I would like to do is basic in principle. I think that it is possible to do this in the context of fcrepo-exts rather than the kernel, which perhaps makes it a bit less difficult.
First, is it accurate to assume that all nodes with the type fedora:Skolem originated as blank nodes? If so, then it is a question of identifying and transforming the subjects of this resource type for the indexers with some extension to fcrepo-transform, and using a "bnode generator" in a transform configuration. The bnode generator could maybe extract/transform the genid to ensure some consistency with the skolem node. Note: the triplestore indexer does not currently use a transform configuration, but the solr indexer does, so fcrepo-indexing-triplestore would need to be touched as well. Just some preliminary thoughts, here.
I have to assert again that, unfortunately, the current concrete skolem node functionality does not support my "Export" use case. I have no problem with the way it works for importing. This is not because it is not possible to parse RDF Collections constructed with concrete nodes, but because a list constructed with bnodes is JSON-LD's (justifiable?) current expectation. I should also add that their normalization method "_listToRDF" also generates bnodes for all JSON-LD "@lists". The choice seems to be either to try to change JSON-LD's expectations or to try to match them. Since I do not have a proposal to counter their assumptions, the latter seems to be more doable. Furthermore, the concept of the "@list" is a major part of JSON-LD as all arrays are equivalent to RDF collections when canonicalized. This seems to elevate the importance of the rather arcane use of RDF collections now.
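The "bnode generator" idea floated in this comment might look roughly like the sketch below. It is only an illustration of the general technique of mapping skolem IRIs back to blank node labels on export. It assumes skolem IRIs can be recognized by the "/.well-known/genid/" path segment recommended in RDF 1.1 Concepts, and that triples are plain string tuples; whether every fedora:Skolem resource actually originated as a blank node remains the open question raised above. The name deskolemize is hypothetical.

```python
# Assumption: skolem IRIs follow the ".well-known/genid" pattern from RDF 1.1.
GENID = "/.well-known/genid/"


def deskolemize(triples):
    """Replace each skolem IRI with a blank node label, consistently per IRI.

    The same IRI always maps to the same label within one call, so links
    between cells (for example rdf:rest chains) survive the transformation.
    """
    labels = {}

    def swap(term):
        if GENID in term:
            if term not in labels:
                labels[term] = "_:b%d" % len(labels)
            return labels[term]
        return term

    return [(swap(s), p, swap(o)) for s, p, o in triples]
```

Labels are only stable within a single export, which matches the point made above that a bnode in a list is a placeholder with no identity of its own.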
A. Soroka
Thank you for being ready to step up with a code contribution. That is decidedly admirable. You are certainly right that this is not the right place to discuss such matters in detail. That place is the tech mailing list: fedora-tech@googlegroups.com. Please introduce this discussion there-- I know we'll get a much broader range of input and helpful advice.
Martin Haye
As promised, I have documented a variety of different ways to represent ordering using hash-URIs (and not blank nodes). I'm fairly new to this game so an extra pair of eyes wouldn't hurt, looking for rookie errors on my part: Ordering
This documentation is totally unrelated to the effort to improve skolemization of blank nodes, which sounds good to me at least in theory.
Aaron Birkland
We've done some work with packaging linked data in bags at JHU, which could possibly be of interest to this effort. The driving use cases are a little bit different (i.e. focusing on tools that generate bags of linked data that can be subsequently ingested into a Fedora instance), but it could be useful to look at.
Here's a description of our particular approach, and a sample bag generated by a tool which packages up content on one's file system, and allows editing of metadata. A service then ingests the custodial content of the bag as Fedora resources.
Some points of interest of the bag content:
Notes on ingesting the bags
Michael J. Giarlo
A. Soroka mentioned the semantic content packages draft in IRC today and I thought it worth adding to the Resources list above: https://www.ietf.org/archive/id/draft-wilper-semantic-content-pkgs-00.txt
A. Soroka
It's not clear to me how this situation arises if we have a one-to-one bag-to-resource match.
Elliot Metsger
Hi Adam, I'm not sure I understand what you mean by "one-to-one bag-to-resource match".
In this case <A> and <B> are resources within the same bag. So the uri for <B> would resolve a resource within that bag. And if resources <A> and <B> were deposited into fedora, then (ideally) the relationship as it was inside the bag (<A> myns:rel <B>) would be maintained - (perhaps with new URIs as assigned by fedora) as <P> myns:rel <Q>.
We opted to define a new uri scheme (bag://), and used it when referencing resources within a bag.
A. Soroka
Elliot Metsger, we must be missing a beat between us, because the condition "one-to-one bag-to-resource match" seems clear to me. It implies that for each resource (for our instant purposes, each thing addressed by a repository URI) there is written one unique bag, and in each bag, just one serialized resource. It follows that there is just one thing serialized in a bag that can be addressed by (one) URI and the idea of "bagged resources <A> and <B>, both in one bag", obtains no purchase in our designs. I have been devising with this assumption; have you not?
Andrew Woods
A. Soroka, many previous conversations have considered "bagging" multiple repository resources in a single bag, such as in the case of "a collection and all of its members".
A. Soroka
That's definitely not how I understood that case. I think we need to clarify this in conversation, because allowing multiple-resource bags introduces a massive new front of complexity, as evidenced by that proposed requirement's wording.
Esmé Cowles
I am definitely interested in multiple-resource bags. I agree that it adds complexity, but many, many repository use cases involve multiple linked Fedora resources that comprise a single conceptual resource.
I think it will be very important to narrow the scope of the initial work as much as possible, so I'd be fine with multi-resource bags being deferred to a later phase. But I think it will be an essential feature for many users, including Princeton.
A. Soroka
I'm not really even against the assumption-- I just want to get it clear. But now that you say it, I would be happier with a later phase to include multiresource bags. It's a significant complication and I would prefer that we move through this first phase of work to success quickly and smoothly. Then we can build on that.
Nick Ruest
I think Benjamin Armintor and I might have a path forward with https://github.com/barmintor/bagit-ldp. Maybe this is something we can discuss further, time permitting, on Friday.
Aaron Birkland
Nick Ruest - I'm not going to be at Friday's meeting. Even if bags and/or multiple-resources-in-bags becomes out of scope, I'd love to share notes regarding bag profiles (see earlier comment) at some point.
Aaron Birkland
Yes, this situation only arises if a bag contains more than one resource, and these resources link to one another (e.g. a container and its children).
Youn Noh
My apologies if this was already discussed, but are metadata-only updates within scope? Thank you.
Andrew Woods
For the initial round of this effort, we are scoping to import and export of Fedora Resources and optionally associated Binaries.