Proposal for support of versioning in the new DSpace architecture

Initial author: JohnMarkOckerbloom

Later contributors:

Proposal: Items can include multiple versions (to represent revisions; i.e. different intellectual content). Bundles and/or Bitstreams can include multiple versions (to represent reformattings; i.e. different representational content).

Rationale: DSpace is designed to support access to intellectual content over a long time period. Often, however, the content changes over time. We already expect that format support, for instance, will shift over time, requiring migrations from old file formats to new ones. In addition, content may be revised as well, in ways large or small (e.g. a pre-peer-review draft is superseded by a post-peer-review draft; an erroneous and potentially dangerous dosage figure in a medical paper is revised downward; a typo is corrected.) We would like to make it easy for users to find the most appropriate version of a particular work, as well as to group and select other versions when available.

Elaboration: The examples above represent two distinct kinds of versioning:
*Reformatting is versioning of representation, but not of semantic content. This kind of versioning therefore happens at the Bundle or Bitstream level.
*Revision is change to intellectual content, large or small. This is semantic versioning, and it happens at the Item level.

Current support: It is possible now to have different versions of an item in a DSpace repository, simply by creating additional Bitstreams and Bundles for reformatted versions, and additional Items for revised versions. However, there is not currently any way to pick out versions, other than convention and perhaps human-readable documentation. This may get confusing for users as multiple versions become more common. There is also no current way to persistently refer to "the latest version of this document", which appears to be commonly desired by users.

Assumptions: In the context of DSpace, a long-term digital repository, we can reasonably assume versions to be linear and infrequent. Versioning is meant for updates of the "same" document, so parallel "versions" of a document, such as a Java version and a C++ version of an object-oriented textbook, would be considered two different Items, not two versions of the same Item. Rapid and parallel versioning, as often occurs in software development, is considered to be outside DSpace's scope and is therefore not necessary to support. (This also means that we don't have to use systems like CVS or Subversion to manage multiple versions.) Even without rapid versioning, though, one can still accumulate a significant number of versions of an Item over time (e.g. the half-dozen or so versions of the JSR-170 spec that are now downloadable), making "ad-hoc" versioning by creating separate Items awkward, even if we add the ability to put relationship links in the metadata.

Comparison with related work: epress and Fedora both support versioning. They don't have distinct "reformatting" and "revision" versioning, though; instead epress mostly focuses on "revision" (they haven't had to deal with format migration yet) and Fedora's datastream/disseminator-based versioning corresponds roughly to the Bundle/Bitstream level versioning recommended here for reformatting. Other examples of versioned content systems include VMS (where files can optionally be specified by a version number, with the latest as the default) and Project Gutenberg (which has always incorporated multiple versions, but has in recent years shifted its identifier emphasis from version-specific identifiers to item-specific identifiers that potentially give access to past and present versions).

Usage notes:

In cases where a version is not specified, default should be the latest version within the appropriate scope. (I.e. default Item version is latest one; default version of a Bundle or Bitstream would be the latest in the specified Item version if any; otherwise the latest in the latest Item).
We probably would want some way to externally refer to a particular version of an item or bundle/bitstream.
Versions may be suppressable or purgeable if desired.
Implementing versions as described above does not prevent a repository (or part of a repository) from choosing to continue to model revisions as new Items instead of new versions of an Item (particularly if we support relational metadata). It would, however, be least confusing for revision handling to be consistent across any given repository (or at least, across any given Community).
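
The defaulting rule in the first usage note can be sketched in a few lines. This is only an illustration under assumed names: `Item`, `ItemVersion`, `resolve_item_version` and `resolve_bitstream` are hypothetical and are not part of any existing DSpace API, and version numbers are assumed to start at 1.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ItemVersion:
    """One revision of an Item (hypothetical model, not a DSpace class)."""
    # Maps a bitstream name to its reformattings, oldest first.
    bitstreams: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class Item:
    versions: List[ItemVersion] = field(default_factory=list)  # oldest first


def _pick(sequence, number: Optional[int]):
    """Return entry `number` (1-based), or the latest entry if number is None."""
    if number is None:
        return sequence[-1]
    if not 1 <= number <= len(sequence):
        raise LookupError(f"no such version: {number}")
    return sequence[number - 1]


def resolve_item_version(item: Item, item_version: Optional[int] = None) -> ItemVersion:
    """Default Item version is the latest one."""
    return _pick(item.versions, item_version)


def resolve_bitstream(item: Item, name: str,
                      item_version: Optional[int] = None,
                      bitstream_version: Optional[int] = None) -> str:
    """Default Bitstream version is the latest within the chosen Item version,
    which itself defaults to the latest Item version."""
    revision = resolve_item_version(item, item_version)
    return _pick(revision.bitstreams[name], bitstream_version)


item = Item(versions=[
    ItemVersion({"paper": ["draft.doc", "draft.pdf"]}),
    ItemVersion({"paper": ["final.doc"]}),
])
latest = resolve_bitstream(item, "paper")
latest_in_first = resolve_bitstream(item, "paper", item_version=1)
```

Here an unqualified request falls through to the latest reformatting in the latest revision, while a request that names only the Item version gets the latest reformatting within that revision.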

Implementation notes: Those with more experience with the nuts and bolts of DSpace implementation may want to elaborate on what would be involved in implementing this proposal. But here are a few possible implications:

Some APIs may need to shift to allow version to be specified.
New API calls may be needed for getting version sets and creating new versions.
New Item versions may include many unchanged Bundles and/or Bitstreams. While these would be considered semantically different, a smart storage layer could implement them with a common block of binary content pointed to from different locations, thus reducing the storage overhead of many versions. (I believe current DSpace implementations already do this sort of thing.)
Metadata may shift between versions, but this is probably not a big problem if we bundle metadata with the objects; we'll just have it be versioned along with the content. (JSR-170 seems to allow one to specify both version-independent and version-dependent metadata, but that may unnecessarily complicate the interface, especially for user-defined metadata. It may be enough to just replicate metadata between versions, but we'd still need some general metadata just to keep track of the different versions.)
Due to infrequent changes, we probably can use the same storage systems as at present, with the addition of version IDs; there is no need to use diff-based storage as with CVS or Subversion. If there's a stable API, though, interested implementers who wanted diff-based support could provide an alternative storage implementation using the same interface, but based on diffs.
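
One way to picture the "common block of binary content pointed to from different locations" idea is a content-addressed store, where each version records only checksums and identical bytes are kept once. The sketch below is an assumption-laden illustration, not a description of how DSpace stores bitstreams; `ContentStore` and `store_version` are hypothetical names.

```python
import hashlib
from typing import Dict


class ContentStore:
    """Keeps each distinct byte sequence once, keyed by its SHA-256 digest."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        self._blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def blob_count(self) -> int:
        return len(self._blobs)


def store_version(store: ContentStore, files: Dict[str, bytes]) -> Dict[str, str]:
    """Store one Item version; return its manifest of name -> content key."""
    return {name: store.put(data) for name, data in files.items()}


store = ContentStore()
v1 = store_version(store, {"paper.pdf": b"first draft", "data.csv": b"1,2,3"})
v2 = store_version(store, {"paper.pdf": b"second draft", "data.csv": b"1,2,3"})
```

The two manifests are distinct versions, but the unchanged `data.csv` points at the same stored blob in both, so four file entries cost three blobs.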

Scalability implications: Multiple versions will take up more space, but this can be controlled by smart use of pointers to recurring content (see above), and by pruning unneeded versions.

Performance implications: Nothing particular comes to mind, other than the incremental overhead of feature-creep. I don't expect there to be so many versions of any particular item that enumerating them would be very costly in time or space.

Security implications: The continued existence of previous versions in the system, and their access rights, should be made clear. Otherwise, someone may try to revise an Item to remove sensitive or otherwise undesirable content, not realizing that unwanted content is still available to readers who know where to look.

Possibly related issues:

Managing metadata along with content (i.e. managed as part of the AIP rather than managed in a separate relational DB) would make this easier to implement. That revision to the architecture has been previously discussed in other contexts as well.
We may need to revisit our handling of identifiers to accommodate different versions of Items, Bundles, and Bitstreams.

Previous discussion

VersioningSupport describes a versioning scheme proposed by MIT and HP in 2004. That discussion focused more on keeping separate Items for separate versions, and implementing relational links between them. That would be a smaller change in the architecture that would presumably be easier to implement; however, it might not be as convenient for repository management.

Further discussion

(Comments could be made down here, and/or noted at appropriate spots above. I'm not sure what will work best in practice. We'll probably be discussing this in other forums as well; e.g. Dspace-devel and the upcoming face-to-face meeting.)

RobertTansley

I think some concrete use cases will help the discussion here.

1. A PDF document with extracted full text

2. PowerPoint presentation (.ppt file); a PDF conversion; an HTML conversion consisting of several HTML + GIF files

3. Archival-quality video; down-sampled QuickTime, Windows Media versions for typical end users; potentially some extracted keyframes
With format migrations/conversions, versions might not be entirely 'linear', or rather, one representation/set of files may have several derived representations/sets of files, e.g. in the case of 2. above. Is a thumbnail also a 'reformatting' of an image representation? Or extracted full text a reformatting of a doc?

This is not quite the same as branching a la CVS etc., as it is probably reasonable to assert that these 'branches' are leaves, never to be merged back into the trunk. However, it's still more complicated than just saying each version supersedes the previous one.

The situation with item-level/semantic versioning (revisions) gets muddier when such converted versions are around. If I update the source PPT, hence creating a new semantic revision of the item, what happens with the PDF + HTML versions? What if they're not automatically created when I update the PPT? Presumably the PDF + HTML versions should not be 'part of' the later version of the object.

So maybe we have "versions" in these senses:

Different objects (items) related in descriptive metadata
Different (semantic) revisions of a single item
Within a single revision, some way of specifying that representation Y was derived from representation X
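
As a rough sketch of the third sense, each representation could record which representation it was derived from, and a new revision could carry forward only those representations that do not depend on the one being replaced. This is one possible answer to the PPT/PDF/HTML question above, not a settled design; `Representation`, `depends_on` and `carry_forward` are hypothetical names, and the derivation links are assumed to contain no cycles.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Representation:
    name: str
    derived_from: Optional[str] = None  # name of the source representation, if any


def depends_on(reps: Dict[str, Representation], name: str, source: str) -> bool:
    """True if `name` is `source` or was derived from it, directly or indirectly."""
    current: Optional[str] = name
    while current is not None:
        if current == source:
            return True
        current = reps[current].derived_from
    return False


def carry_forward(reps: Dict[str, Representation], replaced: str) -> Dict[str, Representation]:
    """Representations that remain valid when `replaced` gets new content."""
    return {n: r for n, r in reps.items() if not depends_on(reps, n, replaced)}


revision_1 = {
    "slides.ppt": Representation("slides.ppt"),
    "slides.pdf": Representation("slides.pdf", derived_from="slides.ppt"),
    "slides.html": Representation("slides.html", derived_from="slides.ppt"),
    "notes.txt": Representation("notes.txt"),
}
still_valid = carry_forward(revision_1, "slides.ppt")
```

With this model, updating `slides.ppt` leaves only `notes.txt` to be copied into the new revision; the PDF and HTML conversions stay with the old revision until they are regenerated.
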
In terms of IDs, which would item revisions get:
'resolvable' persistent IDs (a la Handles). e.g. we could give versions a Handle like:

`hdl:1721.1/1234/3456/version/4`

but since these look like Handles we'd also need to make these actual Handles, i.e. resolvable, meaning lots more Handles to manage.
essentially non-resolvable but persistent IDs (e.g. info: URIs)
something like URLs which don't have quite the same persistence possibilities as Handles (e.g. http://dspace.foo.com/dspace/version/1524.4/1234/version/4)
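
Whichever ID scheme is chosen, client code would need to separate the Item identifier from the optional version. A minimal sketch, assuming the `/version/N` suffix shown in the Handle example above (that suffix is only a suggestion in this discussion, not an existing Handle convention, and `parse_versioned_handle` is a hypothetical helper):

```python
import re
from typing import Optional, Tuple

# "hdl:" + handle + optional "/version/<digits>" at the very end.
_VERSIONED = re.compile(r"^hdl:(?P<handle>.+?)(?:/version/(?P<version>\d+))?$")


def parse_versioned_handle(identifier: str) -> Tuple[str, Optional[int]]:
    """Split an identifier into (handle, version); version is None if absent."""
    match = _VERSIONED.match(identifier)
    if match is None:
        raise ValueError(f"not a hdl: identifier: {identifier!r}")
    version = match.group("version")
    return match.group("handle"), (int(version) if version is not None else None)


versioned = parse_versioned_handle("hdl:1721.1/1234/3456/version/4")
unversioned = parse_versioned_handle("hdl:1721.1/1234/3456")
```

An identifier without the suffix yields `None` for the version, which a resolver could then treat as "latest".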

How would these versions be presented to external interfaces like OAI-PMH? Likely to vary based on the interface; in most cases, most recent is appropriate.

In terms of scalability/performance, the main question is whether previous revisions of items will get indexed. If previous revisions of objects essentially become a 'dark archive' (i.e. only retrievable by specific ID for it, assuming local policy dictates the previous revision is retained) this will scale with storage; reformatting may need to be handled carefully (e.g. if I create a PDF from a Word doc, no need to extract full text and index from both).

Storage issues:

Where is versioning implemented? Should the storage layer be aware of versioning, or is the storage layer a 'black box', with versioning information, history, policies etc. managed in the business logic layer?
Most storage systems' notions of versioning largely deal with individual files, not the logical objects we are concerned with (items, and representations which may consist of >1 file.) It might be possible to version metadata serialisations as surrogates for the logical object in something like CVS. With JSR-170 you can version Nodes, and each version can have different children, so I expect a reasonable versioning strategy could be built on top of that.

However, in any case the DSpace business logic layer will need to 'know' about these ideas of versioning, and I find it hard to believe it would be easy to swap in e.g. CVS, SRB etc. underneath and have the specific advantages of each be leveraged in the DSpace application. Presumably whatever the storage infrastructure, DSpace will need to maintain version information itself together with IDs, or at least be able to infer that information from the storage infrastructure itself.

So our choices here seem:

Have underlying storage component continue to be ignorant of versions, and have it all managed by the DSpace application. Advantage: can continue to use simple file system storage as well as Grid-based etc. and still have good versioning capabilities. Disadvantage: we have to do all the work.
Settle on one storage back-end technology and bake in the way this deals with versioning into the DSpace core code. Advantage: Leverage versioning capabilities of storage back-end (e.g. SVN, L2 JSR-170 implementation). Disadvantage: Migrating to/using another back-end would mean the core would need a lot of reimplementing. Would need to keep up with changes to said API/technology. Would need to be careful to pick a back-end that scales to terabytes of storage, SANs, clustered+load balanced servers, HSM, streaming etc.
Somewhere in between, maybe with an API with implementations that have a lot of the versioning logic in them (might get messy).
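
To make the first choice more concrete, here is a minimal sketch in which the storage component only reads and writes opaque keys, and all version bookkeeping lives in the application layer. Every name here (`BlobStore`, `InMemoryStore`, `VersionManager`, the `item/vN` key layout) is an assumption made for illustration; none of it is taken from DSpace.

```python
from typing import Dict, List, Optional, Protocol


class BlobStore(Protocol):
    """What the application needs from storage: no notion of versions at all."""
    def write(self, key: str, data: bytes) -> None: ...
    def read(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in for a file system, Grid store, etc."""
    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def write(self, key: str, data: bytes) -> None:
        self._data[key] = data

    def read(self, key: str) -> bytes:
        return self._data[key]


class VersionManager:
    """Application-level version history layered over any BlobStore."""
    def __init__(self, store: BlobStore) -> None:
        self._store = store
        self._history: Dict[str, List[str]] = {}  # item id -> storage keys, oldest first

    def add_version(self, item_id: str, data: bytes) -> int:
        keys = self._history.setdefault(item_id, [])
        version = len(keys) + 1
        key = f"{item_id}/v{version}"
        self._store.write(key, data)
        keys.append(key)
        return version

    def read(self, item_id: str, version: Optional[int] = None) -> bytes:
        keys = self._history[item_id]
        key = keys[-1] if version is None else keys[version - 1]
        return self._store.read(key)


manager = VersionManager(InMemoryStore())
first = manager.add_version("item-1", b"draft")
second = manager.add_version("item-1", b"final")
```

Swapping `InMemoryStore` for another back-end would not touch `VersionManager`, which is the advantage named above; the cost is that the history itself (here just a dictionary) has to be persisted and maintained by the application.
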
API issues:
How is it decided when a new version is created? Does it just happen when objects are updated, or do versions need to be specifically created using API calls?
Do we need to create some notion of an 'event', akin to a database transaction, but based more on logical events than HTTP transactions? (I quite like this train of thought; I'll try and hammer out some sort of proposal).
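
One possible reading of the 'event' idea is that several changes made together produce exactly one new version when the event completes, and none if it fails. The sketch below is speculative and uses hypothetical names (`VersionedItem`, `event`); it is not the proposal promised above.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, List


class VersionedItem:
    def __init__(self, files: Dict[str, bytes]) -> None:
        self.versions: List[Dict[str, bytes]] = [dict(files)]  # oldest first

    @property
    def latest(self) -> Dict[str, bytes]:
        return self.versions[-1]

    @contextmanager
    def event(self) -> Iterator[Dict[str, bytes]]:
        """Group any number of changes into at most one new version."""
        working = dict(self.latest)  # edit a copy; earlier versions stay untouched
        yield working
        # Reached only if the with-body finished without raising.
        if working != self.latest:
            self.versions.append(working)


item = VersionedItem({"slides.ppt": b"v1"})

with item.event() as files:  # two changes, one logical event
    files["slides.ppt"] = b"v2"
    files["slides.pdf"] = b"pdf of v2"

with item.event() as files:  # nothing changed, so no new version
    pass
```

After these two events the item has two versions: the original, and one that contains both the updated PPT and its PDF conversion.
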
UI issues:
All of the above, despite being quite tricky, is still probably trivial compared to the task of creating a UI that shows and allows users to understand and create the different kinds of versions we're talking about here. If a user helpfully tries to upload a Word doc and the PDF version of something, how do we get that information out of them? If they upload a report or summary with a bunch of research data, will they need to be versioned (in terms of revisions) separately? Or will DSpace the platform require that the report and data be separate items to be able to manage versioning reasonably?
Other issues:
Maintaining provenance information
Sorry this is such a long reply. ("If I'd had more time, I'd have written a shorter letter")
