Comment: Migrated to Confluence 5.3

...

Current AIPs have too much interdependency. Parent objects (e.g. Collections) enumerate all of their children (e.g. Items). This means that every time a new child object (e.g. Item) is added/removed, it also must be added/removed from all of its parents' AIPs.

Based on the discussions below, we have so far come up with four options (at least for the short term). Feel free to add to these if you think of other options or pros/cons:

  1. Allow Collections/Communities to enumerate their children (this is how the AIPs are currently formed in the prototype)
    • Pros
      • Makes partial-restores (restoring a Collection/Community) a bit easier – just restore the Collection/Community AIP and it then tells you what child AIPs are necessary to restore
    • Cons
      • Adding a new child object also changes the parent AIP. AIPs are not as independent.
  2. No enumeration of children in AIPs + local AIP parser
    • Pros
      • AIPs are independent.
      • Would work fine when restoring an entire site (or just a single item).
    • Cons
      • Local AIP parser is great as long as AIPs are stored locally. If the AIPs are actually stored elsewhere (whether in DuraCloud or in any other backup solution), then restoring a single Community or Collection is more complex. If the parser is local, then nearly all AIPs may need to be copied to local storage to be parsed – so that it could be determined if the AIP belongs to the Community or Collection being restored.
  3. No enumeration of children in AIPs + remote AIP parser (in DuraCloud, etc)
    • Pros
      • Same as #2 – in addition, now the remote parser can decide which AIPs need to be pulled down locally (so that you only need to copy the AIPs to local storage that you really need).
    • Cons
      • May be DuraCloud specific? Other backup solutions (to tape, external drive, offsite storage) may not be able to take advantage of an external parser.
  4. No enumeration of children in AIPs + a site "index" (which details all relationships)
    • Pros
      • Again, relatively simple partial-restore process (like #1) – In this scenario you just pull down the site "index" file to determine which AIPs are needed to fulfill the restore.
      • AIPs remain independent of one another
    • Cons
      • Could be semi-"proprietary" to DSpace? In other words, would other systems understand this file? But, do we care? If the AIP export is used by someone to migrate to another system, e.g. Fedora or similar, then they would likely be loading all AIPs, and have no usage for the "index" file in any case.
      • Although AIPs remain independent, any changes in relationships (e.g. adding a new object, moving an item) require updates to this "index" file as well – probably, not a big deal, but it's worth mentioning as well.
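
As a rough illustration of option 4, the sketch below builds a parent-to-children "index" from child-to-parent links and uses it to list the AIPs needed for a partial restore. The identifiers, the dictionary layout, and the function names are all hypothetical; nothing here reflects an actual DSpace file format.

```python
from collections import defaultdict


def build_site_index(parent_by_child):
    """Invert a child -> parent mapping into parent -> [children]."""
    children = defaultdict(list)
    for child, parent in parent_by_child.items():
        children[parent].append(child)
    return children


def aips_needed(index, root):
    """Return the root object plus all of its descendants."""
    needed, stack = [], [root]
    while stack:
        node = stack.pop()
        needed.append(node)
        # .get() avoids creating empty entries for leaf objects
        stack.extend(index.get(node, []))
    return needed


# Hypothetical identifiers, for illustration only
parent_by_child = {
    "collection-1": "community-1",
    "collection-2": "community-1",
    "item-1": "collection-1",
    "item-2": "collection-1",
    "item-3": "collection-2",
}
index = build_site_index(parent_by_child)
print(sorted(aips_needed(index, "collection-1")))
```

The second con above shows up directly in this sketch: moving an Item means updating `parent_by_child` and rebuilding (or patching) the index, even though no AIP other than the Item's own changes.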

-------

[15 April 2010] Decision: We (Richard R, Bill H, Tim D) decided that child objects should enumerate their parents (so you can find an Item's parent Collection from that Item's AIP), but parents should not enumerate all their children. Although this may make restoring content more complex (in order to restore a Collection, you need to look at each Item to determine if it is a child of that Collection), it will lessen inter-dependencies between AIPs.

-------
[16 April 2010 - Tim Donohue] I realized we may need to rethink this decision. If there is no way to determine children of parents easily, then you may encounter the following less-than-ideal scenario when restoring a single Collection along with all its Items:

  • Suppose all your AIPs currently take up 1TB of space. Likely, nearly 90% of that space (900GB) is for Item AIPs, as they tend to be larger and more frequent than Community or Collection AIPs.
  • Suppose you also want to restore a single Collection.
  • Since you know the Collection you need to restore, obviously you can immediately restore the Collection metadata from the Collection AIP
  • However, if the Collection AIP does not enumerate its Items, you will be stuck having to parse 900GB of Item AIPs to determine which belong to this Collection. This becomes even more inefficient if you are using a service like DuraCloud, as it will force you to download 900GB of Item AIPs in order to unzip them and determine which belong to this Collection.

This scenario makes me think we either need Collection AIPs to continue to list all Item members, or we need another way to relatively easily "lookup" which Items belong to that Collection.
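
The cost described in this scenario can be sketched as follows. This is a toy model only: the `ItemAIP` record, its fields, and the sizes are hypothetical, and it assumes each Item AIP records its parent Collection, as per the 15 April decision.

```python
from dataclasses import dataclass


@dataclass
class ItemAIP:
    identifier: str
    parent: str       # parent Collection, as recorded in the Item AIP
    size_bytes: int


def restore_plan(item_aips, collection_id):
    """Find a Collection's Items when only children know their parent.

    Every Item AIP has to be examined, so the bytes scanned equal the
    total size of all Item AIPs, not just the ones being restored.
    """
    scanned = sum(aip.size_bytes for aip in item_aips)
    wanted = [aip for aip in item_aips if aip.parent == collection_id]
    needed = sum(aip.size_bytes for aip in wanted)
    return [aip.identifier for aip in wanted], scanned, needed


# Hypothetical data, for illustration only
aips = [
    ItemAIP("item-1", "collection-1", 400),
    ItemAIP("item-2", "collection-2", 300),
    ItemAIP("item-3", "collection-2", 200),
]
items, scanned, needed = restore_plan(aips, "collection-1")
print(items, scanned, needed)
```

The gap between `scanned` and `needed` is the overhead that either an enumeration in the Collection AIP or a separate lookup would remove.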
-------

[01 June 2010 - Mark Wood] It's not necessary to parse entire Item AIPs since they are ZIP archives; just read the manifests. If they are stored remotely (e.g. DuraCloud) then you need to be able to run the parser there and send back the lists of interesting items.
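
A minimal sketch of reading only the manifest from a ZIP-packaged AIP, using Python's standard `zipfile` module. The manifest file name (`mets.xml`), the `parent` attribute, and the demo AIP layout are assumptions made for illustration and are not taken from this page.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

MANIFEST_NAME = "mets.xml"  # assumed manifest file name inside each AIP


def read_manifest(aip_file):
    """Return the manifest bytes without decompressing other members."""
    with zipfile.ZipFile(aip_file) as aip:
        return aip.read(MANIFEST_NAME)


def parent_of(aip_file):
    """Read the parent identifier from a hypothetical manifest format."""
    root = ET.fromstring(read_manifest(aip_file))
    return root.get("parent")


def make_demo_aip(parent_id):
    """Build an in-memory AIP: a small manifest plus a larger bitstream."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as aip:
        aip.writestr(MANIFEST_NAME, f'<manifest parent="{parent_id}"/>')
        aip.writestr("bitstream_1.bin", b"\0" * 100_000)
    buffer.seek(0)
    return buffer


demo_aips = {
    "item-1": make_demo_aip("collection-1"),
    "item-2": make_demo_aip("collection-2"),
}
matches = [name for name, f in demo_aips.items()
           if parent_of(f) == "collection-1"]
print(matches)
```

`ZipFile.read` decompresses only the named member, so the bitstreams are never unpacked. For remotely stored AIPs the archive bytes still have to be reachable somehow, which is why the parser would need to run next to the storage.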

On the other hand, we could extract the relationships into an index for each Collection and package that separately. Relationships are not part of the things related – the difficulty lies in trying to shove the relationship inside any one of the related entities.
-------

[01 June 2010 - Mark Diggory] I recommend considering this from the ORE aggregation style "standpoint". What we vaguely concluded a couple of years ago is that a DSpace Collection is not an ORE Aggregation because it is open-ended. ORE Aggregations are finite, thus a DSpace Item as an ORE Aggregation will enumerate its children while a DSpace Collection will not. I support the idea of not listing all the child Items in a parent Collection AIP, or the Collection AIPs within the parent Community AIP. The original AIP prototype's ability to reconstitute a repository's Community/Collection/Item hierarchy from the AIP contents did require fully traversing the repository to discover the ancestry of any one Community, Collection, Item, Bundle or Bitstream AIP. Being able to traverse the manifests without having to open the archives of bitstream content will give us the capability to do this efficiently. Perhaps there should be a means within the asset store to separate the AIP manifests from the rest of the bitstreams so that they may be traversed quickly.

This very much makes me think of both the Fedora Store and the Semantic Store projects, and of how we will address the subject of Entities for DSpace Communities, Collections, Items and Bitstreams. IMO, DSpace 2.0 Entities and AIPs are highly correlated: an AIP is an archival representation of all or part of an Entity. Likewise, Services in the DSpace Service framework may be seen as different views/subsets of the data/state of the content you refer to as an AIP.
-------

...