Feature Community

Feature Steward: TBD
Knowledge Gardener: Daniel Davis
Feature Evangelist: TBD

Member List:

Daniel Davis

Support for Hierarchical Storage

There has been a long-standing need for hierarchical storage support in the Fedora Repository and related components. Recent events have brought this need back to the forefront as the Fedora Repository is increasingly used in research applications. Examples include UPEI's Virtual Research Environment (VRE) deployments using Islandora, the Max Planck Institutes' eSciDoc (FIZ Karlsruhe), and the emerging NSF Cyberinfrastructure program. Other products, such as DICE iRODS and SDSC SRB, already provide a means to virtualize storage, including support for hierarchical storage. A number of institutions have reported increased interest in using Fedora to store high performance computing (HPC) data. Institutional repositories and humanities researchers with large collections have also requested this feature.

In this space, we will explore the requirements for supporting hierarchical storage and help the community add that support to Fedora Commons components. The goal of this work is to produce a Hierarchical Storage Module (HSM) integration for Fedora Commons.

Characteristics of Hierarchical Storage

Hierarchical storage, also known as tiered storage, is driven by the notion that costs can be reduced by storing all or part of a collection of files (bitstreams) on lower performing, less expensive storage technologies while keeping a copy of some part of the collection on a high performing, more expensive storage technology for immediate use. The hierarchical storage system implements a policy that determines on which tier each file is stored in order to meet system goals. In real world implementations, many other needs may be considered, since hierarchical storage can also help with system requirements such as backup, replication, high availability and disaster recovery. However, it is cost that underlies any decision to deploy hierarchical storage.
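
As an illustration of the kind of placement policy described above, here is a minimal sketch in Python. The tier names, the age thresholds, and the names StoredFile and choose_tier are all assumptions made for this example; they do not come from Fedora or from any particular HSM product, and a real policy would likely weigh more factors than last access time.

```python
from dataclasses import dataclass


@dataclass
class StoredFile:
    """Hypothetical record describing one bitstream in the collection."""
    name: str
    size_bytes: int
    days_since_access: int


def choose_tier(f: StoredFile, hot_days: int = 30, warm_days: int = 365) -> str:
    """Pick a storage tier from how recently the file was accessed.

    The thresholds are illustrative assumptions: recently used files stay
    on fast, expensive storage; older files move to cheaper tiers.
    """
    if f.days_since_access <= hot_days:
        return "fast"
    if f.days_since_access <= warm_days:
        return "nearline"
    return "archive"


if __name__ == "__main__":
    for f in (
        StoredFile("thesis.pdf", 2_000_000, 3),
        StoredFile("survey-2007.tif", 900_000_000, 200),
        StoredFile("hpc-run-0001.dat", 40_000_000_000, 1200),
    ):
        print(f.name, "->", choose_tier(f))
```

A policy aimed at the other goals mentioned above (backup, replication, disaster recovery) would return a set of tiers rather than a single one, but the shape of the decision is the same.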

The Fedora architecture presents some unique problems and opportunities in supporting hierarchical storage. Even though storage cost is the driving reason for using hierarchical storage, it can also provide a solution for other system requirements. We hope that this forum can list those requirements to inform the design of a hierarchical storage integration that fits well within the overall Fedora architecture. In addition to the system requirements listed above, the design may need to consider large datastreams, partial reads, updates/versioning, and other features already listed in the JIRA tracker.

Special Aspects of the Fedora Commons Architecture

The Fedora Commons Architecture is most strongly represented by the Fedora Repository. The Repository acts as a spanning (or mediation) layer which encapsulates the way content is accessed. The architecture does not depend on the common notion of a "directory of files", which dominates thinking about how content is managed, including the de facto Web architecture. While applications can still use the Fedora Repository as if it were based on a "directory of files", access to content is virtual and uses dissemination services as access endpoints. There is no guarantee of a one-to-one relationship between a file and a dissemination; dissemination services may be quite complicated. Applications should normally not attempt to circumvent Fedora and access the file system directly.
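
To make the mediation idea concrete, the following Python sketch shows a repository object that serves disseminations while hiding where the underlying files live. It is not Fedora's API: TieredStore, Repository, register, disseminate and the recalls counter are hypothetical names invented for this example, and the two-tier store is a stand-in for whatever HSM sits underneath.

```python
class TieredStore:
    """Hypothetical two-tier store: a fast cache in front of a slow archive."""

    def __init__(self, archive):
        self.archive = dict(archive)  # slow, inexpensive tier
        self.cache = {}               # fast, expensive tier
        self.recalls = 0              # how many times we went to the archive

    def read(self, key):
        # Stage the file into the fast tier on first use.
        if key not in self.cache:
            self.cache[key] = self.archive[key]
            self.recalls += 1
        return self.cache[key]


class Repository:
    """Hypothetical mediation layer: clients ask for disseminations, not files."""

    def __init__(self, store):
        self.store = store
        self.disseminators = {}

    def register(self, name, keys, render):
        # One dissemination may draw on several stored files.
        self.disseminators[name] = (list(keys), render)

    def disseminate(self, name):
        keys, render = self.disseminators[name]
        return render([self.store.read(k) for k in keys])


if __name__ == "__main__":
    store = TieredStore({"part1": b"hello ", "part2": b"world"})
    repo = Repository(store)
    repo.register("greeting", ["part1", "part2"], lambda parts: b"".join(parts))
    print(repo.disseminate("greeting"))
    print("archive recalls:", store.recalls)
```

Because the client only ever names the dissemination, the store is free to move files between tiers without breaking anything. An application that reads the file system directly would lose that guarantee.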

...