
Notice

This tree is the DfR 0.3 System Documentation.

Introduction

The significant increase in the production and collection of scientific data is appropriately referred to as a “data deluge.” Researchers and scholars confronting this deluge are faced with new data management challenges that have both social and technical dimensions. DuraCloud for Research (DfR) will reduce the impact of the data deluge on researchers — enabling them to concentrate on their real work — by automating many of the rote activities involved in handling digital research material. But DfR also has the opportunity to do much more, since "value added" services can easily operate on data once copies of it reside in a "cloud." It is the work of this project to analyze researcher and research institution needs in order to build software and offer services tailored to meet those needs. DfR will:

  • Support backup (via DfR's UpSync Service) from the researcher's infrastructure to the cloud (DuraCloud) early in the research process
  • Enable duplication and integrity checking in the cloud to ensure uploaded research files are kept safe and to assure durable access (using DuraCloud's services)
  • Create rich objects for the uploaded research materials, using DfR's Rich Object Creation Service (ROCS), to automatically enrich the metadata, help build search indices, and provide "value added" services leveraging the cloud
  • Build a Service Execution Environment (SEE) — an integration framework and platform — from well-supported, "off-the-shelf," free open-source software like Spring, ActiveMQ (JMS) and Camel, using well-accepted integration patterns and methods
  • Enable the SEE to call any service in a hybrid architecture where data or services can reside anywhere on the network, including the researcher's infrastructure, starting with the ROCS implementation
  • Ingest the rich objects into the Fedora Repository, referencing the uploaded research material in DuraCloud, and storing the rich objects there too
  • Demonstrate one possible integration of DfR — to validate DfR's technical approach — using SIdora (Drupal-based) to graphically present the research materials and functions such as navigation, organization, editing and discovery

From a social perspective there are fundamental open questions pertaining to roles, responsibilities and processes for responsible research data management. In the past, it was the responsibility of individual researchers (or teams) to manage their research materials, including both data and related context information, while they were an active part of their work, and then to publish the results in the form of papers or books. Original research materials supporting the publications were rarely made available, and if they were made available, did not remain so for very long. Less often mentioned is that information for current research is also commonly lost, hard to find or hard to share with research partners, especially in a secure manner. The magnitude of the problem is such that many funders, such as the NSF, now require that a data management plan be prepared and executed for any new project. The results of lax data management practices are well documented and often include:

  1. loss of active research data
  2. difficulty verifying research results
  3. limited ability to perform repeatable research
  4. limited ability to reuse or re-purpose research

From the technical perspective, there are currently many types of systems in use for data management, including an array of non-standard systems, grid-based storage networks and, to a lesser degree, institutional repositories such as DSpace and Fedora, as well as enterprise backup solutions from various vendors. Even within a single institution, there are often no standards for storing data, resulting in ad hoc approaches and variability across departments and individuals. Unfortunately, there are also “data under the desk” approaches, where researchers and scientists go it alone by storing and managing valuable data in their personal computing environment with commodity computers and local storage devices. Many institutions provide enterprise backup solutions that, while highly encouraged, do not, by themselves, fulfill all the emerging data management needs for research.

Most discussions surrounding data management for research focus on either handling the large volumes of raw data collected from large instruments and facilities, or where data goes to die — the archive. However, this project was conceived based on the notion that there is a circular life-cycle for research information that begins with active research projects, moves through publication and archival, and returns to reuse in new active research projects. Indeed, this is how new research builds upon earlier work. Solid research data management infrastructure and practices provide support for the needs and interests of all the participants in this life cycle. Preservation is usually not a priority for researchers, but reuse promotes preservation. If preparing the research information for reuse can be done with little or no effort by the researcher, the ability to preserve it is greatly enhanced. The more the infrastructure helps the researcher perform research, the more likely it is to be used.

It is the goal of DfR to provide data management software that helps the researcher while also supporting the archival and re-use phases of the research information life-cycle.

The life-cycle begins with the researcher. It is clear that the researcher is primarily concerned with performing and publishing their research. Anything, particularly data management infrastructure and practices, that gets in the way of the researcher's goals is not acceptable. However, the need for a research data management infrastructure is clear, both to handle the "data deluge" and to satisfy new mandates being levied for managing the resulting information.

DuraCloud for Research (DfR) is a new project, with development initially funded by the Alfred P. Sloan Foundation, to help researchers (and research institutions). DfR will be offered both as open source software and as a managed service by DuraSpace, the non-profit organization whose mission is to provide preservation, archiving, and access solutions for scholarly, cultural, and research data by supporting community-driven, open source software projects.

DfR takes the view that the infrastructure must enhance the research process and make the data management component of the researcher's work easier. By supporting the researcher (the active research phase of the life cycle), we enable processes that happen in later phases like archival management. If we concentrate primarily on later phases, the infrastructure will not be used by the researcher. Many new information technologies have become available (even mature) that make the development of such an infrastructure feasible. For example, there are search tools, feature extraction and transformation tools, tools for orchestrating and deploying services, new security/policy technologies and, in particular, the emergence of cloud-based storage and compute services. By integrating a number of currently available components (and services) and developing a few new key components, DfR can offer an evolvable, extensible infrastructure for serving the needs of both researchers and archivists.

DfR Development Principles

What is part of a good infrastructure for researchers? At this time no one knows. We also don't thoroughly know the needs for research data curation and, while we can draw from the extensive body of work performed for libraries and records management, we must also be careful not to draw false analogies. We can benefit from the experience of a number of current and past projects. We can also draw from the design of successful infrastructure. But, to a great extent, we have to build parts of the infrastructure and put them before DfR users. To manage development risk we have adopted the "Minimum Viable Product" methodology from "Lean Engineering" practices widely in use by software companies such as Google and Amazon. From the "lean" approach, we are integrating "structured learning" practices to guide us. We use "agile" project management and development methodologies to produce quality products while supporting redirection as we learn.

 DfR is guided by a set of core development principles that are adjusted through a structured learning process.

Three core principles quickly surfaced from our advisors and workshops along with one key software design principle. DfR must:

  • Provide benefits to current research or DfR won't be used, causing every other goal to fail
  • Support heterogeneity in all aspects since no single set of standards and practices will support all users, especially in the long term
  • Support a hybrid infrastructure since computing and data may be located anywhere
  • Manage complexity or the infrastructure will not succeed

An initial set of key services was identified (and these are described in more detail in the following sections):

  • Services and the Integration Framework - Provide a service execution environment (platform) to automate DfR and researcher tooling
  • Backup and Restore - Capturing a copy of the research material early in the lifecycle is critical, and supporting "backup" is the best first step
  • Access and Discovery - Ubiquitous, long term access is required and expected in today's private and public networks including the ability to find data anywhere
  • Security and Policy Enforcement - Access to data (and services) must be secure, especially pre-publication, and shareable when permitted
  • Scalable and Durable - The solution must be scalable and elastic to support changing needs and uses while automating support for integrity, provenance and preservation-enablement
  • Transparent but Managed - DfR must be a transparent part of the normal working environment of the researcher and able to run services anywhere, but also be managed (to the minimal degree)

To support these services the DfR architecture and its implementation must:

  • Take a service-based approach to manage complexity while providing flexibility and enabling "unanticipated" uses
  • Use off-the-shelf software and services as much as possible for the core and concentrate on enabling integration rather than building new software
  • Utilize data (be data-driven) as much as possible rather than putting functions into code
  • Use an integration framework as the core upon which services can be implemented or robustly connected
  • Use a "Convention over Configuration" (CoC) approach with supported patterns, particularly "Enterprise Integration Patterns" (EIP)
  • Provide a service execution environment that permits using service orchestration appropriate to the application's needs
  • Permit many ways to add and access research materials as well as many ways to perform services (compute) over the research materials
    • make it location independent (many on-ramps and off-ramps)
    • make it easy for the researcher to add innovative services to operate on data accessible to them
    • make it easy to share services of common utility

At this stage of this work we know less about "value added" services for DfR. We know that some form of these services (listed below) is needed to make DfR a more useful platform. Some could be provided by researcher tools and possibly added to DfR if they are of widespread interest. Others are customizations of existing products. And there are overlaps and synergies between them. But these services show some of the promise that can enable DfR to become a general purpose platform for supporting the entire research information life-cycle. We plan to incorporate some of these services, in simple forms, to support structured learning, informing what researchers want and where investments will provide the best return. These services may include:

  • Data Discovery
    • Search over a researcher's data
    • Data accessible to the researcher
  • Feature Extraction
  • Custom indexes including search indexes
  • Collaboration tools
  • Visualization tools
  • Virtual Research Environments

The remaining parts of this document contain brief overviews of key aspects of DfR. In subsequent documents we will provide more comprehensive, technical and detailed information.

Services and the Integration Framework

DfR provides a flexible, simple, configurable service execution environment based on common, off-the-shelf technology that can be used for all kinds of services.

To understand DfR, it is best to start with the service framework and the core services. To be successful, DfR needs a flexible, configurable but simple way to perform services. It also needs a set of core services that perform the most basic functions. DfR is focused on integration, not attempting to "implement it all" as one monolithic product. And we chose to use widely-used, well-supported, free-open-source, off-the-shelf products for much of the core rather than try to re-implement what has been built before. The architecture is optimized to simplify the process of adding services, thus enabling communities to be built around popular services and allowing researchers to invent new services without having to do all of the data management plumbing.

Using the principles of Service-Oriented Architectures (SOA), DfR primarily implements or uses services conforming to the Web architecture and REST (Representational State Transfer). SOA best practices dictate that a service should do one thing (to keep it simple and permit reuse). Recently "micro-service" has become a popular term, but a service is a service regardless of name. DfR also supports heterogeneity in processing as well as data: any service technology can be plugged into DfR. To make DfR pluggability easy, it also encourages the use of "Enterprise Integration Patterns" and a principle called "Convention over Configuration." It is difficult to create a software system which is both easy to use and flexible; but these two concepts, if applied, help to reduce complexity.
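
To make the "a service should do one thing" principle concrete, the sketch below shows a minimal, single-purpose REST micro-service of the kind that could be plugged into DfR. It is only an illustration, not part of the DfR codebase: it uses the JDK's built-in HttpServer, and the port, path and choice of SHA-256 are arbitrary assumptions.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

/**
 * Illustrative single-purpose REST micro-service (not DfR code):
 * POST bytes to /checksum and receive their SHA-256 digest as text.
 */
public class ChecksumService {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/checksum", exchange -> {
            try {
                byte[] body = exchange.getRequestBody().readAllBytes();
                MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
                StringBuilder hex = new StringBuilder();
                for (byte b : sha256.digest(body)) hex.append(String.format("%02x", b));
                byte[] response = hex.toString().getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, response.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(response);
                }
            } catch (Exception e) {
                exchange.sendResponseHeaders(500, -1);
            }
        });
        server.start(); // e.g. curl -X POST --data-binary @file.dat http://localhost:8080/checksum
    }
}
```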

There is one more concept that deserves special attention, "The Truth is in the File." A loose interpretation is needed here since a number of variations on this theme are being used. For DfR, everything gets reduced at some time to a "serialization," a.k.a. a "bytestream" or the contents of a file. We try to put as little logic in code as possible and instead attempt to ensure that actionable data is resident within the serializations. This includes the research material (data), metadata, policies, events, history — pretty much everything.

The components of DfR are shown in the Architecture Sketch. Below we present a walk-through, since the components work together to provide more than the sum of their parts.

  1. Since DfR is a "hybrid" architecture (a system-of-systems), two major sections are shown:
    1. The Researcher System (which could overlap the institution's)
    2. The DfR system (which could run in any cloud but is demonstrated using DuraCloud)
  2. The DfR Monitor and Sync service (software) also called "UpSync" is installed within the Researcher System (individual's or institution's)
    1. Source Data (usually files) are created/modified/deleted in the Researcher System
    2. The DfR Monitor and Sync service "notices" changes based on policies set up as part of its installation
    3. Copies of the source data are sent (encrypted via SSL in transit) to DuraCloud leaving the originals alone
  3. DuraCloud stores each copy as a simple object that contains both the data and associated metadata (about the copy)
    1. DuraCloud creates fixity information such as a checksum and makes additional copies as needed for safe storage
    2. DuraCloud sends messages (publish-subscribe) to any service permitted to subscribe to its feed, announcing that there is new information about the copy
  4. The Rich Object Creation Service receives the message
    1. The ROCS determines what should be done from information in the message (creation messages are demonstrated)
    2. The ROCS fetches a copy of the DuraCloud simple object's metadata and can fetch the data too if needed
    3. This begins a series of orchestrated calls to services that we provide, or that others can provide (any network service can be called)
    4. The primary goal is to create a Rich Object that contains enhanced metadata and relationships to other objects (rich or simple)
    5. At the end the Rich Object is assembled in a form that can be ingested by the Fedora Repository
    6. The ROCS sends a message to the Fedora Repository (in this case an ingest message)
  5. The Fedora Repository receives the ingest message
    1. It performs the ingest, validating the rich object and storing it as FOXML on the repository's "cloud" fast filesystem
    2. Since the data is already in DuraCloud, the rich object contains only a reference to it and has no need to copy the data again
    3. Fedora indexes relationship information in its Resource Index
    4. Fedora sends a message to subscribers that a new object was ingested (for example, to Solr for search indexing)
  6. The Fedora CloudSync Service periodically, based on policies, copies the FOXML object from the repository's fast filesystem to DuraCloud for safety
  7. Fedora provides its APIs, services and disseminations of the rich object metadata and the data to any consuming service
  8. For demonstration, we integrated SIdora (not pictured) as a graphical (VRE) front-end to Fedora

We called this demonstration the "hook shot" but many on-ramps and off-ramps can be accommodated.
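
As a rough illustration of how the SEE could wire these steps together, the following Apache Camel route sketches the "hook shot" flow. It assumes an "activemq" JMS component is configured; the topic name, the "rocsEnricher" bean, the "action" header and the Fedora ingest URL are placeholders for illustration, not the actual DfR configuration.

```java
import org.apache.camel.builder.RouteBuilder;

/**
 * Sketch of the "hook shot" flow as a Camel route (illustrative only).
 * All endpoint and bean names below are assumed placeholders.
 */
public class HookShotRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("activemq:topic:dfr.duracloud.events")        // 1. DuraCloud publishes a notification
            .filter(header("action").isEqualTo("create"))  // 2. only creation messages are demonstrated
            .bean("rocsEnricher", "buildRichObject")        // 3. ROCS assembles the rich object
            .setHeader("Content-Type", constant("text/xml"))
            .to("http://fedora.example.org/fedora/objects/new"); // 4. ingest message to the Fedora Repository
    }
}
```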

The code must run somewhere. For DfR, this centers on the service execution environment (SEE). It is built from a number of well supported, readily available open source products and takes a standards-based, enterprise approach wherever possible. The SEE is primarily concerned with running services in a well managed fashion. Services may run directly on the operating system, application servers, OSGi containers or on external platforms. It's the job of the SEE to loosely manage service integration — their execution and the data passed between them. The SEE may be described as an integration framework which is compatible with many other products and can be used to integrate custom tools from any source, including those provided by a researcher. Put simply, the SEE provides the framework necessary to be able to plug in and organize the execution of services. Services may be executed directly, but event-driven operations through messaging are the preferred integration method. For scalability, asynchronous operations resulting in "eventual consistency" (BASE) are preferred over rigid transactional methods (ACID). Services can run outside the SEE, but services running as part of the SEE benefit from being managed with policies and a monitoring environment within the system. Also, the SEE is not one provider system, but a system of systems (hybrid) that can include parts of your research infrastructure. There is much work to do, but much of the core has been implemented using currently available technologies which can be extended later for new uses as they are conceived.

The products currently integrated into the 0.3 DfR demonstration are too numerous to list within the scope of this overview; among the most important are Spring, ActiveMQ (JMS), Apache Camel, DuraCloud, the Fedora Repository, Solr and Drupal (Islandora/SIdora).

The first use of the service execution environment in DfR is in the implementation of the Rich Object Creation Service (ROCS). This service receives notifications from DuraCloud via JMS indicating that new or modified files (simple objects) have been backed up. The service then creates rich objects to encapsulate these files. ROCS could listen to any notification service but the current implementation only supports DuraCloud. Simple objects are like files in file systems. They are containers for some sort of serialized content (bytestream) and small amounts of appropriate (meta)data like "filename", "size" or "creation date." Defining these terms is tough since we often use terms like "file" to mean both the container and the content, but for simple objects "a file in a file system" is pretty close (DfR uses "Datastream"). DuraCloud also stores simple objects.

For DfR, we really need much more context information about the "files," including how "files" are related to each other. Metadata is most commonly used but it is also one of those slippery terms, since "one person's metadata is another person's data" (DfR uses "Context Information" for the idea). The goal of the ROCS is to create rich objects that link context information with newly backed up files (and also previously backed up context information, files and related external information). Given a file, ROCS can (a small sketch follows this list):

  • Determine its format (type) to any desired degree of precision if we have its signature
  • Use its format to extract metadata, saving the researcher from time-consuming manual input
  • Record its directory and keep its history
  • Look into the file to extract data that you can use in your research (like the geographic location from a GPS-enabled camera)
  • Link files that are related, even constructing virtual organizations of the files
  • Subset a file or set of files, given the right service
  • Extract interesting features from the file
  • Enable you to plug in your own services and use the cloud (close to the data) to perform your own analyses
  • Call your own services or use service managers (like Kepler and Taverna) or, if you permit, let DfR call your services
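
A rough sketch of the first two capabilities, using only the JDK: identify a file's format and record basic context about it. Real characterization in ROCS would delegate to dedicated, signature-based tools (DROID and the like); the field names below are illustrative, not a DfR schema.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative context-information extraction for a single file (not the DfR schema). */
public class BasicCharacterizer {
    public static Map<String, String> characterize(Path file) throws IOException {
        BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
        Map<String, String> context = new LinkedHashMap<>();
        context.put("filename", file.getFileName().toString());
        context.put("directory", file.toAbsolutePath().getParent().toString());
        context.put("size", Long.toString(attrs.size()));
        context.put("created", attrs.creationTime().toString());
        context.put("modified", attrs.lastModifiedTime().toString());
        // Format identification by name/content sniffing; a real service would use
        // signature-based tools (e.g. DROID) for greater precision.
        String mimeType = Files.probeContentType(file);
        context.put("format", mimeType != null ? mimeType : "application/octet-stream");
        return context;
    }
}
```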

Rich Objects are subsequently ingested into the Fedora Repository. The Fedora Repository is a mature product that is known for its ability to be a "chameleon" and integrate with many applications. It already has a rich and vibrant community. DfR does not limit repositories that could be used; any repository can implement an event listener, accept the data package and convert it into a format it can use. But Fedora provides an excellent way to demonstrate the value of rich objects and several ways to access them and the relations between their parts.

Here is the tricky part – anything can be serialized. We can take a rich object and serialize it into a simple object and store it. We can also deserialize it back into a form that is easier to use. Fedora is designed to do just that. In DfR, Fedora is used more as a mediator than for its usual managed storage; instead, everything is stored in DuraCloud. A big advantage of this approach comes when large research data files are uploaded into DuraCloud. We really don't want to move them around unless absolutely necessary and, if we need to fetch them, it's more scalable to get them directly. Both Fedora and many operations of the ROCS just point to the uploaded file, saving considerable processing cost. Of course, services (and applications) are able to keep copies or indices that make operations faster — as long as everything critical to the sustained operation of the system is written to a serialized form within a reasonable time based on your risk tolerance.

Use of an SEE is a well proven integration approach (and can be implemented with many different product sets, in many different ways). It works by being concerned about sending the right information to the right service at the right time. It is best if two (or more) services are not mutually dependent. Then any two services can be integrated if we can transform the information output of one into the information input needed by another. DfR cannot support every possible integration at the beginning, but the approach makes it possible to add them over time, permitting a graceful evolution. In fact, we anticipate that users (researchers, data archivists) can use the SEE to develop their own services and plug them in. Hopefully, the best will be contributed by them for others to use. We did not invent this integration approach; it is well proven and in widespread use. DfR is working to apply it to serve the researcher and other actors in the research information life-cycle. By applying this approach, DfR will be much easier to use and more open than similar efforts. DfR is focused on enabling innovative integration communities, not producing the domain specific tools ourselves. This approach can, in the future, also help DfR to capture the process used to perform the science as well as the data and context information.

One of the great barriers to preserving research information is that providing context information manually is a laborious process usually performed by a research assistant who doesn't want to do it. Much of this information is already in the research data and associated materials. With the right tools (SEE, ROCS and services) we can substantially automate this process when making rich objects. Search engines and commercial services like the "Mendeley Desktop" are already performing similar functions but their technology is usually private. DfR offers a place to run services in an open source framework, hopefully forming communities around popular ones. As a proof of concept, DfR 0.3 incorporates a mime-checking service. In future versions, DfR will integrate a series of characterization tools (the first stage in assembling context information) as part of an EIP called "Content Enricher" to populate selected metadata forms. For example, adding "DROID" would be a good choice. Our analysis has shown that there is no single tool that can characterize every kind of content. This situation is even more acute for research data. The "Content Enricher" can look both at metadata and into the data. For example, it could use an image feature extraction tool to collect metadata and recognize elements within a JPEG. EIPs can even call remote services to provide information. Use of "Content Enricher" and other EIPs gives DfR enormous power and flexibility to automate these tasks for the researcher.
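
To show roughly how a "Content Enricher" could be wired in, the Camel route below enriches each backed-up object's metadata with the result of a mime-checking service before passing it on. The queue names, the "mimeCheckService" bean and the header name are assumptions for illustration; they are not DfR's actual configuration.

```java
import org.apache.camel.builder.RouteBuilder;

/**
 * Illustrative Content Enricher (EIP) route: call a mime-checking service
 * and merge its answer into the message as a header. Names are placeholders.
 */
public class MimeEnricherRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("activemq:queue:dfr.rocs.characterize")
            // Content Enricher: invoke the enrichment service and merge its result
            .enrich("bean:mimeCheckService?method=detect", (original, enrichment) -> {
                original.getIn().setHeader("dfr.mimeType",
                        enrichment.getIn().getBody(String.class));
                return original;
            })
            .to("activemq:queue:dfr.rocs.assemble");
    }
}
```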

Context information is key to repeatable, verifiable research. It is also necessary to help manage the researcher's data. The very same sort of information is essential for an archivist to provide stewardship for long term preservation of the research. One of DfR's goals is to enable this by providing tools the researcher wants to use because they make work easier and better. The ROCS implemented in the SEE is DfR's "swiss army knife" for adding value to backed up data files.

For demonstration, we integrated a specialized implementation of Drupal called SIdora. SIdora is a development of the Smithsonian Institution in conjunction with Discovery Garden and the University of Prince Edward Island. It provides a Virtual Research Environment to its researchers. It is built upon Islandora, an open source product of Discovery Garden and the University of Prince Edward Island. We could have used any number of "front end" applications, but SIdora has features that enable a very interesting integration for a proof of concept. Foremost, SIdora uses Fedora to store objects and integrates with Solr for search/discovery, making integration by services easy. But it also provides much more, acting as a visualization, navigation, policy enforcement and manual editing station.

SIdora is able to share Content Models, machine-readable specifications for the data that should be found in a class of Fedora objects. Combined with other models, they drive the forms you use for display and editing of metadata in the SIdora interface. Fedora and SIdora can also share policies such as those used for authorization as described below. Like DfR, SIdora incorporates its own SEE to automate functions like producing "thumbnails" for itself, demonstrating the integration of DfR as a "system-of-systems." While SIdora is not a core component of DfR, it is a well-respected friend.

Backup and Restore

DfR (with DuraCloud and other members of the research community) starts with an enterprise-class cloud backup service that transparently makes copies and then helps you seamlessly connect any resource or service of your choosing that can advance your work. DfR uses the same services to securely share project data with trusted assistants, collaborators and colleagues.

DfR's first task is to automatically back up research information (data and related research files) stored in digital form to a remote location. Backup benefits the researcher if only to avoid loss of working materials, and many are familiar with its purpose. Unfortunately, the benefits of backup are not highly motivating to the researcher. The largest advocates of backup have been actors in later parts of the research information life-cycle, notably data archivists. Increasingly, the value of this data has been recognized, for example in Genome research. But since researchers want to concentrate on their work (and for now are mostly judged by their recent output), backup is often neglected unless mandated by the researcher's institution or funders. Typical researcher needs we found include:

  • Sharing with trusted research assistants and collaborators
  • Checking the integrity of the data
  • Web publishing (to augment journal publishing) of parts you select
  • Using remote compute resources to analyze it, cheaply or paid by others
  • Capturing data from disconnected sources (laptops and instruments)
  • Making sure data does not leave when a person moves on to another position
  • Sharing your life's work as a legacy for preservation

Most larger institutions provide classic enterprise backup, but many researchers don't make use of it. Enterprise backup is a challenge for smaller institutions and for everyone in our increasingly mobile computing environment, making backup both more important and harder to accomplish than ever. Many researchers are very self-directed and protective of their work, often uncomfortable with institutional intrusion. It would be much better if there were immediate, visible benefits to the researcher resulting from backup, rather than just fulfilling mandates. By starting with backup, DfR is able to enter early in the research information life-cycle. This provides an early starting point for capturing context information about the research, essential information for re-use, provenance and integrity. Users later in the life-cycle would benefit automatically. While backup is of little interest to the researcher now, the same information can be used to help researchers find and organize their own materials. With modest additions, support for doing research (discussed below) — particularly indexing and correlation — can be added. This is the first and most clear case where providing service to the researcher (with little effort) up-front enables downstream services later, both for researchers and other users.

DfR has the potential to enable immediate benefits by exploiting improvements to our modern computing environment. With respect to backup, some of the most important changes include: near-ubiquitous networking, public and private clouds, file collaboration tools, social networking, Web identity, synchronization products and Internet backup products. Right now, products driving these changes are often optimized for specific purposes — particularly file collaboration versus backup — and rarely integrate well. Researchers are using these products already, often avoiding their institution's infrastructure, which provides a natural integration path for DfR. There is a convergence of these technologies happening that DfR can exploit, but the introduction of DfR must be gradual, and backup, though not the most exciting feature, is the essential starting point, followed by integration with file collaboration products.

Backup must be simple to set up and automated in its operation, or the researcher is less likely to accept it. We have good examples in products like Carbonite or CrashPlan. These products are converging with classic commercial enterprise backup products like Zmanda, Tivoli and Legato (or Amanda, which is free open source). DfR is best focused on finding and integrating with backup tools that permit it. Since DfR is free open source software (FOSS), it has no stake in any vendor lock-in.

The commercial products mentioned above are all good; indeed, DfR expects to be able to work with those that permit it. However, the commercial products are unlikely to:

  • Make it easy to integrate with any service you want, since they have commercial interests
  • Allow you to see and modify their code, and possibly use their protocols, since they are not open source
  • Help projects be more acceptable to funders or institutions requiring a data management plan and compliance, by automating data organization and the generation of context information
  • Provide a way other than disk/file copies to fit into your research infrastructure
  • Automate getting extended context information other than just the files themselves (more on this later)

In particular, DfR backup is focused on enabling the collection of research-related context information, since every backup tool can copy files. DfR backup is best described as "enterprise-class cloud backup" as shown in this article. It assumes that a hybrid infrastructure is used in the early part of the life-cycle, where the researcher (and/or institution) keeps one or more copies and DfR does also. This approach generally leaves the master, working copy within the researcher's infrastructure (though that may be a cloud collaboration product). We expect hybrid approaches to be used for the foreseeable future for a number of reasons, particularly data creation, the use of optimized research infrastructure (including supercomputers and instrumentation), and the higher performance and lower cost of LANs over WANs. At this time, DfR does not offer all the features that traditional enterprise backup systems offer but is complementary to them, particularly in offering ways to exploit private clouds between institutions and the public clouds. DfR offers a remote backup in addition to both the researcher's and the institution's backups. Disasters happen, and can encompass an institution, even a region. DfR offers the means to achieve as complete a distribution as you may require.

DfR provides backup with its "UpSync" software. UpSync monitors one or more file systems for changes and makes a remote backup copy. To start, UpSync must be installed somewhere in your computing infrastructure, on a client or server system. Upon first start it backs up all the files in the selected file systems and then sets a background monitor. At this time UpSync makes copies only to the remote server; DfR provides a separate restore tool to avoid overwriting active research files. UpSync can be used with any file system your system can access, which can include your local Google Drive and Box files (or similar collaboration tools) if you want. Since many users and institutions are familiar with installing simple commercial backup utilities such as Carbonite or CrashPlan, and collaboration tools such as Box.com or DropBox, installing UpSync should not be a barrier. Indeed, many institutions are offering commercial backup or collaboration products with free support from IT.

UpSync is different in that it is a modest beginning for backup products that can collect research context information. It is entirely under the control of the researcher or institution, both in its design and in its transparency as FOSS. It collects the material under settings, including security, controlled by the researcher. UpSync can be used in combination with other backup or collaboration tools but, while satisfactory for capturing the files, this approach currently reduces DfR's ability to capture context information, because other tools are designed only to capture the files and directories (with a little file metadata) and do not provide a way to capture additional context information. UpSync currently collects extended file and directory information, user information and synchronization history, and passes them back in a properties file that can be processed to populate common metadata schemas. It supports enrichment of metadata by using information from similar objects, projects or directory structures. It can be extended to collect more information from the researcher's environment, especially from automated tooling. For example, one Smithsonian research group is digitizing thousands of images that they place in a pre-defined directory structure. Metadata population tools and forms are associated with this directory structure so the metadata can be automatically populated. This has also helped the Smithsonian researchers think about the way they organize their data and to keep it organized more rigorously. A tool like UpSync, which can instrument backup to enrich metadata early in the research life-cycle, is needed to automate the capture of context information.
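
The sketch below gives a rough feel for what a monitor like UpSync does, using only the JDK's WatchService: watch a directory, gather a small properties file of context information for each new or changed file, and hand both off for upload. It is not the real UpSync code; the uploadToDuraCloud step is a stub, and the property names and directory are assumptions.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.file.*;
import java.util.Properties;

/**
 * Minimal illustration of a monitor-and-sync loop (not the real UpSync code).
 * Watches one directory, records simple context information in a .properties
 * sidecar file, and calls a placeholder upload method for each changed file.
 */
public class MonitorSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        Path watched = Paths.get(args.length > 0 ? args[0] : "research-data");
        WatchService watcher = FileSystems.getDefault().newWatchService();
        watched.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY);

        while (true) {
            WatchKey key = watcher.take();                 // block until something changes
            for (WatchEvent<?> event : key.pollEvents()) {
                Path changed = watched.resolve((Path) event.context());
                if (!Files.isRegularFile(changed)) continue;
                if (changed.toString().endsWith(".properties")) continue; // skip our own sidecar files

                Properties context = new Properties();     // context information to enrich metadata later
                context.setProperty("source.path", changed.toAbsolutePath().toString());
                context.setProperty("source.size", Long.toString(Files.size(changed)));
                context.setProperty("source.modified", Files.getLastModifiedTime(changed).toString());
                context.setProperty("source.user", System.getProperty("user.name"));

                Path sidecar = Paths.get(changed + ".properties");
                try (Writer out = Files.newBufferedWriter(sidecar)) {
                    context.store(out, "context information captured at backup time");
                }
                uploadToDuraCloud(changed, sidecar);        // placeholder: real transfer is SSL-encrypted
            }
            key.reset();
        }
    }

    private static void uploadToDuraCloud(Path data, Path context) {
        // Stub: a real implementation would send both files to the researcher's DuraCloud space.
        System.out.println("would upload " + data + " with " + context);
    }
}
```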

Above, we made a distinction between backup and collaboration tools. Since there is an overlap of features, we use backup to mean making one or more copies of a specified set of digital materials for the purpose of safekeeping. And we define collaboration as offering a sub-set of copies for sharing or collaboratively authoring original digital materials. The difference is in purpose, but products are usually optimized for their purpose. We don't want to over-emphasize the differences; rather, we want to integrate tools commonly used by researchers and have designed DfR to do that.

DfR assumes that research materials have a life-cycle and tools can be designed that offer different features depending on the point in the life-cycle. This can be controlled with policies set by the custodian of the materials at each point in the life-cycle. And the policies follow the research materials around so the choices of the original researchers are known. Much of this policy description and enforcement is yet to be implemented, but DfR has made a start.

DfR also provides libraries (APIs) to directly integrate your research infrastructure. For example, you can tap into the output of an instrument directly (future feature). In this way you can automatically add context information or trigger value-added services for the researcher (more on that later). If you use a workflow system like Taverna, Kepler, jBPM, Mule, Camel or a BPEL-based product, you can use backup as a step in the process.

DfR uses DuraCloud for storing the backed-up materials. DuraCloud is both free open source software and a service, typically used today by libraries to store archival content for access and preservation, but increasingly used to back up research data. The DuraCloud service is deployed in the cloud over multiple providers, both public and private. DuraCloud also provides a way for smaller institutions to work together to benefit from the "economy of scale" by sharing an infrastructure. Typically users interact directly with DuraCloud, backing up their content, though DfR adds the UpSync utility. In the background, DuraCloud runs services to ensure that the data retains integrity and remains secure, valid and durable. DfR still needs someone to pay for storage and compute use; that will never go away. But it will make it cheaper and easier, plus help make it more open. And, we hope, it will be more attractive for funding long term reuse and preservation.

But it all starts with remote backup.

Access and Discovery

DfR uses the cloud as a means to provide secure access to the research materials both for restoration and as an integration point for services in any networked computing environment.

While DfR starts with backup, we want to support the entire research information life-cycle. This means discovering, curating, and re-using research materials. There are many problems inherent in this goal. Most researchers quickly say "but no-one will understand my data, they should just read the publication." There is a lot of truth to this statement but, by improving our ability to capture the context information, we can incrementally get better. Where researchers are using a well proven methodology on new data, DfR can help automate rote data management processes. Simple issues such as calibrations or measurement methods can play havoc with the validity of the research materials in re-use. DfR does not solve all these issues but it does make a start. There is little doubt about the need to access backed-up data. But once it is readily available we can enable new uses too. And our common experience with search engines has shown that discovering information is very important. The researcher will want to access and discover research materials for many purposes including:

  • Restoring files that have been lost or walked off with mobile devices
  • Finding their own files including those that can only be identified by data features within them
  • Finding colleagues' files
  • Finding files containing data of interest to their current research: their own, their colleagues', and data they had no idea existed
  • Publishing their findings on the Web (at the right time) to enhance understanding of their work and facilitate citation

DfR provides access services at multiple levels and using multiple paths. For example, the Restore tool, DuraCloud access services, the Fedora Repository and SIdora (Drupal) can all be used. New services (applications) can be implemented in terms of the supplied interfaces. The SEE provides many off-the-shelf connectors that can be used as-is or with simple modifications. Each of these access points is integrated with the DfR security architecture (described below), thus providing the policy enforcement points needed.

Simple discovery support is available using the Solr search engine shared between Fedora and Drupal. Since DfR is event-driven, a JMS message is issued when new or revised files from UpSync are backed up. JMS messages are also issued whenever an operation is performed on Fedora. When Fedora has received the associated rich objects, Solr is signaled by Fedora to index both the object and, potentially, features extracted from the files. The same methods can be used for custom indexes that lend themselves to one type of search or another. For example, geo-spatial-temporal search of observations could be added since it is of interest to a number of communities.
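
As an illustration of that signaling path, the fragment below shows how a small listener could push a newly ingested object into Solr using SolrJ. The Solr URL, core name and field names are placeholders; the fields actually indexed by DfR/SIdora are configured elsewhere.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

/**
 * Illustrative indexing step (not the DfR/SIdora schema): when a
 * "new object ingested" message arrives, add a document to Solr.
 */
public class SolrIndexSketch {
    private final SolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/dfr").build();

    public void onObjectIngested(String pid, String title, String mimeType) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", pid);            // Fedora object identifier
        doc.addField("title", title);       // extracted or researcher-supplied metadata
        doc.addField("format", mimeType);   // characterization result from ROCS
        solr.add(doc);
        solr.commit();                      // make the object discoverable
    }
}
```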

Like access, DfR recognizes that policies for discovery are based initially on researcher choices and also on the point in the research information life-cycle. For example, these policies affect what is sent to the search engine (or filtered out of queries) to limit visibility, since the researcher may want to limit searches for work-in-progress to themselves or their team. Later, the results are opened up, especially post-publication. The possibilities are endless since DfR provides simple methods using the SEE to embed services that perform feature extraction or filtering functions.

The current demonstration uses SIdora (Drupal plus Islandora) to provide an application view of the data. It can limit what the user sees based on application-level controls. This includes the ability to edit and organize the data. In particular, it adds a second discovery method called navigation (browsing). We are familiar with navigating a directory tree to locate files. In conjunction with Fedora, you can use SIdora to navigate your data starting with an analog of your original directory tree. You can also organize your research materials by creating associations between items. This can be done manually in SIdora, or through tooling added to the ROCS. You can show the associations in a tree browser or use visualization tools that show maps (graphs). Like backup, the preferred method in DfR is to automatically record, as much as possible, associations found in the research materials. We can also use the SEE (as the ROCS does) to do post-processing on the backed up files at later dates, or triggered when new information is added. Or you can create derived products yourself and upload them (e.g. SIdora does this for its own derivatives like thumbnails generated from a workflow).

Security and Policy Enforcement

DfR addresses security and privacy issues in the cloud, working to find the best practices that support the research information life-cycle.

Security and privacy are important aspects of DfR especially because of the sensitivity of research data. Some data is sensitive because it is pre-publication and some may also be sensitive because it has legal compliance requirements. The cloud presents special considerations because the infrastructure (and data) are outside the researcher's auditable control and, in the case of the public cloud, the institution's. Data security is a complex and evolving issue (especially with the cloud); it is not even well established in law. Security becomes even more complex as considerations will change at different points in the research information life-cycle. A comprehensive discussion of security issues and technology is beyond the scope of this document, especially regarding data sharing. DfR provides support for several common security approaches and is designed to permit future extensions. In this version, DfR concentrates on security during the active (creation) phase of research, focused on the researcher and institution. However, we have also demonstrated some of the features needed for team sharing and Web publication.

UpSync — Requires that an account be established with the server (DuraCloud) and username/password credentials be entered when setting it up. All transmissions are encrypted via industry-standard SSL. UpSync will support Shibboleth in a future version.

DuraCloud — Provides simple access control lists to be used by services that directly operate on the backed up files. It manages a copy of all the cloud provider credentials so you don't have to make them available to use them. Your access is via a username/password combination or by Shibboleth for integration with your institution's identity system. Each researcher or research group is given one or more "spaces" that are private to their account. Normally you won't access your files outside the DfR (DuraCloud) service, to ensure consistent security enforcement, except in your local research systems. It is feasible for privileged services to access DuraCloud to execute compute services, an area that we anticipate could be used to support on-demand services for researchers (e.g. using Map-Reduce). A key point to understand is that DuraCloud security is "coarse-grained," meaning that the same policies extend to all files in a space — to the entire file and any services that run using it.

Fedora — Provides flexible, fine-grained control over files and the processes that run using them. Fedora is used as a mediator, permitting the addition of fine-grained controls to files backed up in DuraCloud. Downloads from Fedora may be encrypted via SSL and may be made accessible to specific users, groups or the public. A good example is permitting access via a publisher for data associated with an article but keeping unpublished data hidden. Fedora also permits adding services that can filter data, limiting access to just parts of a file. Security policies are expressed in XACML, a standard designed just for this purpose. Since XACML is stored in an XML file it can be copied to collaborating services to support distributed, policy-based security. Please note that distributed, policy-based security is an infant technology, and once a file is copied outside the managed infrastructure no method has been found to entirely control use, not even encryption. While XACML is very expressive, its use remains a work-in-progress, especially with regard to researchers. It is hard to design good policy languages that are both sufficiently general and easy to use. Our course is to create "domain-specific, fluent languages" and tools to build them. In the next section, we show that we have a simple one working already in the demonstration. Note that Fedora also supports Shibboleth.

Drupal (SIdora solution pack in Islandora) — Drupal provides its own security system (and is used by SIdora/Islandora). This is needed in order to make the user interface responsive visually to the user. SIdora (used for our demonstration) leverages Fedora to store persistent files and context data (metadata). To tie the two systems together, SIdora is able to read and write Fedora XACML. The user is unaware of the details and uses the SIdora security forms (based on Drupal). The data from those forms are kept in the Drupal security database and also in Fedora. SIdora enforces the security policies and also shows "decorations" like greyed-out functions to the user. Security data in SIdora is converted to XACML and stored back into Fedora. Privileged services (actually SIdora is a privileged service) can access Fedora directly. However, Fedora also enforces the XACML policies that are appropriate for its services. And anything stored in Fedora is stored in DuraCloud. Just to complete the list, Drupal supports Shibboleth.

SEE — The service execution environment uses Java security (JAAS) and additional security frameworks provided by Spring and Camel. These integrate with DuraCloud's ACL security, Fedora and any integration services. Drupal has an independent security system but there are proven integration methods. Each can leverage one or more identity control systems, including single sign-on systems or enterprise identity management. Shibboleth is one such service used by many institutions and Internet2. Shibboleth implements a version of SAML (Security Assertion Markup Language) that can pass information between services regarding what the user is permitted to do, making it more than just an identity management tool. DfR can also use Web security methods and WS-* (SOAP) security technology. This extends to services that choose to work with standard security frameworks or are willing to be deployed in a managed infrastructure.

Encryption — Outside of SSL and the security frameworks discussed above, there has been little talk about encryption so far. For now, the only encryption supported by DfR is user-based (the researcher's). Encryption requires a careful implementation in an infrastructure designed for long term access. If the researcher encrypts data before backing it up, then loses the key, the data is lost forever. User-based encryption would make some services supplied by DfR, such as checking the file format, impossible. DfR is not prepared at this time to incorporate server-based encryption nor to undergo the certifications that would be needed. However, the DfR architecture is designed to accommodate server-based encryption in a future release, though a service provider or a third party will need to keep a copy of the key as long as is needed for the research information life-cycle.

Finally, DfR keeps fixity, provenance and authenticity information about the data it stores. Any changes are logged and kept in audit trails. While this may not prevent an exploit that gains access to unencrypted data, it will ensure that you will know if any data was changed. This will also enable detection if an entire encrypted object was replaced or deleted.
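
A minimal sketch of what such a fixity check might look like: recompute a SHA-256 digest, compare it with the recorded value, and append the outcome to an audit trail. The digest algorithm and the audit-line format are assumptions for illustration; DuraCloud's own fixity services keep their own records.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.Instant;

/** Illustrative fixity check with an append-only audit line (formats are assumptions). */
public class FixityCheck {
    public static boolean verify(Path file, String recordedDigest, Path auditLog)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                sha256.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : sha256.digest()) hex.append(String.format("%02x", b));
        boolean matches = hex.toString().equalsIgnoreCase(recordedDigest);

        // Append the outcome to an audit trail so any change is detectable later.
        String entry = String.format("%s %s sha256=%s expected=%s result=%s%n",
                Instant.now(), file, hex, recordedDigest, matches ? "OK" : "MISMATCH");
        Files.write(auditLog, entry.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return matches;
    }
}
```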

Scalable and Durable

DfR uses the cloud as a means to provide scalable services at minimal cost, while automatically providing durability services that ensure fixity, integrity and authenticity.

Cloud-based storage and compute services have gathered considerable attention with the promise of reducing cost, increasing availability and removing impediments in provisioning resources just when needed. The cloud's purpose is to dramatically reduce the pain of obtaining compute resources and storage, plus the ongoing administration. It does not matter whether this is a private cloud provided by an institution or a for-profit public offering. A cloud may also provide a number of pre-configured services ready for use with minimal startup and operational administration. With the right execution environment, this enables the researcher or institution to pay only for the resources that are used rather than supporting an oversized infrastructure. These characteristics complement specialized resources such as supercomputer (high performance computing) centers, since, while a cloud can provide enormous parallel compute resources, it is optimized for commodity computing and storage. Similarly, a cloud does not substitute for laboratory instrumentation, personal computing resources or non-networked (field) computing. However, a cloud can provide easy provisioning and can provide a safe home for copies of research materials in digital form, augmenting the resources of the researcher.

A cloud is an approach for managing a group of commodity computing resources using a common administrative infrastructure. Many large higher-education institutions are in the process of converting to private clouds to make the best available use of their limited resources. Groups of smaller institutions are also collaborating to build shared non-profit clouds. There are also public for-profit clouds, notably Amazon's. Since these clouds are networked, services and storage can be used between any of them, noting that they are administered and billed separately. Generally, a cloud is identified by its provider, the organization that owns and manages it. DuraCloud (SAAS) acts as a cloud aggregator to provide a single technological interface and administrative point of contact (e.g. billing) to multiple clouds.

Remote backup to the cloud offers both advantages and disadvantages.

Advantages:

  • Research materials are stored in a widely-distributed location separate from the source materials, without effort by the researcher or institution
  • Unlimited space is available
  • Backups are automatic and do not require researcher intervention once set up
  • Access is available from everywhere
  • You can run compute services over the backed-up data at on-demand costs
  • We think (but this is yet to be determined) that funders may be willing to support long term storage if curation is being done by a "recognized" institution, permitting the researcher to offload this burden

Disadvantages:

  • Restoration of the data can be slow though there are ways to mitigate this
  • It can be hard to get help in doing restoration if complications arise
  • Security (privacy) can be a concern because the servers are not under your control (encryption is both a solution and a problem but data is always encrypted in transit)
  • Backup is limited by your local available user-to-cloud bandwidth
  • Running services locally on data kept in the cloud may be slower
  • The cost of storage is usually higher versus raw disk costs (but you need to assess total cost of ownership - TCO - to get the real picture)

DfR is built using DuraCloud, both as a "free" open source software product and as a non-profit "software as a service" (SAAS) that is independent of any single cloud provider. In essence, DuraCloud is an independent, non-profit organization working with any interested and compatible cloud provider, both private and public, with participants from many educational, research and scientific institutions. DuraCloud provides the services needed to ensure durability.

Clouds are not the only way to provide compute services. DfR is designed to have many on-ramps and off-ramps. Indeed, DfR is designed to facilitate integration with the researcher's local compute environment. However, our analysis is that clouds are a long term trend that has many of the characteristics needed for DfR. In particular, clouds offer the middle space between a researcher's local compute resources, collaborators, publishers/publishing and preservation/reuse archives.

Transparent but Managed

DfR provides many ways to support research and reduce the burden of data management, and it enriches and organizes the data first for the researcher and later for the archivist.

Our greatest concern is that researchers will view any added complexity in their data management environment as an intrusion. DfR can reduce the overall data management burden and provide new ways to provision services to researchers, but it will also introduce constraints. The introduction of the minimal necessary constraints must be balanced by features that make them worthwhile. The more a community uses common methods, the easier those methods are to automate — but the community of users has to adapt its approaches to fit the common methods. A community's inability to adapt has caused the failure of more software than technological problems have. DfR avoids a "one size fits all" mentality and is designed to support heterogeneity, but these goals must be balanced by an understanding of how much customization is feasible and by the recognition that every kind of data and process cannot be given equal attention. Perhaps the most important quality of research is innovation; despite researchers' protests, it is very likely that many of them use similar kinds of data and processes. Getting researchers to use DfR is both a social and a technological problem. Institutions often establish offices for "cultural change" when introducing the organizational changes needed to implement new automation. Some help may come from domain organizations or from institutions, but largely DfR must find natural incentives and a soft, gradual introduction of features (based on structured learning) into the research communities. DfR must be both transparent and managed to work: it must help the researcher easily provision tried-and-true methods where they are acceptable, but also permit methods to be customized as desired.

We do not know, and can find no source that already knows, what will work best for researchers. Moreover, it varies significantly from domain to domain: astronomy and genome data have very well structured formats, biology is very heterogeneous, and the earth sciences are somewhere in the middle but rapidly creating standards. Data formats and processes also move forward in time; in other words, everything will change and no standard is permanent. Therefore, DfR must be designed for change. Key to supporting change is support for heterogeneous digital materials (data et al.). DfR takes the proven SOA approach: it uses discrete services and integrates them by transforming message formats between one service and the next. Combined with effective use of registries and a lightweight governance model, this balances the needs of automation with the nature of DfR's users. The more standards are used, the more economies are available and the more interoperability is possible. Researchers will have to do less work to get their data management provisioned, yet remain free to innovate, as long as we eventually learn the formats and processes used in their work so that they can be supported.
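
To make the message-transformation point concrete, the sketch below shows the kind of route the SEE could run, assuming Apache Camel's Java DSL with ActiveMQ queues and an XSLT stylesheet on the classpath. The queue names, the stylesheet and the route itself are illustrative assumptions, not DfR's shipped configuration.

    import org.apache.camel.CamelContext;
    import org.apache.camel.builder.RouteBuilder;
    import org.apache.camel.impl.DefaultCamelContext;

    // Route a message describing newly uploaded research material, transform it into
    // the format a downstream enrichment service expects, and hand it off.
    public class EnrichmentRoute extends RouteBuilder {

        @Override
        public void configure() {
            from("activemq:queue:dfr.upload.completed")   // event emitted by the backup service
                .to("xslt:transforms/upload-to-rocs.xsl") // translate between message formats
                .to("activemq:queue:dfr.rocs.requests");  // hand off to the enrichment service
        }

        public static void main(String[] args) throws Exception {
            CamelContext context = new DefaultCamelContext();
            context.addRoutes(new EnrichmentRoute());
            context.start();
            Thread.sleep(60_000); // keep the route running briefly for the example
            context.stop();
        }
    }

Because the transformation lives in the route rather than in either service, each side can change its message format without the other having to know.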

DfR is predicated on the idea that the more transparently services are incorporated into the tools a researcher uses on a daily basis, the more they will be accepted and used. This is the approach successfully used by HPC centers and other big-science projects such as the Large Hadron Collider at CERN; other notable successes include the tooling incorporated into grid computing, plus Taverna and Kepler. DfR can bring a more lightweight approach to small and medium-size projects. Finding the minimal degree of management is the approach behind all successful infrastructures, such as the Internet, the World Wide Web and the cloud. DfR's goals are much more modest: capturing a copy of research data, with the critical context data, needed for the research information life-cycle.

Aspects that need to be managed:

  • Remote Backup — it requires the installation of a backup service (and perhaps integration with other backup and collaboration tools), requires accounts, and must be monitored automatically
  • Data Formats — they must be given unique identifiers even if their details are not known, cross-referenced with format registries where possible and, if commonly used, standardized by the cognizant communities (a registry sketch follows this list)
  • Service Interfaces (only those in widespread use and accepted into the core) — much the same as data formats, but including the APIs, protocols and business semantics (open source software communities do this every day)
  • Providers — providers can stand alone or be loosely federated, but if they agree to interoperate there is always a degree of governance involved, and they must provide the service level they promise
  • Security — DfR needs to provide off-site security, transparently interface with user and institutional identity management, authentication and authorization infrastructures, and be policy-driven and distributable without requiring intrusive centralization
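
As a small illustration of the data-format item above, the class below assigns every format a stable identifier, minting an opaque one when the format is unknown. The identifiers and the extension-based lookup are hypothetical; a real deployment would cross-reference an external format registry such as PRONOM rather than a hard-coded map.

    import java.util.Map;
    import java.util.Optional;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;

    public class FormatRegistry {

        // Hypothetical local identifiers for a few common research formats.
        private final Map<String, String> knownFormats = new ConcurrentHashMap<>(Map.of(
                "csv",  "dfr-format:tabular-csv",
                "fits", "dfr-format:astro-fits",
                "tif",  "dfr-format:image-tiff"));

        // Return the registered identifier for a file extension, or mint an opaque
        // identifier so unknown formats can still be tracked and standardized later.
        public String identify(String extension) {
            String key = extension.toLowerCase();
            return Optional.ofNullable(knownFormats.get(key))
                    .orElseGet(() -> knownFormats.computeIfAbsent(
                            key, ext -> "dfr-format:unregistered-" + UUID.randomUUID()));
        }
    }

Even the minted identifiers are stable for the lifetime of the registry, so material in an unknown format can be grouped and described once the cognizant community settles on a standard.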

Most of the items in the list above are accomplished by a combination of communities and registries, as well as by the DfR architecture. Much of this has already been done to some degree using enterprise techniques that may not be suitable for most researchers and their institutions. Simpler and more distributable methods are being developed, often driven by business integration over the Internet. DfR is on the cutting edge of understanding these issues and is pre-staging the elements needed to introduce solutions. Like all successful infrastructures, DfR is not about introducing a breakthrough disruptive technology (though we are glad to use them) but about balancing competing design characteristics.

A key example was discussed above demonstrating how Fedora and SIdora are able to share Content Models, XACML policies and objects. DfR and DuraCloud use discrete, mostly REST-based services, and everything is eventually serialized and stored in multiple remote locations. Through SIdora we have shown model-driven (data-driven) forms for creating and handling metadata, plus visualization tools. Every major component can be replaced without rewriting the whole system, and a hybrid approach was demonstrated through UpSync, DuraCloud (and distributed cloud partners), Fedora and SIdora (Drupal), each leveraging the best of the others.
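
As an illustration of the discrete, REST-based service style, the sketch below stores one serialized file with a single HTTP PUT using Java's built-in HTTP client. The host, space and content identifiers are placeholders rather than documented DuraCloud endpoints, and authentication is omitted.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Path;

    public class RestStoreExample {

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            // PUT a serialized research file to a placeholder storage URL.
            HttpRequest put = HttpRequest.newBuilder()
                    .uri(URI.create("https://example.duracloud.org/durastore/my-space/observations-2012-03.csv"))
                    .header("Content-Type", "text/csv")
                    .PUT(HttpRequest.BodyPublishers.ofFile(Path.of("observations-2012-03.csv")))
                    .build();

            HttpResponse<String> response = client.send(put, HttpResponse.BodyHandlers.ofString());
            System.out.println("Store returned HTTP " + response.statusCode());
        }
    }

Because each operation is an ordinary HTTP call on a serialized object, any component in the hybrid chain can be swapped out as long as it speaks the same simple interface.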

Conclusion

First and foremost, DfR must provide services that help the researcher. By succeeding in this, the same services (and information derived from them) become critical for archival activities and the research information life-cycle.

DfR is an integration framework and free open source implementation that enables services for research data management. It starts with backup but can evolve into a far-reaching support infrastructure that facilitates the research information life-cycle. The current release, 0.3, demonstrates end-to-end support for a limited set of services. DfR leverages cloud resources to make these services ubiquitous and to ensure the data is durable. DfR 0.3, while incomplete, is sufficient for public release and continued development. This permits beginning a series of small managed service (SaaS) offerings, both to help researchers and to test what is helpful and acceptable to them. If researchers support DfR, then we are able to integrate services that satisfy the needs of other actors in the research information life-cycle.

Key aspects of this project include:

  • Design for change using SOA principles, largely through decoupling, a focus on data transformations and promoting the re-use of services
  • Use of messaging (event-driven), integration frameworks and orchestration where they help decoupling and re-use without introducing unneeded complexity (a messaging sketch follows this list)
  • Preferring asynchronous operations and eventual consistency where feasible
  • Supporting heterogeneity to the greatest possible degree in both data and processing
  • Being as data-driven as possible, using models to help automate this
  • Using registries to help create lightweight, distributed governance processes and supporting business logic
  • Promoting the use of layered identifiers, formats and protocols to manage complexity
  • Preferring well-supported off-the-shelf technology and standards to the greatest degree possible, concentrating on integration and on convention over configuration
  • Using a hybrid approach, leveraging the researcher, the institution and the cloud together
  • Using structured learning to manage incremental implementation
  • Above all, capturing both research data and context materials early and everywhere that they are available — and copying them into the infrastructure
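
The messaging sketch referenced in the list above: a component that finishes an upload publishes a small event to a queue and moves on, while enrichment or indexing services consume it whenever they are ready, which is the asynchronous, eventually consistent style described. The broker URL, queue name and JSON payload are illustrative assumptions, using the plain ActiveMQ/JMS client API.

    import javax.jms.Connection;
    import javax.jms.MessageProducer;
    import javax.jms.Session;
    import javax.jms.TextMessage;
    import org.apache.activemq.ActiveMQConnectionFactory;

    public class UploadEventPublisher {

        public static void main(String[] args) throws Exception {
            ActiveMQConnectionFactory factory =
                    new ActiveMQConnectionFactory("tcp://localhost:61616"); // illustrative broker URL

            Connection connection = factory.createConnection();
            connection.start();
            try {
                Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                MessageProducer producer =
                        session.createProducer(session.createQueue("dfr.upload.completed"));

                // A small, self-describing event; consumers decide what to do with it later.
                TextMessage event = session.createTextMessage(
                        "{\"contentId\": \"observations-2012-03.csv\", \"space\": \"my-space\"}");
                producer.send(event);
            } finally {
                connection.close();
            }
        }
    }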

We have left a number of aspects needing further work. In particular:

  • It is very hard to find researchers willing to be early adopters, so we are only just at the point of having enough to draw them out and learn from them
  • Cloud-based security, including encryption, is still a work in progress and needs further development
  • Distributed policy enforcement is an infant technology but is important for implementing loosely coupled governance and federating infrastructure resources
  • Communities must link up to make a comprehensive infrastructure rather than go it alone

DfR has successfully demonstrated the basics of an integrated infrastructure for the research information life-cycle. Assembling an infrastructure takes a long time and a portfolio of projects supported by many communities, but the technologies needed to start are sufficiently mature and many aspects of the solutions are well understood. Parts of DfR are moving into production with DuraCloud, notably UpSync; others, particularly ROCS with the SEE, will be extended in collaboration with the Smithsonian and APTrust. However, much work remains in forming sustainable coalitions that will collaborate on service and data integration.

Additional Resources

The project data model draws on the University of Illinois publication "Identifying Content and Levels of Representation in Scientific Data".
