Introduction

Exchanging repository contents

Most sites on the Internet are designed for human readers. HTML is a good format for building websites, but it is not a good format for exporting data in a way a computer can work with. Like most repository software, DSpace supports OAI-PMH as an interface for exporting stored data. OAI-PMH is well known in the repository community but little known elsewhere (for example, Google retired its support for OAI-PMH in 2008). The Semantic Web is a generic approach to publishing data on the Internet together with information about its semantics. The W3C has released standards such as RDF and SPARQL for publishing structured data on the Web in a way computers can easily work with. Data stored in repositories is particularly well suited to the Semantic Web because the metadata is already available: it does not have to be generated or entered manually before publication as Linked Data. For most repositories, and certainly for Open Access repositories, sharing the stored content is important. Linked Data gives repositories an opportunity to present their content so that it can easily be accessed, interlinked and (re)used.

Terminology

We do not give a full introduction to the Semantic Web and its technologies here, as many can be found on the web. Nevertheless, the short glossary below explains the terms used most often in this context, to make the following documentation more readable.

Semantic Web	The term "Semantic Web" refers to the part of the Internet containing Linked Data. As in the World Wide Web the Semantic Web is created by links between the data.
Linked Data, Linked Open Data	Linked Data is data in RDF that follows the Linked Data Principles. The Linked Data Principles describe the behavior expected of data publishers, which ensures that published data is easy to find, easy to retrieve, easy to link to, and itself links to other data. Linked Open Data is Linked Data published under an open license. Technically there is no difference between Linked Data and Linked Open Data (often abbreviated as LOD); it is only a question of the license used for publication.
RDF, RDF/XML, Turtle, N-Triples, N3-Notation	RDF is an acronym for Resource Description Framework, a metadata model. Do not think of RDF as a format; it is a model. There are, however, different formats for serializing data that follows RDF. RDF/XML, Turtle, N-Triples and N3-Notation are probably the best-known of these serialization formats.
Triple Store	A triple store is a database to natively store data following the RDF approach.
SPARQL	The SPARQL Protocol and RDF Query Language is a query language and protocol for querying triple stores. Since version 1.1, SPARQL can also be used to manipulate triple stores, that is, to store, delete or update data in them.
SPARQL endpoint	A SPARQL endpoint is a SPARQL interface of a triple store. Since SPARQL 1.1, a SPARQL endpoint can be either read-only, allowing only queries on the stored data, or read-write, also allowing the stored data to be modified.
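
As a rough illustration of the terms above, the following Python sketch serializes a single triple as an N-Triples line and builds the URL for a read-only SPARQL query sent via HTTP GET (the SPARQL 1.1 Protocol passes the query text in a "query" parameter). The item URI, the endpoint address and the helper names escape_literal, ntriple and sparql_get_url are assumptions made for this example; they are not part of DSpace. The sketch handles only IRIs and plain string literals and does not send any request.

```python
from urllib.parse import urlencode


def escape_literal(value: str) -> str:
    """Escape the characters that must be escaped in an N-Triples string literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def ntriple(subject: str, predicate: str, literal: str) -> str:
    """Serialize one triple (IRI, IRI, plain literal) as a single N-Triples line."""
    return f'<{subject}> <{predicate}> "{escape_literal(literal)}" .'


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build the URL for a SPARQL query via HTTP GET; the request is not sent."""
    return f"{endpoint}?{urlencode({'query': query})}"


# Hypothetical values, for illustration only.
ITEM = "http://hd.handle.net/123456789/1"
TITLE = "http://purl.org/dc/terms/title"
ENDPOINT = "http://localhost:3030/dspace/sparql"

# One statement about the item: subject, predicate, object.
line = ntriple(ITEM, TITLE, 'A "quoted" title')

# A read-only query asking the triple store for the same statement.
query = f"SELECT ?title WHERE {{ <{ITEM}> <{TITLE}> ?title }}"
url = sparql_get_url(ENDPOINT, query)

print(line)
print(url)
```

The same triple could equally be written in Turtle or RDF/XML; only the serialization changes, not the underlying RDF model.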

Linked (Open) Data Support within DSpace

Starting with version 5.0, DSpace can provide stored contents as Linked (Open) Data.

Architecture / Concept

To publish content stored in DSpace as Linked (Open) Data, the data has to be converted into RDF. The conversion has to be configurable, as different DSpace instances may use different metadata schemata, different persistent identifiers (DOI, Handle, ...) and so on. Depending on the content to convert, the configuration and other parameters, the conversion may be time-consuming and resource-intensive. Repository content is read much more often than it is created, deleted or changed, as the main purpose of a repository is to store its content safely. For these reasons, content stored within DSpace is stored in a triple store after conversion. The triple store serves as a cache and provides a SPARQL endpoint that makes the converted data accessible using SPARQL. The conversion is triggered by a consumer of the DSpace event system and can also be started manually using a command line interface (both are documented below).

Besides the SPARQL endpoint, the data should also be published as RDF serializations. With dspace-rdf, DSpace offers a module that loads converted data from the triple store and provides it as an RDF serialization (it currently supports RDF/XML, Turtle and N-Triples).

Repositories use persistent identifiers to make content citable and to address content. Following the Linked Data Principles, DSpace uses persistent identifiers in the form of HTTP(S) URIs, converting a handle to http://hd.handle.net/<handle> and a DOI to http://dx.doi.org/<doi>.

Taken together, the Linked Data support of DSpace extends all three layers: the storage layer with a triple store, the business logic layer with classes that convert stored content into RDF, and the application layer with a module that publishes RDF serializations. Just as you can use DSpace with Oracle or PostgreSQL, you may choose between different triple stores. The only requirements are that the triple store supports SPARQL 1.1, because DSpace uses SPARQL 1.1 to store, update and delete converted data in the triple store, and that it provides a public read-only SPARQL endpoint.
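
The mapping from persistent identifiers to HTTP URIs described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code DSpace uses: the function names handle_to_uri and doi_to_uri, the percent-encoding of unusual characters and the stripping of a leading "doi:" prefix are all assumptions made for this example.

```python
from urllib.parse import quote

# URI prefixes as given in the text above.
HANDLE_PREFIX = "http://hd.handle.net/"
DOI_PREFIX = "http://dx.doi.org/"


def handle_to_uri(handle: str) -> str:
    """Turn a handle such as '123456789/1' into an HTTP URI."""
    # Assumption: percent-encode anything that is not URL-safe, keep '/'.
    return HANDLE_PREFIX + quote(handle.strip(), safe="/")


def doi_to_uri(doi: str) -> str:
    """Turn a DOI such as '10.5072/example-1' into an HTTP URI."""
    doi = doi.strip()
    # Assumption: accept an optional 'doi:' prefix and drop it.
    if doi.lower().startswith("doi:"):
        doi = doi[len("doi:"):]
    return DOI_PREFIX + quote(doi, safe="/")


# Hypothetical identifiers, for illustration only.
print(handle_to_uri("123456789/1"))
print(doi_to_uri("doi:10.5072/example-1"))
```

Using an HTTP URI as the identifier means the same string both names the item in RDF and can be dereferenced by a client to retrieve it.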
