This page proposes a set of changes to improve the representation of file
formats in the content model, in order to better support preservation
activities. The design applies to DSpace 1.5 or later.
This work is to be done as part of the FACADE project, but it is designed
and intended as a generally useful extension to the platform.

...

Note that this implies more than just knowing the common name of a Bitstream's
format, e.g. "Adobe PDF". That name actually describes a family of formats.
In order to know exactly how to recover the
intelligence in a particular Bitstream, you'd want to know which specific version
of PDF it is: later versions have features not found in earlier ones.
The "internal structure and encoding" imposed by a data format is
usually defined in exacting detail by a format specification document,
and/or by the software applications that produce and consume that format.

...

The original DSpace design intentionally avoided the issue of describing
data formats in such detail because there were already other efforts underway
to thoroughly catalog data formats – and DSpace would eventually leverage
their work. As of June 2007, the most sophisticated data format registries
are still in development, but some usable systems are operating in production.
We propose to integrate external data format intelligence through
a flexible plugin-based architecture, to take advantage of what is
currently available while leaving a clear path for future upgrades and changes.
It also lets each DSpace installation choose an appropriate level of
complexity and detail in its format support.

...

  • Each BSF represents a description of a single, unique data format; there is exactly one BSF for each distinct data format referenced by Bitstreams in the DSpace archive.
  • A BSF is bound to one or more entries in external data format registries.
    • The identifiers are logically all peers, although the metadata cached in the BSF is only imported (or updated) from one of them.
    • All external format identifiers which describe the same format must be bound to the same DSpace BSF. In other words, there should never be two BSFs describing the same conceptual format, such as "PDF Version 1.2"; one BSF encompasses all synonymous external identifiers.
  • The BSF's function is to describe the data format of the contents of a Bitstream, and nothing more.
    • Application code must not "overload" a BSF with additional implicit meanings, such as marking Bitstreams invisible in a UI or indicating a function such as the deposit license.
  • One special BSF, the unknown format, represents the unknown or unidentified data format.
  • Every Bitstream refers to exactly one BSF:
    • If its format has not been assigned or identified, it is the unknown format.
    • This allows an application to assume every Bitstream has a valid BSF with all of its attendant properties, so e.g. it can get a valid MIME type.
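
The last two rules can be illustrated with a minimal Java sketch. The classes `SketchFormat` and `SketchBitstream` are hypothetical stand-ins invented for this illustration, not the DSpace classes; the fallback MIME type is the `application/octet-stream` default proposed later on this page.

```java
// Hypothetical stand-ins for illustration only; not the real DSpace classes.
final class SketchFormat {
    // The single shared instance that represents the unknown format.
    static final SketchFormat UNKNOWN =
        new SketchFormat("Unknown", "application/octet-stream");

    private final String name;
    private final String mimeType;

    SketchFormat(String name, String mimeType) {
        this.name = name;
        this.mimeType = mimeType;
    }

    String getName() { return name; }
    String getMIMEType() { return mimeType; }
    boolean isUnknown() { return this == UNKNOWN; }
}

final class SketchBitstream {
    private SketchFormat format; // null until a format is assigned

    void setFormat(SketchFormat format) { this.format = format; }

    // Never returns null: an unassigned format reads as the unknown format.
    SketchFormat getFormat() {
        return format == null ? SketchFormat.UNKNOWN : format;
    }
}
```

With this shape, calling code can ask any Bitstream for its MIME type without a null check, which is the point of reserving a dedicated unknown format.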

...

| Namespace | Description |
|---|---|
| DSpace-Internal | Contains only the Unknown identifier, a placeholder for the unknown format which represents an unidentified Bitstream format. This is the only mandatory namespace which is automatically configured. |
| DSpace | Contains most of the original generic formats defined by DSpace, for backward-compatibility and for archives which do not care about precise data format descriptions. |
| Provisional | For custom data formats local to the archive. Provisional extensions to the "DSpace" format registry are put in their own namespace so there is no chance of a conflict with formats added later to "DSpace", and also to make their status as local extensions obvious. |
| PRONOM | PUIDs from PRONOM's format registry. |
| GDFR | Persistent identifiers from the Global Distributed Format Registry. |
| LOC | Library of Congress Sustainability of Digital Formats project format descriptions. |

The standard namespace values are available as public static fields
on the <tt>FormatRegistryManager</tt>
class. The LOC namespace
is not really a registry yet but it makes sense to reserve
the namespace since it is a significant source of format technical metadata.

...

BitstreamFormat Properties

| Property | Source | Mod | Description |
|---|---|---|---|
| Name | Registry, Can override | Yes | Brief, human-readable description of this format, for listing it in menus. Used to be "short description". |
| Description | Registry, Can override | Yes | Detailed human-readable explanation of the format including its unique aspects. |
| Identifier | Registry | No | List of all namespaced identifiers linking this BSF to an entry in an established data format registry. A BSF must have at least one identifier. This list is ordered; the first member names the external registry entry that was originally imported to create this BSF. |
| Support-level | User-entered | Yes | Encoding of the local DSpace archive's policy regarding preservation of Bitstreams encoded in this format. Value must be one of: (1) Unset - Policy not yet initialized, flags format entries that need attention from the DSpace administrator. (2) Unrecognized - Format cannot be identified. (3) Known - Format was identified but preservation services are not promised. (4) Supported - Bitstream will be preserved. |
| MIME-type | Registry, Can override | Yes | Canonical MIME type (Internet data type) that describes this format. This is where the <tt>Content-Type</tt> header's value comes from, when delivering a Bitstream by HTTP. |
| Extension | Registry, Can override | Yes | The canonical filename extension to apply to unnamed Bitstreams when delivering content over HTTP and in DIPs. (NOTE: Some format registries have a list of filename extensions, used to help identify formats, but we only need the canonical extension in the BSF model.) |
| LastUpdated | System | No | Timestamp when this BSF was imported or last updated from its home registry. |
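
One way to read the "Registry, Can override" source is sketched below in Java. This is an interpretation, not part of the proposal: the class `OverridableValue` is hypothetical, and the rule that a registry update is skipped while the override flag is set is an assumption about how the flag would behave.

```java
// Hypothetical helper; the real BitstreamFormat may store these values differently.
final class OverridableValue {
    private String value;       // the value currently in effect
    private boolean overridden; // true when a local value masks the registry's

    OverridableValue(String registryValue) {
        this.value = registryValue;
    }

    // An administrator sets a local value, which raises the override flag.
    void setLocal(String localValue) {
        this.value = localValue;
        this.overridden = true;
    }

    // ASSUMPTION: a registry update is ignored while the override flag is set.
    void updateFromRegistry(String registryValue) {
        if (!overridden) {
            this.value = registryValue;
        }
    }

    // Clearing the flag lets the next registry update replace the local value.
    void clearOverride() {
        this.overridden = false;
    }

    boolean isOverridden() { return overridden; }
    String get() { return value; }
}
```

Under this reading, clearing the flag does not by itself restore the registry value; the value changes only at the next update from the registry.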

...

Add the following methods:

// Returns all external identifiers bound to this BSF.
public String[] getIdentifiers()
    throws SQLException, AuthorizeException;

// Add a binding to an external identifier.
// Versions to accept separate namespace and identifier, or namespaced identifier.
public void addIdentifier(String nsIdentifier)
    throws SQLException, AuthorizeException;
public void addIdentifier(String namespace, String identifier)
    throws SQLException, AuthorizeException;

// Remove a binding to an external identifier.
public void deleteIdentifier(String nsIdentifier)
    throws SQLException, AuthorizeException;
public void deleteIdentifier(String namespace, String identifier)
    throws SQLException, AuthorizeException;

// Find BSF bound to an external identifier, returns null if none found.
// Versions to accept separate namespace and identifier, or namespaced identifier.
public BitstreamFormat findByIdentifier(Context context, String nsIdentifier)
    throws SQLException, AuthorizeException, FormatRegistryException;
public BitstreamFormat findByIdentifier(Context context, String namespace, String identifier)
    throws SQLException, AuthorizeException, FormatRegistryException;

// Advanced version with extra "import" parameter (named doImport here, since
// "import" is a reserved word in Java): when false, do NOT look for the
// format in registries, but return null if there is no existing BSF
// matching the indicated format. (Mainly for internal use.)
public BitstreamFormat findByIdentifier(Context context, String namespace, String identifier, boolean doImport)
    throws SQLException, AuthorizeException, FormatRegistryException;

// Return true if this BSF conforms to the target identifier, i.e. if it
// would be acceptable to a service that accepts the given format.
public boolean conformsTo(String nsIdentifier)
    throws SQLException, AuthorizeException, FormatRegistryException;

// Return/set the canonical filename extension (without ".").
public String getCanonicalExtension();
public void setCanonicalExtension(String extension);

// Get and set the Name (takes the place of ShortDescription).
public String getName();
public void setName(String s);

// Flags that show whether to override the values picked up from the external registry.
// Set to false to remove the override.
public boolean isOverrideName();
public void setOverrideName(boolean val);
public boolean isOverrideDescription();
public void setOverrideDescription(boolean val);
public boolean isOverrideMIMEType();
public void setOverrideMIMEType(boolean val);
public boolean isOverrideCanonicalExtension();
public void setOverrideCanonicalExtension(boolean val);

// Return true if this BSF is the unknown format.
public boolean isUnknown();

// Return the date this external identifier was last imported to the BSF.
public Date getLastImported(String namespace, String id)
    throws SQLException, AuthorizeException;
public Date getLastImported(String nsIdentifier)
    throws SQLException, AuthorizeException;

// Set the date this external identifier was last imported to the BSF.
public void setLastImported(String namespace, String id, Date newdate)
    throws SQLException, AuthorizeException;
public void setLastImported(String nsIdentifier, Date newdate)
    throws SQLException, AuthorizeException;
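
The lookup-then-import behavior of findByIdentifier might work roughly as in the following Java sketch. It is a simplified illustration under stated assumptions, not the proposed implementation: the class `FindSketch` is hypothetical, formats are represented by plain name strings, the registry is replaced by a function argument, and the flag is called `doImport` because `import` is a reserved word in Java. It also assumes that the shorter variants behave as if the flag were true.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in: a format is represented only by its name.
final class FindSketch {
    // Bindings from namespaced identifier to the name of the bound format.
    private final Map<String, String> bound = new HashMap<>();
    // Stand-in for a registry import; returns null when the registry has no entry.
    private final Function<String, String> registryImport;

    FindSketch(Function<String, String> registryImport) {
        this.registryImport = registryImport;
    }

    String findByIdentifier(String nsIdentifier, boolean doImport) {
        String existing = bound.get(nsIdentifier);
        if (existing != null || !doImport) {
            // Found locally, or the caller asked for a local lookup only.
            return existing;
        }
        String imported = registryImport.apply(nsIdentifier);
        if (imported != null) {
            // Remember the binding so later lookups succeed locally.
            bound.put(nsIdentifier, imported);
        }
        return imported;
    }
}
```

A caller that only wants to test for an existing binding passes false and treats null as "not yet imported".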

Remove these methods:

findByShortDescription()
findByMIMEType()
findNonInternal()

isInternal()
setInternal()

getShortDescription() // renamed to getName()
setShortDescription() // renamed to setName()

getExtensions()
setExtensions()

...

These existing methods are retained; they are listed here only for completeness.

static BitstreamFormat    create(Context context);
void                      delete();
static BitstreamFormat    find(Context context, int id);
static BitstreamFormat    findUnknown(Context context);
static BitstreamFormat[]  findAll(Context context);
String                    getDescription();
int                       getID();
String                    getMIMEType();
int                       getSupportLevel();
static int                getSupportLevelID(String slevel);
void                      setDescription(String s);
void                      setMIMEType(String s);
void                      setSupportLevel(int sl);
void                      update();

<tt>FormatRegistryManager</tt>

...

Here is a sketch of the API:

public class FormatRegistryManager
{
    // Namespace for the internal format registry - contains only "Unknown".
    public static final String INTERNAL_NAMESPACE = "Internal";

    // Name of the unknown format:
    public static final String UNKNOWN_FORMAT_IDENTIFIER = "Unknown";

    // Applications should use this as the default MIME type.
    public static final String DEFAULT_MIME_TYPE = "application/octet-stream";

    // Returns possibly-localized human-readable name of the Unknown format.
    public static String getUnknownFormatName(Context context);

    // Returns the registry plugin for an external format identifier namespace.
    public static FormatRegistry find(String namespace);

    // Returns array of *all* namespace strings, even "artifacts" no longer configured.
    public static String[] getAllNamespaces(Context context)
        throws SQLException, AuthorizeException;

    // Returns array of the namespaces of all currently configured external registries.
    public static String[] getRegistryNamespaces();

    // Calls the appropriate registry plugin to import the format bound to a namespaced identifier.
    // Returns null on error.
    public static BitstreamFormat importExternalFormat(Context context, String namespace, String identifier)
        throws FormatRegistryException, AuthorizeException;

    // Calls the appropriate registry plugin to update the format bound to a namespaced identifier.
    // When force is true, update even when the external format has not been modified.
    public static void updateBitstreamFormat(Context context, BitstreamFormat existing, String namespace, String identifier, boolean force)
        throws FormatRegistryException, AuthorizeException;

    // Calls the appropriate registry plugin to compare two namespaced
    // identifiers (which must be in the same namespace).
    public static boolean conformsTo(String nsIdent1, String nsIdent2)
        throws FormatRegistryException;

    // Creates a namespaced identifier out of separate namespace and registry-specific identifier.
    public static String makeIdentifier(String namespace, String identifier);

    // Returns the namespace or identifier portion of a namespaced identifier.
    public static String namespaceOf(String nsIdentifier);
    public static String identifierOf(String nsIdentifier);
}
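
The three identifier helpers at the end of the sketch could be implemented along the following lines. This is only an illustration: the proposal does not specify the syntax of a namespaced identifier, so the `:` separator is an assumption, and the class name `NamespacedIds` is hypothetical.

```java
// Hypothetical helper class; the separator character is an assumption.
final class NamespacedIds {
    // ASSUMPTION: ':' separates namespace from identifier.
    private static final char SEPARATOR = ':';

    static String makeIdentifier(String namespace, String identifier) {
        if (namespace.indexOf(SEPARATOR) >= 0) {
            throw new IllegalArgumentException(
                "namespace must not contain '" + SEPARATOR + "': " + namespace);
        }
        return namespace + SEPARATOR + identifier;
    }

    static String namespaceOf(String nsIdentifier) {
        return nsIdentifier.substring(0, splitPoint(nsIdentifier));
    }

    static String identifierOf(String nsIdentifier) {
        return nsIdentifier.substring(splitPoint(nsIdentifier) + 1);
    }

    // Split at the first separator, so the identifier part may itself contain one.
    private static int splitPoint(String nsIdentifier) {
        int i = nsIdentifier.indexOf(SEPARATOR);
        if (i < 0) {
            throw new IllegalArgumentException(
                "not a namespaced identifier: " + nsIdentifier);
        }
        return i;
    }
}
```

Splitting at the first separator matters because registry-specific identifiers are opaque strings chosen by each registry, and some may contain the separator character themselves.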

<tt>FormatRegistry</tt>

The <tt>FormatRegistry</tt> interface models an external data format
registry.
We define a data format registry as any formally organized and administered
collection of technical metadata about data formats.
This may include a collection published mainly for human consumption,
such as the Library of Congress Sustainability of Digital Formats
format catalog, as well as those accessible through public APIs, such
as the GDFR and DROID.
The only requirement is that the data formats are named by unchanging,
unique identifiers.

...

Here is the API of the <tt>FormatRegistry</tt>.
The plugin's name is also the DSpace string value representing its namespace.
It is implemented as a self-named plugin, so that the instance itself
knows its namespace without depending on each DSpace administrator to get
it right. The namespaces must be consistent between DSpace installations
so that format technical metadata (i.e. PREMIS elements in AIPs) can
be meaningfully exchanged.

// Implementing classes should extend SelfNamedPlugin.
package org.dspace.content.format;

public interface FormatRegistry
{
    // Typically returns 1 element, the namespace name of the implementation's registry.
    String[] getPluginNames();

    // Returns the DSpace namespace of this registry.
    public String getNamespace();

    // Return a URL needed to configure the underlying registry service;
    // this allows the registry to configure itself from the DSpace
    // configuration.
    public URL getContactURL();

    // Returns all external identifiers known to be synonyms of the
    // given one, in namespaced-identifier format. (Because one registry
    // may know about synonyms in other registries.)
    public String[] getSynonymIdentifiers(Context context, String identifier)
        throws FormatRegistryException, AuthorizeException;

    // Import a new data format - returns a BitstreamFormat. There must
    // not be any existing BSF with the same namespace and identifier.
    public BitstreamFormat importExternalFormat(Context context, String identifier)
        throws FormatRegistryException, AuthorizeException;

    // Compare existing DSpace format against registry, updating anything that's changed.
    // NOTE: it does not need to check the last-modified date; the framework does that.
    public BitstreamFormat updateBitstreamFormat(Context context, BitstreamFormat existing, String identifier)
        throws FormatRegistryException, AuthorizeException;

    // Return the date when this entry was last changed, or null if unknown.
    public Date getLastModified(String identifier)
        throws FormatRegistryException;

    // Predicate, true if the format named by sub is a subtype of or
    // otherwise "conforms" to the format defined by fmt.
    public boolean conformsTo(String sub, String fmt)
        throws FormatRegistryException;

    // Free any resources associated with this registry connection,
    // since it will not be used any more.
    public void shutdown();
}
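
The self-naming rule described before the interface can be sketched as follows. The types `NamedRegistry` and `RegistryLookup` are hypothetical simplifications made up for this example; the real design would go through the DSpace plugin mechanism rather than a hand-written map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, reduced form of a registry plugin: it only knows its own namespace.
interface NamedRegistry {
    String getNamespace();
}

// Hypothetical lookup table keyed by the namespace each plugin reports about itself.
final class RegistryLookup {
    private final Map<String, NamedRegistry> byNamespace = new LinkedHashMap<>();

    // The key comes from the plugin instance, never from site configuration.
    void register(NamedRegistry registry) {
        String ns = registry.getNamespace();
        if (byNamespace.putIfAbsent(ns, registry) != null) {
            throw new IllegalStateException("duplicate registry namespace: " + ns);
        }
    }

    // Returns the registry for a namespace, or null if none is registered.
    NamedRegistry find(String namespace) {
        return byNamespace.get(namespace);
    }

    // Namespaces of all registered plugins, in registration order.
    String[] getRegistryNamespaces() {
        return byNamespace.keySet().toArray(new String[0]);
    }
}
```

Because the administrator never types the namespace, two installations that load the same plugin class end up with the same namespace string, which is what makes exchanged identifiers comparable.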

Registry Name

Typically the name of the registry is bound to some well-known public
constant so it can be referred to in a program without a "magic string"
that is easily misspelled to disastrous effect. E.g.:

...

Supplied with the identifiers of two entries in the registry, this
predicate function returns true if the first format conforms to
the second. That means, any Bitstream identified as the
first format would pass the tests to be identified as the second as well.
For example, if the first format is a specific version of a format while
the second identifier names a format family which includes it,
conformsTo would be true.
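
A registry that records, for each format, the broader format it belongs to could answer this predicate as in the Java sketch below. The class `ConformanceSketch`, its `declare` method, and the identifiers used with it are hypothetical; real registries may model conformance quite differently.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model: each format identifier has at most one broader format.
final class ConformanceSketch {
    private final Map<String, String> broaderOf = new HashMap<>();

    // Record that 'sub' is a more specific kind of 'broader'.
    void declare(String sub, String broader) {
        broaderOf.put(sub, broader);
    }

    // True if sub equals fmt, or reaches fmt by following broader-format links.
    boolean conformsTo(String sub, String fmt) {
        String current = sub;
        int steps = 0;
        // The step bound guards against accidental cycles in the declared links.
        while (current != null && steps++ <= broaderOf.size()) {
            if (current.equals(fmt)) {
                return true;
            }
            current = broaderOf.get(current);
        }
        return false;
    }
}
```

Note that the relation is directional: a specific version conforms to its family, but the family does not conform to the specific version.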

...

The initial implementation also includes built-in format registries for the
DSpace and Provisional registry namespaces.
Unlike the DSpace-Internal registry, they are optional.
By itself, the
DSpace registry reproduces the release 1.4.x behavior to offer
the option of backward-compatibility.
The Provisional registry offers a separate place to put formats local
to the archive, safe from namespace collisions and future updates
to the DSpace registry. (It is not always the recommended
way to handle new formats; more on this later.)

...

One problem that has not yet been completely addressed by this
design is that many format-identification methods require random access
to the contents of a Bitstream, but the Bitstream API only offers
serial access through a Java <tt>InputStream</tt>. Random access
means reading a sequence of bytes from the Bitstream starting at
any point in its extent; this is very helpful when looking for an
internal signature to identify the file, since the signature may
be located relative to the end of the file or at some large offset into it.
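
One possible workaround, shown here only as a sketch and not as part of this design, is to spool the serial stream to a temporary file and then use the standard `java.io.RandomAccessFile` class to seek within it. The class `TailReader` and its method are hypothetical; the approach costs a full copy of the Bitstream, which may be unacceptable for very large files.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper that gives random access to data only available as a stream.
final class TailReader {
    // Copies the stream to a temporary file, then returns up to 'count'
    // bytes from the end of the data. 'count' must not be negative.
    static byte[] readTail(InputStream in, int count) throws IOException {
        Path tmp = Files.createTempFile("bitstream", ".tmp");
        try {
            // createTempFile already created the file, so it must be replaced.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
                long length = raf.length();
                int n = (int) Math.min(count, length);
                byte[] tail = new byte[n];
                raf.seek(length - n); // jump directly to the last n bytes
                raf.readFully(tail);
                return tail;
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

A format identifier could use the same pattern to inspect bytes at any offset, not only at the end; the tail case is shown because a trailing signature is impossible to find with a forward-only stream unless the whole content is read first.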

...

Please use the Discussion Page
for your comments on this page.

Other Documentation

...