In <cite>Automatic Format Identification using PRONOM and DROID</cite>, Adrian Brown defines a "data format" as:
<blockquote>The internal structure and encoding of a digital object, which allows it to be processed, or to be rendered in human-accessible form.</blockquote>

Note that this implies more than just knowing the common name of a Bitstream's format, e.g. "Adobe PDF". That name actually describes a family of formats. In order to know exactly how to recover the intelligence in a particular Bitstream, you'd want to know which specific version of PDF it is: later versions have features not found in earlier ones. The "internal structure and encoding" imposed by a data format is usually defined in exacting detail by a format specification document, and/or by the software applications that produce and consume that format.

For extensive additional background, see About Data Formats, which serves as a manifesto of sorts for this project.


  • Enable accurate, meaningful, fine-grained, and globally-understood identification of a Bitstream's data format.
  • Maintain backward compatibility with most existing code, and existing archives.
  • Introduce the binding of persistent, externally-assigned data format identifiers to BitstreamFormats.
  • Integrate tightly with "standard" data format registries, using a plugin framework for flexible configuration:
    • Anticipate that the Global Digital Format Registry (GDFR) will be the registry of choice, but allow free choice of other metadata sources.
    • Recognize references to entries in "standard" data format registries in ingested content (e.g. technical MD in SIPs) to facilitate exchange of SIPs and DIPs.
    • The DSpace data model directly includes only the subset of format metadata it has an immediate use for, and references entries in an external format registry for the rest.
    • Refer to formats by the external data format registry's identifiers so format technical metadata is recognized outside of DSpace.
  • Improve the automatic identification of data formats in batch and non-interactive content ingestion.
  • Help interactive users identify formats easily and with accuracy during interactive submission.
  • Rationalize use of BitstreamFormat object:
    • Eliminate the overloaded use of the "License" format and "Internal" flag in BitstreamFormats to mark and hide deposit license bitstreams.
    • Attempt to accurately describe the data format of every Bitstream, even the ones created for internal use.
  • Create a pluggable interface to external data format registries, to encourage experimentation and track developments in this highly active field.
  • Add a separate pluggable format-identification interface to allow a "stack" of methods to identify the format of a Bitstream by various techniques.

Use Cases

See BitstreamFormat Renovation for the sketches of the anticipated use cases that drove this design. The text grew too large for one page.


BitstreamFormat Properties






Registry, Can override


Brief, human-readable description of this format, for listing it in menus.
Used to be "short description".


Registry, Can override


Detailed human-readable explanation of the format including its unique aspects.




List of all namespaced identifiers linking this BSF to an entry in an
established data format registry. A BSF must have at least one identifier.
This list is ordered; the first member names the external registry entry
that was originally imported to create this BSF.




Encoding of the local DSpace archive's policy regarding preservation of Bitstreams encoded in this format. Value must be one of:

  1. Unset - Policy not yet initialized; flags format entries that need attention from the DSpace administrator.
  2. Unrecognized - Format cannot be identified.
  3. Known - Format was identified but preservation services are not promised.
  4. Supported - Bitstream will be preserved.


Registry, Can override


Canonical MIME type (Internet data type) that describes this format. This is where the Content-Type header's value comes from, when delivering a Bitstream by HTTP.


Registry, Can override


The canonical filename extension to apply to unnamed Bitstreams when
delivering content over HTTP and in DIPs. (NOTE: Some format
registries keep a list of filename extensions, used to help
identify formats, but we only need the canonical extension in the BSF model.)




Timestamp when this BSF was imported or last updated from its home registry.


Add the following methods:


// Returns all external identifiers bound to this BSF.
public String[] getIdentifiers()
throws SQLException, AuthorizeException;
// Add a binding to an external identifier.
// Overloads accept a namespaced identifier, or a separate namespace and identifier.
public void addIdentifier(String nsIdentifier)
throws SQLException, AuthorizeException;
public void addIdentifier(String namespace, String identifier)
throws SQLException, AuthorizeException;
// Remove a binding to an external identifier.
public void deleteIdentifier(String nsIdentifier)
throws SQLException, AuthorizeException;
public void deleteIdentifier(String namespace, String identifier)
throws SQLException, AuthorizeException;
// Find the BSF bound to an external identifier; returns null if none is found.
// Overloads accept a namespaced identifier, or a separate namespace and identifier.
public BitstreamFormat findByIdentifier(Context context, String nsIdentifier)
throws SQLException, AuthorizeException, FormatRegistryException;
public BitstreamFormat findByIdentifier(Context context, String namespace, String identifier)
throws SQLException, AuthorizeException, FormatRegistryException;
// Advanced version with an extra "doImport" param: when false, do NOT look
// for the format in registries, but return null if there is no
// existing BSF matching the indicated format. (Mainly for internal use.)
public BitstreamFormat findByIdentifier(Context context, String namespace, String identifier, boolean doImport)
throws SQLException, AuthorizeException, FormatRegistryException;
// Return true if this BSF conforms to the target identifier, i.e. if it
// would be acceptable to a service that accepts the given format.
public boolean conformsTo(String nsIdentifier)
throws SQLException, AuthorizeException, FormatRegistryException;
// Get and set the canonical filename extension (without ".").
public String getCanonicalExtension();
public void setCanonicalExtension(String extension);
// Get and set the Name (takes the place of ShortDescription).
public String getName();
public void setName(String s);
// Flags that show whether to override the values picked up from the external registry.
// Set to false to remove the override.
public boolean isOverrideName();
public void setOverrideName(boolean val);
public boolean isOverrideDescription();
public void setOverrideDescription(boolean val);
public boolean isOverrideMIMEType();
public void setOverrideMIMEType(boolean val);
public boolean isOverrideCanonicalExtension();
public void setOverrideCanonicalExtension(boolean val);
// Return true if this BSF is the unknown format.
public boolean isUnknown();
// Return the date this external identifier was last imported to the BSF.
public Date getLastImported(String namespace, String id)
throws SQLException, AuthorizeException;
public Date getLastImported(String nsIdentifier)
throws SQLException, AuthorizeException;
// Set the date this external identifier was last imported to the BSF.
public void setLastImported(String namespace, String id, Date newdate)
throws SQLException, AuthorizeException;
public void setLastImported(String nsIdentifier, Date newdate)
throws SQLException, AuthorizeException;
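
One plausible reading of the override flags is that a refresh from the home registry replaces a property's value unless its flag is set. The sketch below illustrates that reading only; OverridableProperty, setLocal() and updateFromRegistry() are hypothetical names, not part of the proposed API, and the sample values are made up.

```java
// Hypothetical illustration only: OverridableProperty is not a DSpace class.
final class OverridableProperty {
    private String value;
    private boolean override;

    OverridableProperty(String registryValue) {
        this.value = registryValue;
    }

    String get() {
        return value;
    }

    boolean isOverride() {
        return override;
    }

    // Setting the flag to false removes the override; the local value then
    // stays in place until the next registry update replaces it.
    void setOverride(boolean val) {
        this.override = val;
    }

    // A local edit, e.g. by the DSpace administrator.
    void setLocal(String localValue) {
        this.value = localValue;
    }

    // Called when the BSF is refreshed from its home registry.
    void updateFromRegistry(String registryValue) {
        if (!override) {
            this.value = registryValue;
        }
    }
}
```

With the flag set, a registry update leaves the locally edited name alone; once the flag is cleared, the next update brings the registry's value back.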

Remove these methods:


getShortDescription() // renamed to getName()
setShortDescription() // renamed to setName()


These existing methods are retained; they are listed here only for completeness.


static BitstreamFormat create(Context context);
void delete();
static BitstreamFormat find(Context context, int id);
static BitstreamFormat findUnknown(Context context);
static BitstreamFormat[] findAll(Context context);
String getDescription();
int getID();
String getMIMEType();
int getSupportLevel();
static int getSupportLevelID(String slevel);
void setDescription(String s);
void setMIMEType(String s);
void setSupportLevel(int sl);
void update();


Following DSpace coding conventions, the factory and static class for a service is named with the suffix -Manager. The FormatRegistryManager class gives access to instances of FormatRegistry. Since a format identifier is directed to a FormatRegistry implementation by its namespace, the Manager also takes care of selecting the right instance for a namespaced identifier. This lets applications use namespaced identifiers without worrying about taking them apart to choose a registry instance.


Here is a sketch of the API:


public class FormatRegistryManager {
// Namespace for the internal format registry - contains only "Unknown".
public static final String INTERNAL_NAMESPACE = "Internal";
// Name of the unknown format.
public static final String UNKNOWN_FORMAT_IDENTIFIER = "Unknown";
// Applications should use this as the default MIME type.
public static final String DEFAULT_MIME_TYPE = "application/octet-stream";
// Returns the possibly-localized human-readable name of the Unknown format.
public static String getUnknownFormatName(Context context);
// Returns the registry plugin for an external format identifier namespace.
public static FormatRegistry find(String namespace);
// Returns array of *all* namespace strings, even "artifacts" no longer configured.
public static String[] getAllNamespaces(Context context)
throws SQLException, AuthorizeException;
// Returns array of the namespaces of all currently configured external registries.
public static String[] getRegistryNamespaces();
// Calls the appropriate registry plugin to import the format bound to a namespaced identifier.
// Returns null on error.
public static BitstreamFormat importExternalFormat(Context context, String namespace, String identifier)
throws FormatRegistryException, AuthorizeException;
// Calls the appropriate registry plugin to update the format bound to a namespaced identifier.
// When force is true, update even when the external format has not been modified.
public static void updateBitstreamFormat(Context context, BitstreamFormat existing, String namespace, String identifier, boolean force)
throws FormatRegistryException, AuthorizeException;
// Calls the appropriate registry plugin to compare two namespaced
// identifiers (which must be in the same namespace).
public static boolean conformsTo(String nsIdent1, String nsIdent2)
throws FormatRegistryException;
// Creates a namespaced identifier out of a separate namespace and registry-specific identifier.
public static String makeIdentifier(String namespace, String identifier);
// Return the namespace or the identifier portion of a namespaced identifier.
public static String namespaceOf(String nsIdentifier);
public static String identifierOf(String nsIdentifier);
}
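
The three string helpers at the end of the sketch could behave roughly as follows. This is a minimal stand-in, not the proposed implementation: the class name NamespacedIds is hypothetical, and the ':' separator is an assumption, since the design does not say how namespace and identifier are joined.

```java
// Hypothetical stand-in for makeIdentifier(), namespaceOf() and identifierOf().
// The ':' separator is an assumption; the design does not specify one.
final class NamespacedIds {
    private static final char SEPARATOR = ':';

    static String makeIdentifier(String namespace, String identifier) {
        if (namespace.indexOf(SEPARATOR) >= 0) {
            throw new IllegalArgumentException("namespace must not contain " + SEPARATOR);
        }
        return namespace + SEPARATOR + identifier;
    }

    static String namespaceOf(String nsIdentifier) {
        return nsIdentifier.substring(0, split(nsIdentifier));
    }

    static String identifierOf(String nsIdentifier) {
        return nsIdentifier.substring(split(nsIdentifier) + 1);
    }

    // Split at the FIRST separator so the identifier part may itself contain ':'.
    private static int split(String nsIdentifier) {
        int i = nsIdentifier.indexOf(SEPARATOR);
        if (i < 0) {
            throw new IllegalArgumentException("not a namespaced identifier: " + nsIdentifier);
        }
        return i;
    }
}
```

Splitting at the first separator keeps the round trip lossless even when a registry's own identifiers contain the separator character, at the cost of forbidding it in namespace names.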


The FormatRegistry interface models an external data format registry. We define a data format registry as any formally organized and administered collection of technical metadata about data formats. This may include a collection published mainly for human consumption, such as the Library of Congress Sustainability of Digital Formats format catalog, as well as those accessible through public APIs such as the GDFR and DROID. The only requirement is that the data formats are named by unchanging, unique identifiers.


Here is the API of the FormatRegistry. The plugin's name is also the DSpace string value representing its namespace. It is implemented as a self-named plugin, so that the instance itself knows its namespace without depending on each DSpace administrator to get it right. The namespaces must be consistent between DSpace installations so that format technical metadata (i.e. PREMIS elements in AIPs) can be meaningfully exchanged.


// Implementing classes should extend SelfNamedPlugin.
package org.dspace.content.format;
public interface FormatRegistry {
// Typically returns 1 element, the namespace name of the implementation's registry.
String[] getPluginNames();
// Returns the DSpace namespace of this registry.
public String getNamespace();
// Return a URL needed to configure the underlying registry service;
// this allows the registry to configure itself from the DSpace
// configuration.
public URL getContactURL();
// Returns all external identifiers known to be synonyms of the
// given one, in namespaced-identifier format. (Because one registry
// may know about synonyms in other registries.)
public String[] getSynonymIdentifiers(Context context, String identifier)
throws FormatRegistryException, AuthorizeException;
// Import a new data format - returns a BitstreamFormat. There must
// not be any existing BSF with the same namespace and identifier.
public BitstreamFormat importExternalFormat(Context context, String identifier)
throws FormatRegistryException, AuthorizeException;
// Compare existing DSpace format against registry, updating anything that's changed.
// NOTE: it does not need to check the last-modified date; the framework does that.
public BitstreamFormat updateBitstreamFormat(Context context, BitstreamFormat existing, String identifier)
throws FormatRegistryException, AuthorizeException;
// Return date when this entry was last changed, or null if unknown.
public Date getLastModified(String identifier)
throws FormatRegistryException;
// Predicate, true if the format named by sub is a subtype of, or
// otherwise "conforms" to, the format defined by fmt.
public boolean conformsTo(String sub, String fmt)
throws FormatRegistryException;
// Free any resources associated with this registry connection,
// since it will not be used any more.
public void shutdown();
}
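
To make the conformsTo() predicate concrete, here is a toy in-memory registry that answers it by following subtype links. Everything in it is hypothetical: the ToyRegistry class, the addSubtype() method and the sample identifiers are illustration only, and treating a format as conforming to itself is an assumption. A real registry would answer from its own metadata.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory registry; not a FormatRegistry implementation.
final class ToyRegistry {
    // Maps each format identifier to the more general format it refines.
    private final Map<String, String> parent = new HashMap<>();

    void addSubtype(String sub, String fmt) {
        parent.put(sub, fmt);
    }

    // True if sub equals fmt, or reaches fmt by following parent links.
    // The step counter stops the walk if the links accidentally form a cycle.
    boolean conformsTo(String sub, String fmt) {
        String current = sub;
        int steps = 0;
        while (current != null && steps++ <= parent.size()) {
            if (current.equals(fmt)) {
                return true;
            }
            current = parent.get(current);
        }
        return false;
    }
}
```

Note that the relation is directional: a specific version conforms to its family, but the family does not conform to a specific version.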

Registry Name

Typically the name of the registry is bound to some well-known public constant so it can be referred to in a program without a "magic string" that is easily misspelled to disastrous effect. E.g.:
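
A minimal sketch of the idea, using hypothetical names throughout (ExampleFormatRegistry, its NAMESPACE constant and the "Example" value are made up for illustration):

```java
// Hypothetical registry class; only the constant pattern matters here.
final class ExampleFormatRegistry {
    // The one place where the namespace string is spelled out.
    static final String NAMESPACE = "Example";

    String getNamespace() {
        return NAMESPACE;
    }
}

final class RegistryNameDemo {
    // Callers refer to the constant, so a misspelling becomes a compile
    // error instead of a lookup that silently fails at run time.
    static String describe(String namespace) {
        return ExampleFormatRegistry.NAMESPACE.equals(namespace)
                ? "handled by ExampleFormatRegistry"
                : "unknown namespace";
    }
}
```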


This is the default algorithm, implemented by those FormatIdentifier.identifyFormat() methods that simply call the FormatHit's addToResults() method on each hit they develop.



NOTE: It is not necessary to use this algorithm. As described above, the format identification process is completely under the control of the identifyFormat() method implementations.


  1. Start with an empty results list.
  2. Call the FormatIdentifier.identifyFormat() method of each plugin in the sequence in turn:
    • Passing it the Bitstream and list of accumulated results so it can add new results.
    • If it has a better-confidence match than the current head of the list, that hit becomes the new head of the list.
    • Otherwise the hit gets appended to the end of the list.
  3. When finished, the head of the list is the best format match.
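
The steps above can be sketched as follows. The types here are simplified, hypothetical stand-ins (Hit for FormatHit, Identifier for FormatIdentifier, a byte array for the Bitstream), and the integer confidence scale where larger means more confident is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for FormatHit.
final class Hit {
    final String format;
    final int confidence; // assumption: larger means more confident

    Hit(String format, int confidence) {
        this.format = format;
        this.confidence = confidence;
    }

    // Default rule: a strictly better hit becomes the head, others are appended.
    void addToResults(List<Hit> results) {
        if (results.isEmpty() || confidence > results.get(0).confidence) {
            results.add(0, this);
        } else {
            results.add(this);
        }
    }
}

// Hypothetical stand-in for FormatIdentifier.
interface Identifier {
    // Adds zero or more hits for the given content to the accumulated results.
    void identifyFormat(byte[] content, List<Hit> results);
}

final class IdentifierStack {
    // Runs every plugin in sequence and returns the head of the list,
    // or null when no plugin produced a hit.
    static Hit identify(byte[] content, List<Identifier> plugins) {
        List<Hit> results = new ArrayList<>();
        for (Identifier plugin : plugins) {
            plugin.identifyFormat(content, results);
        }
        return results.isEmpty() ? null : results.get(0);
    }
}
```

Because each plugin receives the accumulated list, a plugin that wants a different policy can inspect or reorder earlier hits instead of calling addToResults().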


Panel = \
org.dspace.content.format.PRONOMFormatRegistry, \
org.dspace.content.format.DSpaceFormatRegistry, \

# initialization files configured as "contact URIs":
formatRegistry.DSpace.document = /dspace/config/registries/dspace-formats.xml
formatRegistry.DSpace.validate = true
formatRegistry.DSpace.schema = /dspace/config/registries/formats.xml
formatRegistry.Provisional.document = /dspace/config/registries/provisional-formats.xml
formatRegistry.Provisional.validate = true
formatRegistry.Provisional.schema = /dspace/config/registries/formats.xml

Format identifiers are configured in a sequence plugin, as in this example:


Please use the Discussion Page for your comments on this page.
