Item Importer and Exporter

DSpace has a set of command line tools for importing and exporting items in batches, using the DSpace Simple Archive Format (SAF). Apart from the offered functionality, these tools serve as an example for users who aim to implement their own item importer.

DSpace Simple Archive Format

The basic concept behind DSpace's Simple Archive Format (SAF) is to create an archive, which is a directory containing one subdirectory per item. Each item directory contains a file for the item's descriptive metadata, and the files that make up the item.

Code Block
archive_directory/
    item_000/
        dublin_core.xml         -- qualified Dublin Core metadata for metadata fields belonging to the 'dc' schema.
        metadata_[prefix].xml   -- metadata in another schema.  The prefix is the name of the schema as registered with the metadata registry.
        contents                -- text file containing one line per filename.
        collections             -- (Optional) text file listing the handles of the collections the item will belong to, one handle per line.
                                -- The collection on the first line will be the owning collection.
        handle                  -- contains the handle assigned/to be assigned to this resource
        relationships           -- (Optional) If importing Entities, you can specify one or more relationships to create on import
        file_1.doc              -- files to be added as bitstreams to the item.
        file_2.pdf
    item_001/
        dublin_core.xml
        contents
        file_1.png
        ...
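
The layout above can be generated with a short script. The following is a minimal sketch, not part of DSpace: the function name `write_saf_item` and its parameters are hypothetical, and it only writes `dublin_core.xml`, `contents` and the bitstream files. It assumes `contents` lines hold plain file names, and it omits the `qualifier` attribute for unqualified fields.

```python
import shutil
from pathlib import Path
from xml.etree import ElementTree as ET


def write_saf_item(archive_dir, item_name, metadata, files):
    """Create one SAF item directory (hypothetical helper, not a DSpace API).

    metadata: list of (element, qualifier, value) tuples for the 'dc' schema;
              qualifier may be None for unqualified fields.
    files:    paths of the files to copy in as bitstreams.
    """
    item_dir = Path(archive_dir) / item_name
    item_dir.mkdir(parents=True, exist_ok=True)

    # dublin_core.xml: one <dcvalue> entry per metadata value.
    root = ET.Element("dublin_core")
    for element, qualifier, value in metadata:
        attrs = {"element": element}
        if qualifier:
            attrs["qualifier"] = qualifier
        ET.SubElement(root, "dcvalue", attrs).text = value
    ET.ElementTree(root).write(
        str(item_dir / "dublin_core.xml"), encoding="utf-8", xml_declaration=True
    )

    # Copy each file next to the metadata and list it in 'contents',
    # one file name per line.
    names = []
    for src in map(Path, files):
        shutil.copy(src, item_dir / src.name)
        names.append(src.name)
    (item_dir / "contents").write_text(
        "".join(name + "\n" for name in names), encoding="utf-8"
    )
    return item_dir
```

Calling `write_saf_item("archive_directory", "item_000", [("title", None, "My title"), ("date", "issued", "2020")], ["file_1.doc"])` would produce an `item_000` directory shaped like the one shown above.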

dublin_core.xml or metadata_[prefix].xml

The dublin_core.xml or metadata_[prefix].xml file has the following format, where each metadata element has its own entry within a <dcvalue> tagset. There are currently three tag attributes available in the <dcvalue> tagset:

...

Note
Recommended Metadata

It is recommended to minimally provide "dc.title" and, where applicable, "dc.date.issued".  Obviously you can (and should) provide much more detailed metadata about the Item.  For more information see: Metadata Recommendations.

contents file

The contents file simply enumerates, one file per line, the bitstream file names. See the following example:

...

'IIIFHEIGHT' is the image height that will be used for the IIIF canvas.


relationships file

Note
Supported in 7.1 or above for 'import' only.
This feature was added in 7.1. Currently the 'relationships' file is only supported on import ('add' mode) of an SAF package.  See the note at the bottom of this section about using "metadata_relation.xml" if you wish to export and update relationships.

...

Note
Relationships to existing Entities can also be created via metadata_relation.xml

If you already know the UUID of an existing Entity that you want to relate to, you can also create/update the "metadata_relation.xml" file to add/update the relationship, similar to:

Code Block
metadata_relation.xml
<dublin_core schema="relation">
  <dcvalue element="isAuthorOfPublication">5dace143-1238-4b4f-affb-ed559f9254bb</dcvalue>
</dublin_core>

The "relationships" file is primarily for creating relationships between Entities in the same import batch. Of course, you can also choose to use the "relationships" file to create new relationships to existing Entities instead of creating/updating the "metadata_relation.xml" file.  The main advantage of the "metadata_relation.xml" file is that it is used both on export and import, while the "relationships" file is only used on import at this time.


Configuring metadata_[prefix].xml for a Different Schema

It is possible to use other schemas such as EAD, VRA Core, etc. Make sure you have defined the new schema in the DSpace Metadata Schema Registry.

  1. Create a separate file for the other schema named metadata_[prefix].xml, where the [prefix] is replaced with the schema's prefix.
  2. Inside the XML file use the same Dublin Core syntax, but on the <dublin_core> element include the attribute schema="[prefix]".
  3. Here is an example for ETD metadata, which would be in the file metadata_etd.xml:

    Code Block
    <dublin_core schema="etd">
         <dcvalue element="degree" qualifier="department">Computer Science</dcvalue>
         <dcvalue element="degree" qualifier="level">Masters</dcvalue>
         <dcvalue element="degree" qualifier="grantor">Michigan Institute of Technology</dcvalue>
    </dublin_core>
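
The steps above can also be scripted. This is a minimal sketch using Python's standard `xml.etree.ElementTree`. The function name `schema_metadata_xml` is hypothetical, and the schema prefix you pass in is assumed to be registered in the metadata registry already. `ET.indent` requires Python 3.9 or later.

```python
from xml.etree import ElementTree as ET


def schema_metadata_xml(prefix, values):
    """Return (file name, XML text) for a metadata_[prefix].xml file.

    values: list of (element, qualifier, value) tuples; qualifier may be None.
    """
    # Same syntax as dublin_core.xml, plus the schema attribute on the root.
    root = ET.Element("dublin_core", {"schema": prefix})
    for element, qualifier, value in values:
        attrs = {"element": element}
        if qualifier:
            attrs["qualifier"] = qualifier
        ET.SubElement(root, "dcvalue", attrs).text = value
    ET.indent(root)  # pretty-print only; does not change the values
    return f"metadata_{prefix}.xml", ET.tostring(root, encoding="unicode")


file_name, xml_text = schema_metadata_xml(
    "etd",
    [
        ("degree", "department", "Computer Science"),
        ("degree", "level", "Masters"),
    ],
)
print(file_name)
print(xml_text)
```

The returned text can then be written into the item directory next to dublin_core.xml.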


Importing Items

Before running the item importer over items previously exported from a DSpace instance, please first refer to Transferring Items Between DSpace Instances.

...

The item importer can batch import any number of items into a particular collection using a simple CLI command and its arguments.

Adding Items to a Collection from a directory

To add items to a collection, you gather the following information:

...

Testing. You can add --validate (or -v) to the command to simulate the entire import process without actually doing the import. This is extremely useful for verifying your import files before doing the actual import.
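
A quick local check can catch obvious packaging mistakes before the importer is invoked at all. The sketch below is not a substitute for --validate and is not part of DSpace. The function name `check_saf_archive` is hypothetical, and it assumes each line of the contents file begins with a plain file name, optionally followed by tab-separated options.

```python
from pathlib import Path


def check_saf_archive(archive_dir):
    """Return a list of human-readable problems found in an SAF directory."""
    problems = []
    item_dirs = sorted(p for p in Path(archive_dir).iterdir() if p.is_dir())
    for item_dir in item_dirs:
        if not (item_dir / "dublin_core.xml").is_file():
            problems.append(f"{item_dir.name}: missing dublin_core.xml")

        contents = item_dir / "contents"
        if not contents.is_file():
            problems.append(f"{item_dir.name}: missing contents file")
            continue

        # Assumption: the file name is the first tab-separated field.
        for line in contents.read_text(encoding="utf-8").splitlines():
            name = line.split("\t", 1)[0].strip()
            if name and not (item_dir / name).is_file():
                problems.append(f"{item_dir.name}: listed file not found: {name}")
    return problems
```

An empty list means only that the files named in each contents file are present. The importer's own validation remains the authoritative test.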

Adding Items to a Collection from a zipfile

To add items to a collection, you gather the following information:

...

Testing. You can add --validate (or -v) to the command to simulate the entire import process without actually doing the import. This is extremely useful for verifying your import files before doing the actual import.

Replacing Items in a Collection

Replacing existing items is relatively easy. Remember that mapfile you saved above? Now you will use it. The command (in short form):

...

Code Block
[dspace]/bin/dspace import -r -e joe@user.com -c collectionID -s zipfile_dir -z filename.zip -m mapfile
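
When replacements are run from a script, it can help to assemble the argument list in one place. The following is a sketch only: `build_replace_command` is a hypothetical helper, and it uses just the flags that appear on this page (-r, -e, -c, -s, -z, -m and --validate).

```python
def build_replace_command(dspace_dir, eperson, collection, source_dir,
                          mapfile, zip_name=None, validate=False):
    """Assemble the argument list for a replace-mode import (not executed here)."""
    cmd = [
        f"{dspace_dir}/bin/dspace", "import", "-r",
        "-e", eperson,
        "-c", collection,
        "-s", source_dir,
    ]
    if zip_name:
        # With a zip file, -s names the directory that holds the zip.
        cmd += ["-z", zip_name]
    cmd += ["-m", mapfile]
    if validate:
        # Simulate the import without changing anything.
        cmd.append("--validate")
    return cmd


print(build_replace_command("/dspace", "joe@user.com", "collectionID",
                            "zipfile_dir", "mapfile", zip_name="filename.zip"))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`. Building it with `validate=True` first gives a dry run of the same command.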


Deleting or Unimporting Items in a Collection

You are able to unimport or delete items provided you have the mapfile. Remember that mapfile you saved above? The command is (in short form):

...

Code Block
[dspace]/bin/dspace import --eperson=joe@user.com --delete --mapfile mapfile

Other Options

  • Workflow. The importer usually bypasses any workflow assigned to a collection. Adding the --workflow (-w) argument will instead route the imported items through the workflow system.

...

  • Resume. If, during importing, you have an error and the import is aborted, you can use the --resume (-R) flag to resume the import where you left off after you fix the error.

  • Specifying the owning collection on a per-item basis from the command line administration tool

    If you omit the -c flag, which is otherwise mandatory, the ItemImporter searches for a file named "collections" in each item directory. This file should contain a list of collections, one per line, specified either by their handle, or by their internal db id. The ItemImporter then will put the item in each of the specified collections. The owning collection is the collection specified in the first line of the collections file.

    If the -c flag is specified and a collections file also exists in the item directory, the ItemImporter will ignore the collections file and will put the item in the collection specified on the command line.

    Since the collections file can differ between item directories, this gives you more fine-grained control of the process of batch adding items to collections.
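
As an illustration of the per-item approach, the sketch below writes a "collections" file into an item directory, with the owning collection on the first line. The helper name `write_collections_file` and the handles in the usage line are hypothetical.

```python
from pathlib import Path


def write_collections_file(item_dir, owning_collection, other_collections=()):
    """Write the per-item 'collections' file, one collection per line.

    The first line is the owning collection; the item is also placed
    in every other collection listed after it.
    """
    lines = [owning_collection, *other_collections]
    path = Path(item_dir) / "collections"
    path.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return path
```

For example, `write_collections_file("archive_directory/item_000", "123456789/10", ["123456789/11"])` makes 123456789/10 the owning collection, provided the import is run without the -c flag.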

UI Batch Import

Info

Available in DSpace 7.4 and above.

...

Note

It is also possible to start an "import" directly from the "Processes" menu.  This allows you to specify additional options/flags which are normally only available to the command-line "import" tool (see documentation above). 

Exporting Items

The item exporter can export a single item or a collection of items, and creates a DSpace simple archive in the aforementioned format for each exported item. The items are exported in the order in which they are retrieved from the database. As a consequence, the sequence numbers of the item subdirectories (item_000, item_001) are not related to DSpace handles or item IDs.
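
Because the subdirectory numbers carry no meaning, a script that needs to identify exported items has to look inside each directory. The sketch below assumes each exported item directory contains the "handle" file described in the Simple Archive Format layout. The function name `handles_by_item_dir` is hypothetical.

```python
from pathlib import Path


def handles_by_item_dir(export_dir):
    """Map each exported item subdirectory name to the handle stored in it."""
    mapping = {}
    for item_dir in sorted(p for p in Path(export_dir).iterdir() if p.is_dir()):
        handle_file = item_dir / "handle"
        if handle_file.is_file():
            # The handle file is expected to hold a single handle.
            mapping[item_dir.name] = handle_file.read_text(encoding="utf-8").strip()
    return mapping
```

Item directories without a handle file are simply left out of the result.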

...

The -x argument performs a standard export but skips the bitstreams. If you have a full SAF export without bitstreams, and the original bitstream archive (which might have been imported into DSpace earlier) is still available, you can symlink the original files into the SAF directories. The result is an exported collection that occupies almost no additional space but is otherwise identical to a full export (i.e. it could be imported into DSpace). For huge collections, -x mode might be substantially faster than a full export.
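
The symlinking step could be scripted along the following lines. This is a sketch under several assumptions: the contents file of each exported item still lists the bitstream file names (first tab-separated field of each line), and the caller supplies a mapping from each exported item directory name to the directory holding its original files. `link_bitstreams` and its parameters are hypothetical names.

```python
import os
from pathlib import Path


def link_bitstreams(saf_dir, originals):
    """Symlink original files into a bitstream-less SAF export.

    originals: dict mapping an exported item directory name (e.g. "item_000")
               to the directory that holds that item's original files.
    Returns the list of links created.
    """
    created = []
    for item_dir in sorted(p for p in Path(saf_dir).iterdir() if p.is_dir()):
        contents = item_dir / "contents"
        source_dir = originals.get(item_dir.name)
        if source_dir is None or not contents.is_file():
            continue
        for line in contents.read_text(encoding="utf-8").splitlines():
            name = line.split("\t", 1)[0].strip()
            if not name:
                continue
            target = Path(source_dir) / name
            link = item_dir / name
            # Skip missing originals and anything already present at the link path.
            if target.is_file() and not os.path.lexists(link):
                os.symlink(target.resolve(), link)
                created.append(link)
    return created
```

Because the export numbering is unrelated to the original archive, the mapping has to be built by the caller, for example from the handle of each item.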


UI Batch Export

Info

Available in DSpace 7.4 and above.

...