Old Release
This documentation relates to an old version of DSpace, version 3.x.
This DSpace release is end-of-life and is no longer supported.
Batch Metadata Editing Tool
DSpace provides a batch metadata editing tool. The tool exports and imports metadata as comma-delimited files in the CSV format, and allows the user to perform the following:
- Batch editing of metadata (e.g. perform an external spell check)
- Batch additions of metadata (e.g. add an abstract to a set of items, add controlled vocabulary such as LCSH)
- Batch find and replace of metadata values (e.g. correct misspelled surname across several records)
- Mass move items between collections
- Mass deletion, withdrawal, or re-instatement of items
- Batch addition of new items (without bitstreams) via a CSV file
- Re-order the values in a list (e.g. authors)
For information about configuration options for the Batch Metadata Editing tool, see Batch Metadata Editing Configuration.
Export Function
The following table summarizes the basics.
Command used: | [dspace]/bin/dspace metadata-export |
Java class: | org.dspace.app.bulkedit.MetadataExport |
Arguments (short and long forms): | Description |
-f or --file | Required. The filename of the resulting CSV. |
-i or --id | The Item, Collection, or Community handle or Database ID to export. If not specified, all items will be exported. |
-a or --all | Include all the metadata fields that are not normally changed (e.g. provenance), or those fields you have configured to be ignored in modules/bulkedit.cfg. |
-h or --help | Display the help page. |
Exporting Process
To run the batch editing exporter, at the command line:
[dspace]/bin/dspace metadata-export -f name_of_file.csv -i 1023/24
Example:
[dspace]/bin/dspace metadata-export -f /batch_export/col_14.csv -i 1989.1/24
In the above example, the collection with handle '1989.1/24' is exported in its entirety to the file 'col_14.csv' in the '/batch_export' directory.
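If the -i argument is omitted, the whole repository is exported. For example (the destination path here is only an illustration):
[dspace]/bin/dspace metadata-export -f /batch_export/all_items.csv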
Import Function
The following table summarizes the basics.
Command used: | [dspace]/bin/dspace metadata-import |
Java class: | org.dspace.app.bulkedit.MetadataImport |
Arguments (short and long forms): | Description |
-f or --file | Required. The filename of the CSV file to load. |
-s or --silent | Silent mode. The import function does not prompt you to make sure you wish to make the changes. |
-e or --email | The email address of the user. This is only required when adding new items. |
-w or --workflow | When adding new items, the program will queue the items up to use the Collection Workflow processes. |
-n or --notify | When adding new items using a workflow, send notification emails. |
-t or --template | When adding new items, use the Collection template, if it exists. |
-h or --help | Display the brief help page. |
Silent mode should be used carefully. Without the confirmation prompt it is easy to overlay the wrong data and cause irreparable damage to the database.
Importing Process
To run the batch importer, at the command line:
[dspace]/bin/dspace metadata-import -f name_of_file.csv
Example
[dspace]/bin/dspace metadata-import -f /dImport/col_14.csv
To add new items without bitstreams (metadata-only items), at the command line:
[dspace]/bin/dspace metadata-import -f /dImport/new_file.csv -e joe@user.com -w -n -t
The above example uses all of the optional arguments: the new items are attributed to the specified user, queued into the collection workflow with notification emails sent, and the collection template (if one exists) is applied.
Importing large CSV files
It is not recommended to import CSV files of more than 1,000 lines. When importing files larger than this, it is hard to accurately verify the changes that the import tool states it will make, and large files may cause 'Out Of Memory' errors part way through the process.
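If you do need to work with a larger export, one approach (outside DSpace itself, assuming a Unix-like shell with the standard head, tail and split utilities, and using illustrative filenames) is to break the file into chunks of at most 1,000 rows, each carrying its own copy of the heading row:
head -n 1 big_export.csv > header.csv
tail -n +2 big_export.csv | split -l 1000 - chunk_
for f in chunk_*; do cat header.csv "$f" > "import_$f.csv"; rm "$f"; done
Note that split works on raw lines, so a quoted value containing an embedded newline could be cut across two chunks; check the chunk boundaries before importing.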
The CSV Files
The CSV files that this tool can import and export abide by the RFC4180 CSV format. This means that new lines and embedded commas can be included by wrapping elements in double quotes, and double quotes can be included by using two double quotes. The tool handles all of this for you, and any good CSV editor such as Excel or OpenOffice will comply with this convention.
File Structure. The first row of the CSV must define the metadata fields that the rest of the CSV represents. The first column must always be "id", which refers to the item's internal ID. All other columns are optional. The remaining columns name the Dublin Core metadata fields in which the data is to reside.
A typical heading row looks like:
id,collection,dc.title,dc.contributor,dc.date.issued,etc,etc,etc.
Subsequent rows in the csv file relate to items. A typical row might look like:
350,2292,Item title,"Smith, John",2008
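As a further illustration of the quoting rules, a title containing a comma and embedded double quotes would be written like this (illustrative values):
351,2292,"A ""quoted"" title, with a comma","Smith, John",2008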
If you want to store multiple values for a given metadata element, they can be separated with the double-pipe '||' (or another character that you have defined in your modules/bulkedit.cfg file). For example:
Horses||Dogs||Cats
Elements are stored in the database in the order that they appear in the csv file. You can use this to order elements where order may matter, such as authors, or controlled vocabulary such as Library of Congress Subject Headings.
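For example, using the heading row shown above, an item whose two authors must remain in a specific order could be written as (illustrative values):
350,2292,Item title,"Smith, John||Jones, Sarah",2008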
When importing a csv file, the importer will overlay the data onto what is already in the repository to determine the differences. It only acts on the contents of the csv file, rather than on the complete item metadata. This means that the CSV file that is exported can be manipulated quite substantially before being re-imported. Rows (items) or Columns (metadata elements) can be removed and will be ignored. For example, if you only want to edit item abstracts, you can remove all of the other columns and just leave the abstract column. (You do need to leave the ID column intact. This is mandatory).
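For instance, a pared-down CSV that only edits abstracts needs nothing more than the id column and an abstract column (illustrative values):
id,dc.description.abstract
350,"A corrected abstract for this item."
351,"Another corrected abstract."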
Editing Collection Membership
Items can be moved between collections by editing the collection handles in the 'collection' column. Multiple collections can be included. The first collection is the 'owning collection'. The owning collection is the primary collection that the item appears in. Subsequent collections (separated by the field separator) are treated as mapped collections. These are the same as using the map item functionality in the DSpace user interface. To move items between collections, or to edit which other collections they are mapped to, change the data in the collection column.
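For example, the following row (with illustrative handles) makes 1989.1/24 the owning collection and also maps the item into 1989.1/56:
id,collection,dc.title
350,1989.1/24||1989.1/56,Item title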
Adding Metadata-Only Items
New metadata-only items can be added to DSpace using the batch metadata importer. To do this, enter a plus sign '+' in the first 'id' column. The importer will then treat this as a new item. If you are using the command line importer, you will need to use the -e flag to specify the email address or ID of the user who is recorded as submitting the items.
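For example, a new metadata-only item destined for collection 1989.1/24 might look like this (illustrative values):
id,collection,dc.title,dc.contributor.author,dc.date.issued
+,1989.1/24,A new item title,"Doe, Jane",2013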
Deleting Metadata
It is possible to perform metadata deletions across the board for certain metadata fields from an exported file. For example, let's say you have used keywords (dc.subject) that need to be removed en masse. You would leave the column (dc.subject) intact, but remove the data in the corresponding rows.
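For example, keeping the dc.subject column but leaving its cells empty removes all existing subject keywords from those items on import (illustrative ids):
id,dc.subject
350,
351,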
Performing 'actions' on items
It is possible to perform certain 'actions' on items. This is achieved by adding an 'action' column to the CSV file (after the 'id' and 'collection' columns). There are three possible actions (an example is shown after this list):
- 'expunge' This permanently deletes an item. Use with care! This action must be enabled by setting 'allowexpunge = true' in modules/bulkedit.cfg
- 'withdraw' This withdraws an item from the archive, but does not delete it.
- 'reinstate' This reinstates an item that has previously been withdrawn.
If an action makes no change (for example, asking to withdraw an item that is already withdrawn) then, just like metadata that has not changed, this will be ignored.
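For example, the following rows (with illustrative ids and handles) withdraw one item and reinstate another:
id,collection,action
350,1989.1/24,withdraw
351,1989.1/24,reinstate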
Migrating Data or Exchanging data
It is possible that you have data in one Dublin Core (DC) element that you would rather have in another. For example, your staff may have entered Library of Congress Subject Headings in the Subject field (dc.subject) instead of the LCSH field (dc.subject.lcsh). Follow these steps, illustrated after the list, and the data will be migrated on import:
- Insert a new column. The first row should be the new metadata element. (We will refer to it as the TARGET)
- Select the column/rows of the data you wish to change. (We will refer to it as the SOURCE)
- Cut and paste this data into the new column (TARGET) you created in Step 1.
- Leave the column (SOURCE) you just cut and pasted from empty. Do not delete it.
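Continuing the LCSH example, and reusing the illustrative subject values from earlier, the edited CSV keeps the emptied source column (dc.subject) alongside the new target column (dc.subject.lcsh):
id,dc.subject,dc.subject.lcsh
350,,Horses||Dogs||Cats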
Common Issues
Metadata values in CSV export seem to have duplicate columns