Date & Time
- October 11th 15:00 UTC/GMT - 11:00 ET
This call is a Community Forum call: Sharing best practices and challenges in the use of existing DSpace features
Dial-in
We will use the international conference call dial-in. Please follow directions below.
- U.S.A/Canada toll free: 866-740-1260, participant code: 2257295
- International toll free: http://www.readytalk.com/intl
- Use the above link and enter 2257295 and the country you are calling from to get your country's toll-free dial-in number
- Once on the call, enter participant code 2257295
Agenda
Community Forum Call: DSpace Importing and Bulk Metadata Editing
Sharing best practices, challenges, and questions
- DSpace Importing and Bulk Metadata Editing
- Building simple archive format structures/folders
- Working with the spreadsheet bulk editing tool
- Command line imports
Preparing for the call
Bring your questions/comments you would like to discuss to the call, or add them to the comments of this meeting page.
If you can join the call, or are willing to comment on the topics submitted via the meeting page, please add your name, institution, and repository URL to the Call Attendees section below.
Meeting notes
Batch Metadata Editing
DSpace offers a built-in batch metadata editing feature that lets administrators export metadata as a CSV file. The CSV file can be opened in a spreadsheet application, where the metadata can be edited. After editing, administrators save the spreadsheet as a CSV file again and import it back into DSpace.
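The same round trip can also be scripted instead of going through a spreadsheet. The following is a minimal Python sketch, not part of DSpace: the function name `replace_in_column` and the sample rows are hypothetical, and it only assumes a CSV with a header row of metadata field names.

```python
import csv
import io


def replace_in_column(csv_text, column, old, new):
    """Return CSV text with `old` replaced by `new` in a single column."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Cells missing from a short row come back as None; treat them as empty.
        row[column] = (row[column] or "").replace(old, new)
        writer.writerow(row)
    return out.getvalue()


# Hypothetical export: two items sharing an abbreviated publisher name.
sample = (
    "id,collection,dc.title,dc.publisher\n"
    "1,123456789/2,First report,Univ. Press\n"
    "2,123456789/2,Second report,Univ. Press\n"
)
edited = replace_in_column(sample, "dc.publisher", "Univ. Press", "University Press")
print(edited)
```

For real exports, read and write the file with `encoding="utf-8"` so that the values are not altered on the way through.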
Georgetown University created several tools as an extension of the standard DSpace batch editing functionality. These tools will become part of the DSpace 6 codebase.
Georgetown University created tools for:
- Replacing metadata for all items in a collection
- Replacing metadata by query
- Generating bulk ingest folders (metadata and media)
UTF8 encoding issue
When using the batch metadata functionality, metadata sometimes gets corrupted when the CSV file is opened in a spreadsheet application. This happens when some characters are not read correctly as UTF8, so that the metadata value is wrong when the spreadsheet is saved as a CSV file again, even if the value itself was never edited.
According to DCAT this is due to a lack of correct encoding support in (certain) spreadsheet applications.
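One way to catch this kind of corruption before importing is to compare the CSV saved from the spreadsheet with the original export. The sketch below is an illustration under stated assumptions, not a DSpace tool: the helper names `load_rows` and `changed_cells` are hypothetical, both files are assumed to have an `id` column, and the corruption is simulated by reading UTF-8 bytes as Latin-1.

```python
import csv
import io


def load_rows(data):
    """Decode CSV bytes strictly as UTF-8 and index the rows by their `id` column."""
    # "utf-8-sig" accepts an optional byte order mark and raises
    # UnicodeDecodeError if the bytes are not valid UTF-8.
    text = data.decode("utf-8-sig")
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return {row["id"]: row for row in reader}


def changed_cells(before, after):
    """List (id, column, old value, new value) for every cell that differs."""
    old_rows, new_rows = load_rows(before), load_rows(after)
    diffs = []
    for item_id, old_row in old_rows.items():
        new_row = new_rows.get(item_id, {})
        for column, old_value in old_row.items():
            new_value = new_row.get(column)
            if new_value != old_value:
                diffs.append((item_id, column, old_value, new_value))
    return diffs


original = "id,dc.title\n1,Résumé of findings\n".encode("utf-8")
# Simulate a spreadsheet that read the UTF-8 bytes as Latin-1 and saved them again.
resaved = original.decode("latin-1").encode("utf-8")

diffs = changed_cells(original, resaved)
for item_id, column, old_value, new_value in diffs:
    print(f"item {item_id}, {column}: {old_value!r} -> {new_value!r}")
```

Any cell reported here that nobody meant to edit is a candidate for an encoding problem and is worth checking before the file goes back into DSpace.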
OpenRefine
Throughout the discussion, participants often mentioned OpenRefine (http://openrefine.org/) as a great application for editing CSV exports. There was enough interest in the tool that it may be useful to organize a workshop on it. This workshop could build on the OpenRefine workshop organized by Code4lib. DCAT members who have more information on the Code4lib OpenRefine workshop are invited to share their knowledge, or (links to) any related documents, in the comments.
DSpace Bulk ingest & export
Simple Archive Format
DSpace offers bulk ingest through the Simple Archive Format. This is an archive containing a directory for each item. Each item directory consists of a file containing the item's metadata together with all of the item's bitstreams.
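As an illustration of that layout, the Python sketch below writes one item directory with a `dublin_core.xml` metadata file, a `contents` file listing the bitstreams, and the bitstream itself. The function name `write_saf_item`, the directory name `item_000`, and the sample metadata are hypothetical; tools such as those mentioned in the comments handle many more cases.

```python
from pathlib import Path
import tempfile
import xml.etree.ElementTree as ET


def write_saf_item(archive_dir, item_name, metadata, bitstreams):
    """Write one item directory.

    metadata: list of (element, qualifier, value) tuples for Dublin Core fields.
    bitstreams: dict mapping a file name to its bytes.
    """
    item_dir = Path(archive_dir) / item_name
    item_dir.mkdir(parents=True, exist_ok=True)

    # Metadata file: one <dcvalue> per field inside a <dublin_core> root.
    root = ET.Element("dublin_core")
    for element, qualifier, value in metadata:
        dcvalue = ET.SubElement(
            root, "dcvalue", element=element, qualifier=qualifier or "none"
        )
        dcvalue.text = value
    ET.ElementTree(root).write(
        str(item_dir / "dublin_core.xml"), encoding="utf-8", xml_declaration=True
    )

    # Bitstreams, plus a `contents` file naming each one on its own line.
    for filename, data in bitstreams.items():
        (item_dir / filename).write_bytes(data)
    listing = "".join(f"{name}\n" for name in bitstreams)
    (item_dir / "contents").write_text(listing, encoding="utf-8")
    return item_dir


with tempfile.TemporaryDirectory() as tmp:
    item = write_saf_item(
        tmp,
        "item_000",
        [("title", None, "Sample report"), ("date", "issued", "2016")],
        {"report.pdf": b"%PDF-1.4 placeholder"},
    )
    listing = sorted(p.name for p in item.iterdir())
    contents_text = (item / "contents").read_text(encoding="utf-8")
    xml_text = (item / "dublin_core.xml").read_text(encoding="utf-8")
    print(listing)
```

Repeating the call with `item_001`, `item_002`, and so on would build up an archive with one directory per item.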
Exporting search results
DSpace 6 will come with new bulk export functionality: a new tool that allows search results to be exported.
Blank spaces
There was a concern about blank values being introduced when a CSV file is exported from a spreadsheet application. Where the original CSV file had no value for a certain metadata field, the CSV file exported from the spreadsheet editor may contain a blank value. This should not be a problem, however, as it is unlikely that the DSpace batch metadata editing tool will insert a value for this blank when the CSV file is imported into DSpace.
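For anyone who prefers to rule this out before importing, cells that contain only whitespace can be reset to empty. This is a minimal Python sketch with a hypothetical function name, `strip_blank_cells`, and made-up sample rows; it makes no claim about how DSpace itself treats such cells.

```python
import csv
import io


def strip_blank_cells(csv_text):
    """Return (cleaned CSV text, number of whitespace-only cells that were emptied)."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    emptied = 0
    for row in reader:
        cleaned = []
        for cell in row:
            # A non-empty cell that is all whitespace is treated as blank.
            if cell and not cell.strip():
                emptied += 1
                cell = ""
            cleaned.append(cell)
        writer.writerow(cleaned)
    return out.getvalue(), emptied


# Hypothetical export in which the spreadsheet left a single space in one cell.
sample = (
    "id,dc.title,dc.description\n"
    "1,First report, \n"
    "2,Second report,\n"
)
cleaned_text, emptied = strip_blank_cells(sample)
print(emptied)
print(cleaned_text)
```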
Call Attendees
- Maureen Walsh - The Ohio State University
- Ignace Deroost - Atmire
- Irene Berry - Naval Postgraduate School
- Anna Dabrowski - Texas A&M University (http://oaktrust.library.tamu.edu)
- Terrence W Brady - Georgetown University
- Mariya Maistrovskaya - University of Toronto
- Jose Carvalho
- Filipe Furtado
- Valerie Collins - University of Minnesota
- Marianne Reed - University of Kansas
- Felicity Dykas - University of Missouri–Columbia
- Anne Lawrence - Virginia Tech
- Sarah Potvin - Texas A&M University
- Daniel Draper - Colorado State University
- Iryna Kuchma - EIFL
- Elias Tzoc - Miami University
- Monica Rivero - Rice University
- Susan Borda - Montana State University
- Bill Kelm - Willamette University
22 Comments
Terrence W Brady
The following Wiki Documentation is very useful when considering the bulk edit feature
If it is useful to the discussion, the following links describe how the Georgetown University Library uses the bulk edit/bulk ingest features.
Terrence W Brady
If the language field is inconsistent in your metadata, it may need to be normalized for the metadata update to work properly. See http://stackoverflow.com/questions/27277594/dspace-how-does-text-lang-get-set-during-item-submission/27280569#27280569
Terrence W Brady
In response to the question about language, here is the SQL we use for normalizing language. This will need to be updated in DSpace 6.
Susan Borda
Terry, was this supposed to be a link?
Terrence W Brady
It is a code snippet. Are you having trouble viewing it?
Terrence W Brady
Similar to other suggestions made on the call, we use the following tool to load CSV data to Google Sheets to avoid auto correction of text: https://github.com/Georgetown-University-Libraries/PlainTextCSV_GoogleAppsScript
This process has also reduced some of the character encoding issues for us.
Pauline Ward
I'll have to give my apologies, two urgent data curation requests, sorry.
Maureen Walsh
We use SAFBuilder for batch loading: https://github.com/DSpace-Labs/SAFBuilder and Simple Archive Format Packager
And we regularly use batch metadata editing: http://kb.osu.edu/dspace/handle/1811/47279
Anna Dabrowski
I also recommend the following for batch import of metadata (CSV file) and files. They both make upload-ready ZIP files.
PySAF (easy to use): https://github.com/cstarcher/pysaf
SAFCreator: https://github.com/jcreel/SAFCreator
Sarah Potvin
James Silas Creel ran a workshop on the SAFCreator this past spring 2016 at the Texas Conference on Digital Libraries. Slides from that talk are linked from here: https://conferences.tdl.org/tcdl/index.php/TCDL/TCDL2016/paper/view/916
Marianne Reed
Best practices:
I use the Export Metadata feature in DSpace a lot to make bulk metadata changes and have found a few tricks that help us do this efficiently.
Sarah Potvin
Hearing interest on this call for tips, tools, and approaches for metadata clean-up...
Sarah Potvin
Including: OpenRefine, cheat sheet of regular expressions, how to import into Excel or LibreOffice without corrupting dates or other encoding.
Sarah Potvin
Anna Dabrowski recommends Patterns Regex App for Macs: https://itunes.apple.com/us/app/patterns-the-regex-app/id429449079?mt=12
Marianne Reed
Agree that OpenRefine is very useful in finding and fixing anomalous metadata. Great for finding issues with self-submissions by faculty.
Terrence W Brady
I like your workshop suggestion. I would be happy to assist.
Anna Dabrowski
Also happy to help with workshop or further discussions
Mariya Maistrovskaya
For shuffling metadata around, including author name reversal:
Code4lib tutorial for OpenRefine from a workshop at University of Toronto - https://github.com/code4libtoronto/2016-07-28-librarycarpentrylessons/blob/master/openrefine/OpenRefine.md
If you Google OpenRefine for metadata editing, you may find solutions and steps that have already been documented by other librarians.
Anna Dabrowski
Patterns: helpful for working with/learning regular expressions (Mac only)
http://krillapps.com/patterns/
I'm also very interested in helping put together a workshop or larger discussion about tools and workflows.
Susan Borda
Here's a web tool for learning and testing regex: http://regexr.com/
Monica Rivero
Sarah Potvin mentions UNT name app, http://digital2.library.unt.edu/name/ and code at https://github.com/unt-libraries/django-name
Jose Carvalho
Our OAI-PMH validator: http://validator.rcaap.pt/validator2/?locale=en (it's a public service you can use).