Date & Time

  • October 11th 15:00 UTC/GMT - 11:00 ET

This call is a Community Forum call: Sharing best practices and challenges in the use of existing DSpace features

Dial-in

We will use the international conference call dial-in. Please follow directions below.

  • U.S.A/Canada toll free: 866-740-1260, participant code: 2257295
  • International toll free: http://www.readytalk.com/intl 
    • Use the above link and input 2257295 and the country you are calling from to get your country's toll-free dial-in number
    • Once on the call, enter participant code 2257295

Agenda

Community Forum Call: DSpace Importing and Bulk Metadata Editing

Sharing best practices, challenges, and questions

  • DSpace Importing and Bulk Metadata Editing
    • Building simple archive format structures/folders
    • Working with the spreadsheet bulk editing tool
    • Command line imports

 

Preparing for the call

Bring the questions/comments you would like to discuss to the call, or add them to the comments on this meeting page.

If you can join the call, or are willing to comment on the topics submitted via the meeting page, please add your name, institution, and repository URL to the Call Attendees section below.

Meeting notes

Batch Metadata Editing

DSpace offers a default batch metadata editing feature that allows administrators to export metadata as a CSV file. The CSV file can be opened in a spreadsheet application, where the metadata can be edited. After editing, administrators save the file as CSV again and import it back into DSpace.
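
A hedged sketch of this round trip from the command line (the collection handle, file names, and e-mail address below are placeholders; the metadata-export and metadata-import commands and flags follow the DSpace documentation):

    # Export the metadata of a collection, identified by its handle, to CSV.
    [dspace]/bin/dspace metadata-export -i 123456789/10 -f export.csv

    # ... edit export.csv in a spreadsheet application or OpenRefine ...

    # Re-import the edited CSV; DSpace lists the detected changes and asks
    # for confirmation before applying them.
    [dspace]/bin/dspace metadata-import -f export.csv -e admin@example.edu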

Georgetown University created several tools as an extension of the standard DSpace batch editing functionality; these tools will become part of the DSpace 6 codebase.

UTF-8 encoding issue

When using the batch metadata editing functionality, metadata sometimes gets corrupted when the CSV file is opened in a spreadsheet application. Some characters are not read correctly as UTF-8, which results in erroneous metadata values when the file is saved back to CSV, even if the metadata values themselves were never altered.

According to DCAT, this is due to a lack of correct encoding support in (certain) spreadsheet applications.
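
A minimal pre-import sanity check, assuming a Unix-like environment with iconv available (the file name export.csv is illustrative):

    # iconv exits non-zero, reporting the offending byte offset, if the file
    # contains sequences that are not valid UTF-8.
    iconv -f UTF-8 -t UTF-8 export.csv > /dev/null \
      && echo "export.csv is valid UTF-8" \
      || echo "export.csv contains invalid UTF-8"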

OpenRefine

Throughout the discussion, participants often mentioned OpenRefine (http://openrefine.org/) as a great application for editing CSV exports. Interest in the tool ran high enough that it may be useful to organize a workshop on the application, perhaps as an extension of the OpenRefine workshop organized by Code4lib. DCAT members who have more information on the Code4lib OpenRefine workshop are invited to share their knowledge, or (links to) any related documents, in the comments.

DSpace Bulk ingest & export

Simple Archive Format

DSpace offers bulk ingest through the Simple Archive Format: an archive containing one directory per item. Each item directory consists of a file containing the item's metadata together with all of the item's bitstreams.
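
A minimal sketch of the layout (the directory and bitstream names are illustrative; dublin_core.xml and the contents file are part of the format):

    archive/
      item_000/
        dublin_core.xml   (the item's Dublin Core metadata)
        contents          (one line per bitstream, e.g. "thesis.pdf")
        thesis.pdf        (the bitstream itself)
      item_001/
        ...

And a hedged example of the command-line import (the e-mail address, collection handle, and paths are placeholders; the flags follow the DSpace documentation):

    [dspace]/bin/dspace import --add --eperson admin@example.edu \
      --collection 123456789/10 --source /path/to/archive \
      --mapfile /path/to/mapfile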

Exporting search results

DSpace 6 will come with new bulk export functionality: a tool that allows administrators to export search results.

Blank spaces

There was a concern about blank values being introduced after exporting a CSV file from a spreadsheet application: where the original CSV file had no value for a certain metadata field, the file exported from the spreadsheet editor may contain a blank value. This should not be a problem, however, as it is unlikely that the DSpace batch metadata editing tool will insert a value for this blank when the CSV file is imported back into DSpace.
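
An illustrative CSV fragment (field names and values are made up) showing the kind of blank in question; the trailing empty cell for dc.description[en] survives the spreadsheet round trip as an empty string:

    id,collection,dc.title[en],dc.description[en]
    42,123456789/10,Sample title,

On import, the batch metadata editing tool is not expected to insert a value for an empty cell, so an item that had no value for the field should be left unchanged.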

Call Attendees


22 Comments

  1. The following wiki documentation is very useful when considering the bulk edit feature.

    If it is useful to the discussion, the following links describe how the Georgetown University Library uses the bulk edit/bulk ingest features.

    1. If the language field is inconsistent in your metadata, it may need to be normalized for the metadata update to work properly.  See http://stackoverflow.com/questions/27277594/dspace-how-does-text-lang-get-set-during-item-submission/27280569#27280569

      1. In response to the question about language, here is the SQL we use for normalizing language.  This will need to be updated in DSpace 6.

        Language Normalization
        -- Give item metadata rows without a language the default 'en'
        -- (resource_type_id = 2 restricts the update to items).
        update metadatavalue
        set text_lang='en'
        where text_lang is null
        and resource_type_id=2;

        -- Collapse the English variants to the single canonical code 'en'.
        update metadatavalue
        set text_lang='en'
        where text_lang in ('','en_US', 'en-US','en_us')
        and resource_type_id=2;
        1. Terry, was this supposed to be a link?

          1. It is a code snippet.  Are you having trouble viewing it?

    2. Similar to other suggestions made on the call, we use the following tool to load CSV data to Google Sheets to avoid auto correction of text: https://github.com/Georgetown-University-Libraries/PlainTextCSV_GoogleAppsScript

      This process has also reduced some of the character encoding issues for us.

  2. I'll have to give my apologies; two urgent data curation requests came up, sorry.

    1. I also recommend the following for batch import of metadata (CSV file) and files. They both make upload-ready ZIP files.

      PySAF (easy to use): https://github.com/cstarcher/pysaf

      SAFCreator: https://github.com/jcreel/SAFCreator

      1. James Silas Creel ran a workshop on the SAFCreator this past spring 2016 at the Texas Conference on Digital Libraries. Slides from that talk are linked from here: https://conferences.tdl.org/tcdl/index.php/TCDL/TCDL2016/paper/view/916 

  3. Best practices:

    I use the Export Metadata feature in DSpace a lot to make bulk metadata changes and have found a few tricks that help us do this efficiently.

    1. We've encountered issues when opening the CSV with Excel, because of encoding problems that corrupt special characters in the metadata. As a result, we are always careful to open these files in LibreOffice Calc and set UTF-8 encoding.
    2. We're also very careful to delete metadata columns that are not being changed, so that we don't introduce accidental changes to metadata we aren't intending to touch (a sketch of one way to do this follows below).
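
    A minimal sketch of that column-trimming step, assuming the csvkit utilities are installed (column and file names are illustrative; the id and collection columns must be kept so DSpace can match rows to items):

        # csvcut is part of csvkit (pip install csvkit); keep only the key
        # columns and the one field being edited.
        csvcut -c 'id,collection,dc.title[en]' export.csv > titles-only.csv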
  4. Hearing interest on this call for tips, tools, and approaches for metadata clean-up... 

    1. Including: OpenRefine, cheat sheet of regular expressions, how to import into Excel or LibreOffice without corrupting dates or other encoding.

      1. Agree that OpenRefine is very useful in finding and fixing anomalous metadata.  Great for finding issues with self-submissions by faculty.

    2. I like your workshop suggestion.  I would be happy to assist.

    3. Also happy to help with workshop or further discussions

  5. For shuffling metadata around, including author name reversal:

    Code4lib tutorial for OpenRefine from a workshop at University of Toronto - https://github.com/code4libtoronto/2016-07-28-librarycarpentrylessons/blob/master/openrefine/OpenRefine.md

    If you google OpenRefine for metadata editing, you may find solutions and steps that have already been documented by other librarians.


  6. Patterns: helpful for working with/learning regular expressions (Mac only)

    http://krillapps.com/patterns/


    I'm also very interested in helping put together a workshop or larger discussion about tools and workflows.

    1. Here's a web tool for learning and testing regex: http://regexr.com/

  7. Our OAI-PMH validator: http://validator.rcaap.pt/validator2/?locale=en (it's a public service you can use).