SHARE Hackathon and Community Meeting
July 11-14, 2016
Charlottesville, VA
Monday, July 11 – Hackathon Day 1
Jeff Spies
SHARE version 2. More specificity about the contents of the database
Need interfaces for SHARE. SHARE does not want to be an interface to the scholarly work
Data needs discovery and refinement
Rick Johnson
Exciting time to be involved with SHARE
Erin Braswell
OSF work space. Code at GitHub.
Provider -> Harvester -> raw_data -> Normalizer -> normalized_data -> changes -> change_set -> versions -> entities
The Harvester gets the data from the provider. Uses date restrictions to get "new" data. The normalizer creates the values that can go into the SHARE data models.
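A minimal sketch of the harvest and normalize steps, for illustration only. The record fields, the `harvest` and `normalize` function names, and the normalized shape are all assumptions made for this example; they are not the actual SHARE code or data model.

```python
from datetime import date

# Hypothetical provider records; the field names are illustrative only.
PROVIDER_RECORDS = [
    {"id": "a1", "updated": date(2016, 7, 1), "title": "  Old paper "},
    {"id": "a2", "updated": date(2016, 7, 10), "title": "New   paper"},
]


def harvest(records, start, end):
    """Return the raw records whose update date falls within [start, end]."""
    return [r for r in records if start <= r["updated"] <= end]


def normalize(raw):
    """Map one raw record onto a simplified, made-up normalized shape."""
    return {
        "provider_id": raw["id"],
        # Collapse runs of whitespace and trim the ends of the title.
        "title": " ".join(raw["title"].split()),
        "date_updated": raw["updated"].isoformat(),
    }


# The date restriction is what limits the harvest to "new" data.
raw_data = harvest(PROVIDER_RECORDS, date(2016, 7, 5), date(2016, 7, 11))
normalized_data = [normalize(r) for r in raw_data]
```

Only record `a2` falls inside the date window, so it is the only one that reaches the normalizer.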
Title issues: Unicode, LaTeX, MS Word, foreign languages. Attempt to store the language provided by the provider. Joined fields for records with multiple titles. Can be stored as a list in the extra class.
Normalizers can guess title or identifier or DOI. Normalizers are usually conservative.
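One possible way to handle multiple titles, sketched under assumptions: the `normalize_titles` name, the separator, and the `extra` layout are hypothetical, and Unicode NFC normalization is just one plausible cleanup step, not necessarily what the SHARE normalizers do.

```python
import unicodedata


def normalize_titles(titles, separator=" ; "):
    """Join several titles into one field and keep the originals as a list.

    Hypothetical helper: each title is whitespace-collapsed and converted
    to Unicode NFC form; empty titles are dropped.
    """
    cleaned = [
        unicodedata.normalize("NFC", " ".join(t.split()))
        for t in titles
        if t and t.strip()
    ]
    return {
        "title": separator.join(cleaned),
        # The individual titles are preserved in an "extra" area.
        "extra": {"titles": cleaned},
    }


# "e" followed by a combining acute accent becomes a single character.
result = normalize_titles(["Cafe\u0301  culture", "", "Culture du caf\u00e9"])
```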
Idea: data inspectors: Write Elasticsearch queries to get percentages of populated/vacant fields, by provider, by date range. Would show the density of field values in the normalized data. Could be used to draw control charts of field-value density. Mirror the values.
Idea: data inspectors: Identifiers are a problem, often come in "random".
Idea: data inspectors: feed the results back to the providers. The providers may be able to suggest enhancements to the harvesters and normalizers.
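The field-density idea can be illustrated with a small sketch in plain Python over in-memory records. In practice the counts would presumably come from Elasticsearch queries; the `field_density` function, the record layout, and the definition of "vacant" used here are assumptions made for the example.

```python
from collections import defaultdict


def field_density(records, fields):
    """Percentage of records, per provider, in which each field is populated."""
    totals = defaultdict(int)
    filled = defaultdict(lambda: defaultdict(int))
    for rec in records:
        provider = rec["provider"]
        totals[provider] += 1
        for field in fields:
            # Assumption: missing, None, and empty values all count as vacant.
            if rec.get(field) not in (None, "", [], {}):
                filled[provider][field] += 1
    return {
        p: {f: 100.0 * filled[p][f] / totals[p] for f in fields}
        for p in totals
    }


# Made-up sample of normalized records from two hypothetical providers.
records = [
    {"provider": "p1", "title": "A", "doi": "10.1/x"},
    {"provider": "p1", "title": "B", "doi": ""},
    {"provider": "p2", "title": "", "doi": None},
    {"provider": "p2", "title": "C"},
]
density = field_density(records, ["title", "doi"])
```

Running the same calculation per date range would give the series of values needed for a control chart.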
Documents can be updated – provider's id. If the metadata comes in for a record that exists, COS versions the record and provides the most current unless the query asks for versions.
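The versioning behavior described above could look roughly like this. The `VersionedStore` class and its methods are hypothetical and only illustrate the idea of keeping every version while returning the most current one by default; they are not the SHARE API.

```python
class VersionedStore:
    """Toy store that keeps every version of a record, keyed by provider id."""

    def __init__(self):
        # (provider, provider_id) -> list of metadata dicts, oldest first
        self._versions = {}

    def upsert(self, provider, provider_id, metadata):
        """Add a new version; creates the record if it does not exist yet."""
        key = (provider, provider_id)
        self._versions.setdefault(key, []).append(dict(metadata))

    def get(self, provider, provider_id, versions=False):
        """Return the most current version, or the full history if asked."""
        history = self._versions[(provider, provider_id)]
        return list(history) if versions else history[-1]


store = VersionedStore()
store.upsert("p1", "rec-1", {"title": "Draft title"})
store.upsert("p1", "rec-1", {"title": "Final title"})

current = store.get("p1", "rec-1")
history = store.get("p1", "rec-1", versions=True)
```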
See https://staging-share.osf.io/api/