SHARE Hackathon and Community Meeting
July 11-14, 2016
Charlottesville, VA
Monday, July 11 – Hackathon Day 1
Jeff Spies
SHARE version 2. More specificity about the contents of the database
Need interfaces for SHARE. SHARE does not want to be an interface to the scholarly work
Data needs discovery and refinement
Rick Johnson
Exciting time to be involved with SHARE
Erin Braswell
OSF work space. Code at GitHub.
Provider -> Harvester -> raw_data -> Normalizer -> normalized_data -> changes -> change_set -> versions -> entities
The Harvester gets the data from the provider. Uses date restrictions to get "new" data. The normalizer creates the values that can go into the SHARE data models.
Title issues: Unicode, LaTeX, MS Word, foreign languages. Attempt to store the language provided by the provider. Joined fields for titles with multiple titles. Can be stored as a list in the extra class.
Normalizers can guess title, identifier, or DOI. Normalizers are usually conservative.
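As I understood the pipeline described above, it looks roughly like the following. This is my own minimal sketch, not the actual SHARE codebase; all function and field names here are hypothetical:

```python
from datetime import date

def harvest(provider_records, since):
    """Pull only records the provider has added or updated since the last run."""
    return [r for r in provider_records if r["updated"] >= since]

def normalize(raw):
    """Map provider-specific fields onto a common model; keep leftovers in 'extra'."""
    known = {"title", "identifier", "updated"}
    normalized = {
        "title": raw.get("title", "").strip(),
        "identifier": raw.get("identifier"),
    }
    # Fields the model doesn't cover are preserved rather than dropped
    normalized["extra"] = {k: v for k, v in raw.items() if k not in known}
    return normalized

records = [
    {"title": " Fish Ecology ", "identifier": "doi:10.1/x",
     "updated": date(2016, 7, 1), "language": "en"},
    {"title": "Old Record", "identifier": "doi:10.1/y",
     "updated": date(2016, 1, 1)},
]
# Only the July record is "new" relative to the last harvest date
new = [normalize(r) for r in harvest(records, since=date(2016, 6, 1))]
```

This mirrors the idea that the harvester uses date restrictions to fetch only new data, and that the normalizer maps provider fields into the SHARE data models while stashing extras (like the provider's language value) in an extra class.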
Idea: data inspectors: write Elasticsearch queries to get percentages of populated/vacant fields, by provider, by date range. Would show the density of field values in the normalized data. Could be used to draw control charts of field value density. Mirror the values.
Idea: data inspectors: identifiers are a problem; they often come in "random" forms.
Idea: data inspectors: feed the results back to the providers. The providers may be able to suggest enhancements to the harvesters and normalizers.
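The field-density idea above could be done as an Elasticsearch aggregation; here is a simpler local sketch of the same computation over already-fetched normalized documents. This is my own illustration (names and sample data are hypothetical), not SHARE code:

```python
def field_density(docs, fields):
    """Fraction of documents with a non-empty value for each field."""
    total = len(docs)
    return {
        f: sum(1 for d in docs if d.get(f) not in (None, "", [])) / total
        for f in fields
    }

# Hypothetical normalized records from one provider
docs = [
    {"provider": "asu", "title": "A", "doi": "10.1/a"},
    {"provider": "asu", "title": "B", "doi": ""},
    {"provider": "asu", "title": "", "doi": None},
    {"provider": "asu", "title": "D"},
]
density = field_density(docs, ["title", "doi"])
# density["title"] → 0.75, density["doi"] → 0.25
```

Computed per provider and per date range, these densities are exactly the numbers you would plot on a control chart, or hand back to providers as feedback.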
Documents can be updated – provider's id. If the metadata comes in for a record that exists, COS versions the record and provides the most current unless the query asks for versions.
See https://staging-share.osf.io/api/
Tuesday – Hackathon Day 2
Wrote the share-data-inspector. Upload to GitHub and provide link here
Wednesday – Community Meeting Day 1
Keynote Siva Vaidhyanathan, UVa – The Operating System of Our Lives: How Google, Facebook and Apple plan to manage everything
Relationships with technology and information and communication changing rapidly. Mapping a game onto reality, engaging millions of people immediately into a game – Pokemon Go. Facebook Live – mapping reality into the virtual world, immediately, effortlessly, in real-time. Facebook took the video down for an hour, did not anticipate the incident of violence. 1.6B users, leading source of news for many millions of people. Facebook matches content to people. Facebook denying its level of power in the world. Google has the same position – constantly underplaying its role in pointing people at information.
We are collectively dependent on Google.
"The web is dead" – flows of data are not open docs loosely joined. Most data is moving through proprietary devices and formats. Our concept of the Internet is flawed/primitive. We have never been comfortable with the concepts of radical openness. Internet described in terms of place based metaphors "cyberspace" "Internet superhighway." Mobile devices changed that.
Apple sells boxes. Microsoft sells software. Amazon is a retailer; its largest source of revenue is AWS. Facebook sells connectivity to people. Google sells connectivity to information. They compete for labor, political power, advertising revenue, attention. Each has a plan to "win the game" – to become the operating system of our lives. Put things on our bodies, drive our cars, fully embedded in our bodies. Data flows must be proprietary and controlled. Cannot be open/standard.
Internet of Things – forget it. Seems helpful. The important thing is the monitoring and managing of people. Us. Companies must have a lot of knowledge about us. Difficult to enter the market – these companies have 18 years of data on us.
Edward Snowden showed us the data the government is collecting, and the purposes they have for the data. State actors are not benign; their surveillance often results in violence. The Chinese government works in full association with its social media companies. All states are excited by Modi, Putin, Erdogan, and the work they are doing on surveillance. Surveillance will increase.
We have voices as citizens. The Googlization of Everything.
Breakout – SHARE Notify Atom Feed
https://osf.io/share 117 providers, 7 million records. ClinicalTrials.gov, Zenodo, PLoS, Arxiv.org, Figshare, and 50 institutional providers.
How might we use the data:
- The VIVO Use Case – showcase the work of the people at an institution. All the work.
- Track work over time – increase/decrease of various kinds of work.
- Check work over time – what do we know, what does SHARE know?
- Understand the social network of scholarship – who works with whom, across institutions and across the world
- Understand the trajectory of scholarship – what areas are emerging, what areas are receding?
Atom Query String
http://osf.io/share/atom/?q=(shareProperties.source:asu)AND(title:"fish")
http://osf.io/share/atom/?q="maryann martone"OR"maryann e martone"
http://osf.io/share/atom/?q="m conlon"OR"Michael Conlon"
Blogtrottr (http://blogtrottr.com) for sending a feed digest to a mail address on a regular schedule.
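Building these Atom query URLs by hand gets fiddly once queries contain quotes and parentheses; a small helper can percent-encode the query string. A sketch, assuming the feed endpoint noted above (the helper name is my own):

```python
from urllib.parse import urlencode

BASE = "http://osf.io/share/atom/"

def share_atom_url(query):
    """Build a SHARE Notify Atom feed URL for a Lucene-style query string."""
    return BASE + "?" + urlencode({"q": query})

# The first example query from the notes, properly encoded
url = share_atom_url('(shareProperties.source:asu)AND(title:"fish")')
```

The resulting URL can then be dropped into any feed reader, or into Blogtrottr for email digests.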
Breakout session – related projects
Gary Price, Infodocket. Find more users for SHARE – high school students. Include press references to research. Semantic Scholar.
Karen Hanson, Portico, Ithaka, RMap. DiSCO – distributed scholarly compound object. Linked Open Data. Very cool. DiSCOs can relate to each other. Each DiSCO has an immutable identifier (URI) that points at the DiSCO. Assertions about the resources. No ontology restrictions. DiSCOs have a status. Using known ontological elements for connections. OSF Person to OSF Project to DataCite URI, linked. Plug in a DOI and see a graph of what RMap knows of that resource. IEEE was a sponsor; used IEEE data on publications to help validate RMap. Has RDF representation of each DiSCO. End of grant, all tools will be open source. http://rmap-project.info
Lisa Johnson, University of Minnesota. Data Curation Network. Rise of Data Sharing Culture. Role of librarians – discipline specific expertise, technology expertise.
Data curation network: Minn, Cornell, PennState, Illinois, Michigan, WUSTL. Collecting and reporting data curation experiences, metrics for results. http://sites.google.com/DataCurationNetwork
Anita de Waard, Elsevier
Hackathon Report back
Institutional Dashboard
Data Inspector
Metadata documentation
Research Data Discovery in Share
Data is coming from DataCite. Is there a data type for datasets? Yes, but perhaps not in the API yet?
Quality of data? Depends on the provider. Level of curation varies.
Sharing and discovering artifacts of the research process? Some artifacts can not be shared – proposals before funded. Data management plans before funding.
Does DataCite totally duplicate Dryad for data set consumption? Metadata might be different. Similar questions apply to other overlapping services – Dataverse and DataCite.
VIVO and SHARE
Alexander Garcia Castro
SHARE is chaotic and promiscuous. VIVO is chaste, great precision.
Research Hub
SHARE Scopus Mendeley GitHub
Match and claim
Search -> Claim -> Add -> Connect Research Objects -> Social Connections -> Done
VIVO needs an engagement strategy. Beautiful, clear models, open, reusable semantic data.
SHARE is big, but messy. Also needs an engagement strategy.
Mendeley, ResearchGate. Giving researchers something. OpenVIVO has a bit more, but still very little.
Thursday, Community Meeting Day 2
Jeff Spies, Scholarly Workflow
OSF as a platform for scholarly workflow. Slides available here: http://osf.io/9kcd3
MC Needs:
- Identity
- Extensible/local workflow
- Github issues
Prue Adler, Brandon Butler, Metadata Copyright Legal Guidance
Copyright protects the original expression of the authors. Modicum of creativity, independent creation. Copyright does not protect facts, ideas, discoveries, systems. Effort, time, expertise are irrelevant in the US, not in the UK and EU.
Merger doctrine – if the idea can be expressed in only a limited number of ways, the expression merges w/ fact and is unprotected.
Selection and arrangement of facts can be protected if creative and original.
No copyright in words, titles, and short phrases.
No copyright in blank forms (psychometrics – perhaps this is a patentable method)
MC: VIVO Project was able to work with Web of Science and SCOPUS to clarify which facts in their databases were public domain and which were not. Public Domain facts can be harvested from these systems and used in VIVO systems, effectively making the facts open and reusable by others.
Contracts can restrict reuse regardless of copyright.
Copyright applies for 70 years after the author's death.
Brian Nosek – Research Integrity
Signals – open data, open materials, preregistered. Badges are stupid, but signals helpful.
3% of articles had recognition of open data; two years later, 40% have open data. PSCI journal.
http://cos.io/top Top guidelines. 713 journals. 62 organizations in the process of review and adoption of the guidelines.
Two modes of research: exploratory, confirmation
Preregistration challenge: http://cos.io/prereg
Registered reports: