11/30/2011 - Hosted by Library of Congress
Cliff Lynch - Keynote
- Repositories ("reps") gained traction in the 90s
- Open Access movement is one origin of repositories ("author self-archiving") - Put journal articles in the institutional repository
- Disciplinary repositories - Desire to circulate pre-prints as well as open access
- In Europe, when articles were under copyright, authors put citations in the institutional repository
- Coverage at most institutions is pretty disappointing in absence of mandates; people argue that inst. reps have been a failure
- Other view of Inst Reps: Faculty now producing many artifacts that don't fall into category of text, journal articles. Repositories should provide a safe place for this material: datasets, powerpoints, software, pre-prints, image collections, inputs to research, outputs, stuff used in teaching
- Hard to measure success with this view; very different from goals of open access movement; no one can quantify how much of this material exists; little understanding of life-cycle patterns. When is it appropriate to put the stuff into the inst rep? Short time-frame not adequate for measuring success
- Thinking that non-institutional orgs should have repositories too: NGOs, gov agencies, libraries, other non-profits; lot of room for development at this point, not much uptake
- Substantial marketplace has emerged for inst reps: commercial and open source, repositories as a service; barriers are reduced
- Many variations: consortial reps, disciplinary reps.
- We are wrestling with the question of when it makes sense to do things on a disciplinary basis vs institutional basis; real advantages to disciplinary approach (sub-disciplines, special vocabularies, etc); but inst reps are the backstop right now; many disciplines will not be represented by repositories in near future; disciplinary reps tend to want to restrict to limited number of object types;
- Challenges: What sort of metadata should be included? Conflicts between generality (apply to many cases) and specificity (allow easy discovery of specialized resources);
- Metadata demands from libraries have made it a burden for faculty to deposit materials
- Search engines tend to ignore third-party metadata;
- Aggregation of metadata across repositories is important
- Need for sorting through author identity and cleaning up names
- How should inst reps relate to storage of research data; are they appropriate for short-term storage of research data? (research in high perf computing, eg); should there be a staging area before data is preserved? we don't understand these questions very well;
- Challenges for "small data" are just as great as big data
- How do we move from a patchwork of inst and disciplinary reps into a network of repositories? Some faculty have choice of both, shouldn't have to deposit twice; has been a challenging issue to migrate, though; Functional requirements of interoperable reps: extract metadata (there is a standard); make automated deposit through a protocol interface; should be able to copy from one rep to another (different from a deposit? Replication? Object Reuse and Exchange protocol used for this but is complicated)
- Repository discovery and naming: material needs to be accessible in long run; reps should assign unique, persistent IDs; how do you find reps? You want to refer to repository at inst rather than URLs or specific hosts; we need registries of reps, lookup and discovery mechanisms
- Major issue: how do you cite data used in scholarly work, make reference to data in tables, how do we make correspondences between journal articles and data?
- Institutional stewardship is a long-term commitment; commitments aren't always honored; repositories sometimes go away; stewards need to accommodate that reality
- Questions:
- Value of DataNet? Cliff: The projects are capacity building and integrative, linking repositories together, providing tools to act on them;
- Lots of distributed efforts, where should the focus be? Cliff: It's complex: many scholarly communities are stakeholders, public uses the materials; Difficult to pull everything together while respecting specialization; OR conference helps bring people together; DataCite, ORCID are helpful efforts; CNI tries to provide a home, National Academies tries to pull together science communities; international data curation conference has been good venue; Chief Research Officers at universities don't seem to have a meeting like other university execs
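The keynote's point about persistent IDs and repository registries (cite the repository at the institution, not a URL or host) can be sketched roughly as below. This is only an illustration: the `RepositoryRegistry` class, the `<repo_name>:<local_id>` ID format, and the example.edu / example.org hosts are all made up for the sketch and were not named in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class RepositoryRegistry:
    """Hypothetical registry mapping repository names to current base URLs."""

    base_urls: dict = field(default_factory=dict)

    def register(self, repo_name: str, base_url: str) -> None:
        # Re-registering a name models a repository moving to a new host.
        self.base_urls[repo_name] = base_url.rstrip("/")

    def resolve(self, persistent_id: str) -> str:
        # Assumed ID format: "<repo_name>:<local_id>".
        repo_name, sep, local_id = persistent_id.partition(":")
        if not sep or not local_id:
            raise ValueError(f"malformed persistent ID: {persistent_id!r}")
        if repo_name not in self.base_urls:
            raise KeyError(f"unknown repository: {repo_name!r}")
        return f"{self.base_urls[repo_name]}/{local_id}"


registry = RepositoryRegistry()
registry.register("example-univ", "https://repo.example.edu/items/")
citation_id = "example-univ:1234"
first_location = registry.resolve(citation_id)

# The repository migrates to a new host; the cited ID itself does not change.
registry.register("example-univ", "https://archive.example.org/objects")
second_location = registry.resolve(citation_id)

print(first_location)
print(second_location)
```

The design choice being illustrated: a citation stores only the persistent ID, so a host change means updating one registry entry rather than every published reference.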
Case Studies - Jerry Sheehan
- Delivering Data in Science (March) - in Paris
- #stirepos is today's hashtag
- Jane Greenberg - UNC Chapel Hill (Dryad)
- Dryad is a collaborative, run by a consortium of journals
- Objectives: repository for research underlying peer-reviewed publications in basic and applied sciences
- Partnership with journals, which have a data archiving policy
- Dryad associated with DataONE
- Dryad built on DSpace; work with @mire, "the company that oversees the DSpace software"
- Federated searching with TreeBASE and KNB LTER
- Dr. Ian Bruno - Cambridge Crystallographic Data Centre
- study of molecules - use in drug design and development
- 140 industrial subscribers sustain their efforts
- Sustainability is still an issue; big pharma has been impacted financially; have competition with commercial apps
- Fuzziness over where ownership rests
- Value is added for subscription service; resentment that data is not open
- H.K. Ramapriyan, Earth Science Data and Information Systems Project, NASA Goddard Space Flight Center - EOSDIS
- Earth observing satellites and earth science measurements
- Mission is to meet the challenges of climate and environmental change
- Data is available at no cost; EOSDIS provides data processing, management, interoperable data archives
- Satellite data is captured by flight operations, processed, sent to multiple data centers
- Other sources of data too; multiple types of instruments
- Middleware and associated clients provide search and access to data across all data centers
- Distributed data centers handle different types of data (e.g., National Snow & Ice Data Center)
- There is a global directory; all datasets are discoverable; cross data center searches through REVERB
- Many data visualization and analysis tools
- 5.1 petabytes of data
- DSpace@MIT
- PubMed Central
- Electronic extension of NLM's print journal archive
- Free access;
- Deposit Paths: publisher sends final article in XML or author sends manuscript file, it's processed and NIH creates XML, then deposited
- NLM has formal agreements with publishers (final copy, deposits are permanent, publishers can't withdraw content)
- They have non-exclusive license to use the content; they don't own it
- Author must retain rights to manuscript before signing publication agreement
- PubMed DTD now a NISO standard
- Library of Congress
- LOC is a holder of large datasets that are used in research (e.g., Twitter)
- Mandatory Copyright Deposit now bringing in many new files
- So far their system is discovery and delivery; lacking many repository features
- DataCite and EZID
- Creates a global citation framework for data
- Uses DOI (Digital Object Identifier)
- Take a lifecycle approach www.cdlib.org
- UC3 DCXL - open source add-in for Microsoft Excel as a data collection tool
- n2t.net/ezid (create an ID)
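A citation framework for data comes down to a small set of descriptive fields plus a persistent identifier. The sketch below formats such a record as "Creator (Year): Title. Publisher. Identifier", modeled loosely on the citation form DataCite has recommended; treat the exact punctuation as an assumption. The `DatasetRecord` type, the `format_data_citation` function, and the sample record (including its identifier) are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical minimal description of a citable dataset."""

    creators: tuple
    year: int
    title: str
    publisher: str
    identifier: str  # bare identifier string, without a resolver URL


def format_data_citation(record: DatasetRecord) -> str:
    # Assumed pattern: Creator (Year): Title. Publisher. Identifier
    creators = "; ".join(record.creators)
    return (
        f"{creators} ({record.year}): {record.title}. "
        f"{record.publisher}. doi:{record.identifier}"
    )


# Made-up record; the identifier is a placeholder, not a registered DOI.
example = DatasetRecord(
    creators=("Doe, J.", "Roe, R."),
    year=2011,
    title="Example survey data",
    publisher="Example University Repository",
    identifier="10.5072/FK2EXAMPLE",
)

citation = format_data_citation(example)
print(citation)
```

A journal article could carry a string like this in its reference list, which is one way to make the article-to-data correspondence explicit.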
- ORCID - Brian Wilson, Thomson Reuters
- Open Researcher and Contributor ID
- Allows reliable attribution of authors and contributors
- ORCID allows you to create a profile associated with your ID
- 282 participating orgs internationally; academic, publishers, government, societies, non-profits, etc.
- Mellon granted award to MIT, Harvard, Cornell to study ORCID business models
- VIVO awarded grant to ORCID for collaborative research
- Just released first code
- Will hire executive director and technical director
- Institutional seeding of profiles, delegated management
- Will use OAuth (used by Google, Facebook)
- self asserted, socially validated, organizationally asserted identity = more credible assertion
- Chris Greer - NIST
- Promote infrastructures as well as standards; consider themselves part of the "data community"
- Missing: a framework for the community to make decisions;
- Draws comparison to NIST's mandate to design interoperable smart grid; information requirements are similar; many different stakeholders; Smart Grid Interoperability Panel - consensus based organization; 724 members; architecture committee, testing and validation, security; stakeholders include standards bodies, regulators; participation is voluntary, but you must participate -- miss sequential mtgs or votes and you're voted off the island
- If the data community did this, NIST would be the convener, would have White House support