Comment: Migrated to Confluence 5.3

...

  • This is likely to be the biggest stumbling block once the immediate benefits of searching multiple distributed databases have been realized and users start to expect the logical next steps: searching for data in common across the different source institutions, and especially for connections linking researchers at one institution with colleagues at another.
    • This broader and more ambitious goal is a compelling one for CTSAs (NIH Clinical and Translational Science Award sites) due to the explicit mandates from NIH and expectations from Congress that CTSAs will be able to show evidence of increased collaboration across as well as within CTSAs.
  • The fundamental challenge is that any person, funding or research organization, conference, publisher, or journal appearing in the data from more than one institution will have multiple different URIs, and the data is unlikely to carry enough information to support disambiguation without further analysis and processing.
    • The fact that data about a person at Harvard harvested from the UF VIVO will have a UF namespace is good for provenance and will help with disambiguation but may be confusing to users, especially when it's not obvious which URI is most authoritative, as with events, organizations, journals, or funding agencies.
  • It remains to be seen which approach to the disambiguation task will be most effective. Very likely this will depend on priorities: disambiguating researchers at subscribing institutions is likely the highest priority but also potentially the most difficult.
  • Third-party information such as ORCIDs, Scopus or ResearcherID records, and VIAF will be very relevant but not uniformly populated.
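
Where a third-party identifier is present, it gives the cheapest disambiguation signal: two URIs from different institutions that carry the same ORCID very likely describe the same person. The following is a minimal sketch of that idea only. The function name, the `(uri, identifier)` input shape, and the example URIs are all hypothetical and are not part of VIVO or the linked data index builder; records without an identifier are simply skipped, which reflects the "not uniformly populated" caveat above.

```python
from collections import defaultdict


def group_by_identifier(records):
    """Group profile URIs that share the same external identifier.

    `records` is an iterable of (uri, identifier) pairs. The identifier
    (for example an ORCID) may be None or empty when the source
    institution has not populated it.
    """
    groups = defaultdict(set)
    for uri, identifier in records:
        if identifier:
            # Normalize case and surrounding whitespace before comparing.
            groups[identifier.strip().upper()].add(uri)
    # Only identifiers seen under more than one URI indicate duplicates.
    return {
        ident: sorted(uris)
        for ident, uris in groups.items()
        if len(uris) > 1
    }


# Hypothetical harvested records from two institutions.
records = [
    ("http://vivo.example-a.edu/individual/n123", "0000-0002-1825-0097"),
    ("http://vivo.example-b.edu/individual/n987", "0000-0002-1825-0097"),
    ("http://vivo.example-b.edu/individual/n555", None),
]
duplicates = group_by_identifier(records)
```

Entities with no shared identifier fall through this step untouched and would still need the statistical or name-based analysis discussed under Strategy.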

Strategy?

  • With a finite body of data for a fixed set of institutions, it may be possible to develop a disambiguation approach based on the entire corpus, including dealing with incremental changes as people join or leave the consortium. For an open-ended VIVO search with new institutions joining on a regular basis, other strategies might come into play.
    • The linked data index builder currently discards the RDF it harvests as each Solr document is created. While the resulting triple store would get large, one could keep all the RDF and run queries and analysis against the whole body of data to find duplicates and create sameAs statements where statistical evidence seems to warrant it. This strategy could perhaps, at best, deal with 75-80% of duplication, where enough information can be collected to support an analysis to a reasonable level of confidence. It should be noted, however, that several large and very well funded commercial organizations devote considerable resources to the same task, albeit at larger scale. The remaining unresolvable data could be very problematic for confidence in this more ambitious service.
    • The AgriVIVO project will use the information harvested from distributed research profiling systems into a common VIVO to offer services back to participating organizations, supporting disambiguation at the time of interactive data entry and editing. These lookup/suggest services will start with geographic names and AGROVOC terms, where web services are already available, but are planned to be extended to organizations, people, journals, events, funding agencies, and projects in the future. Note that this again is predicated on having a central triple store.
    • A less comprehensive effort could perform some analysis during the processing of RDF to "accumulate" entities thought to be distinct in an independent database targeted solely at entity disambiguation. This approach could enable performing the more tractable disambiguation without necessitating resources to hold and analyze the larger body of RDF, thereby also avoiding concerns about holding full copies of institutional data; it could similarly support services offering suggestion and/or pick lists from the central disambiguation database to participating research networking systems.
  • It will be important to look ahead to the disambiguation issues and anticipate which aspects of indexing for VIVO search could either help or hinder future disambiguation processes. It is also important to move ahead with the search, not only to produce the immediate benefits that it will provide, but also to support further analysis based on real data rather than speculation.
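
The "accumulate" approach described above can be illustrated with a small sketch, under several assumptions. The class and method names are hypothetical, and matching on a normalized label plus entity type is a deliberately naive stand-in for whatever evidence a real service would use; it might be tolerable for journals or funding agencies but is far too weak for people. The point is only to show that a compact index of (type, label, URI) entries, rather than the full harvested RDF, is enough to propose candidate sameAs statements.

```python
import re
import unicodedata
from collections import defaultdict
from itertools import combinations

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def normalize(label):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", label)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


class EntityAccumulator:
    """Hypothetical compact store used only for entity disambiguation.

    It keeps one key per (entity type, normalized label) and the set of
    URIs seen for that key; no other harvested RDF is retained.
    """

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, uri, entity_type, label):
        """Record one entity while its RDF is being processed."""
        self._index[(entity_type, normalize(label))].add(uri)

    def candidate_same_as(self):
        """Yield candidate owl:sameAs statements as N-Triples lines."""
        for uris in self._index.values():
            for a, b in combinations(sorted(uris), 2):
                yield f"<{a}> <{OWL_SAME_AS}> <{b}> ."


# Hypothetical entities harvested from two institutions.
acc = EntityAccumulator()
acc.add("http://vivo.example-a.edu/individual/j1", "Journal",
        "Journal of Biomedical Informatics")
acc.add("http://vivo.example-b.edu/individual/j9", "Journal",
        "Journal of Biomedical Informatics.")
acc.add("http://vivo.example-b.edu/individual/o2", "Organization",
        "National Institutes of Health")
candidates = list(acc.candidate_same_as())
```

The same index could back a suggestion or pick-list service: a lookup by normalized label returns the URIs already known for that entity. Candidate statements produced this way would still need review or stronger evidence before being asserted.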

...