...

  • For some people this simply means being able to search on a set of controlled vocabulary terms, usually hierarchical, with assurance that the entries being searched have been consistently tagged.  While some VIVOs have better tagging than others (notably Griffith and Melbourne), the vivosearch linked index builder does not assume that tagging is complete, nor are links to controlled vocabularies given any boost in relevance ranking.
    • It's also difficult to imagine that any single vocabulary could be adopted consistently across all VIVOs, much less across a range of research networking platforms
  • For purists, a semantic search must leverage specific ontology relationships and ideally be able to interpret natural language phrasing of queries
    • Such a query might be phrased as, "Find all people who have taught courses or received grants in gene therapy"
    • There are a number of problems in setting the bar this high, including the many challenges of interpreting natural language and translating to the available classes and relationships in an ontology – e.g., mapping "received grants" to having realized a principal investigator or co-principal investigator role on some grant. Queries might also transcend logical interpretation to assume computational elements, especially for ranking results – putting the person who has taught 10 courses over another who has taught only 2.  These problems would then be compounded by issues of inconsistent population in the many distributed sources of data.
  • VIVO search aims for the middle:
    • To leverage the structure of VIVO data in the structure of the search index – bringing in a person's publications, affiliations, awards, grants and other related entities to the search index.  In the vivosearch.org prototype, only the organizational affiliation and type are used as facets now, but additional facets on a person's collaborators, research areas, the funding source of grants, or geographic interest could fruitfully be added.
      • Note that the degree to which related information is separately indexed for faceting, rather than just included in an alltext field for text searching, will likely have a significant impact on indexing speed, and hence on cost and on how frequently updates can be run.
    • To also support text-based search across the corpus of data collected, reflecting the high likelihood that data will be sprinkled with relevant terms throughout and not just through explicit tagging
      • Content gets added to an "all text" field for each entry
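The hybrid described above, with a few structured facet fields plus one "all text" field per entry, can be illustrated with a minimal sketch. This is not the vivosearch index builder: the record layout, the field names (`facet_type`, `facet_organization`, `alltext`) and the helper functions are assumptions made up for illustration.

```python
# Illustrative only: the record layout and field names are hypothetical.

def build_index_doc(person):
    """Flatten a person record into facet fields plus one all-text field."""
    doc = {
        "uri": person["uri"],
        "name": person["name"],
        # Facet fields are kept as separate exact-match values.
        "facet_type": person.get("type", ""),
        "facet_organization": person.get("organization", ""),
    }
    # Related entities are folded into a single field for keyword search.
    text_parts = [person["name"]]
    for key in ("publications", "affiliations", "awards", "grants"):
        text_parts.extend(person.get(key, []))
    doc["alltext"] = " ".join(text_parts)
    return doc


def search(docs, term, **facets):
    """Return docs whose alltext contains term and whose facets all match."""
    term = term.lower()
    hits = []
    for doc in docs:
        if term not in doc["alltext"].lower():
            continue
        if all(doc.get(field) == value for field, value in facets.items()):
            hits.append(doc)
    return hits


def facet_counts(docs, field):
    """Count how many docs carry each non-empty value of a facet field."""
    counts = {}
    for doc in docs:
        value = doc.get(field, "")
        if value:
            counts[value] = counts.get(value, 0) + 1
    return counts
```

Each additional facet means another field to populate and keep current per entry, which is the indexing-cost trade-off noted above; text placed only in `alltext` is cheaper to index but can only be matched by keyword.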

Bottom line, this means we have to be able to describe how the vivosearch approach will add value over what might be possible by setting up a Google appliance to crawl the 60 CTSA center websites.

Data quality issues will limit effectiveness, especially for directly linking across sites

Inconsistent coding of data

  • Here our hybrid of structured indexing plus text search can be very helpful for improving recall, and the ability to facet results by type and organization can help limit the number of results to be processed.  Relevance ranking will be challenging, however, and adding further facets will increase complexity and reduce how frequently updates can be processed in production.
  • The ontology-driven approach can also be very helpful by supporting roll-up from more specific local extensions to the level where data are more complete and consistent, even if the volume of results may be large at the more general level. If one institution categorizes people at a very detailed level, while most do not, vivosearch will only provide granular results down to the level of classes in the vivo core ontology.
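The roll-up idea can be sketched as a walk up the class hierarchy until a shared core class is reached. The hierarchy below is a hand-written assumption for illustration; in particular `local:EndowedChairProfessor` is a hypothetical local extension, and a real index builder would read subclass relationships from the ontology rather than from a dictionary.

```python
# Hypothetical subclass -> superclass map, for illustration only.
PARENT = {
    "local:EndowedChairProfessor": "vivo:FacultyMember",
    "vivo:FacultyMember": "foaf:Person",
}

# Classes assumed to be shared across all sites.
CORE_CLASSES = {"vivo:FacultyMember", "foaf:Person"}


def roll_up(cls, parent=PARENT, core=CORE_CLASSES):
    """Return the nearest core class at or above cls, or None if there is none."""
    seen = set()
    while cls is not None and cls not in core:
        if cls in seen:
            return None  # guard against a cyclic hierarchy
        seen.add(cls)
        cls = parent.get(cls)
    return cls
```

Under these assumptions, a person typed with the local class would be indexed and faceted as `vivo:FacultyMember`, so results stay comparable with sites that never defined the more specific class.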

Different URIs for the same entities in different source data sets

  • This is the biggest likely stumbling block for effective long-term use of a multi-institutional search.
    • First, namespace issues: data about a person at Harvard harvested from the UF VIVO will have a UF namespace.  This is good for provenance but may be confusing to users.
    • Second, name disambiguation issues: the data will likely not carry enough information to be disambiguated without further processing.
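A rough sketch of both issues follows: the namespace of a URI records where the data was harvested, and a crude name key can at most flag records from different sites as candidates for the same person. The URIs, record layout and normalization rule are assumptions for illustration, and matching on names alone is exactly the kind of evidence that is insufficient for real disambiguation.

```python
from urllib.parse import urlparse


def source_namespace(uri):
    """Return the scheme and host of a URI, i.e. the site it was harvested from."""
    parsed = urlparse(uri)
    return f"{parsed.scheme}://{parsed.netloc}"


def normalize_name(name):
    """Crude name key: lowercase, drop punctuation, sort the tokens."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def cross_site_candidates(records):
    """Group URIs by name key, keeping groups that span more than one namespace."""
    groups = {}
    for rec in records:
        groups.setdefault(normalize_name(rec["name"]), []).append(rec["uri"])
    return {
        key: uris
        for key, uris in groups.items()
        if len({source_namespace(uri) for uri in uris}) > 1
    }
```

The output is only a list of candidates for further processing; two different people who share a name would be grouped together just as readily as one person recorded at two sites.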