...

Confusion over what semantic search means

There appears to be little consensus about what semantic search means.

  • For some people this simply means being able to search on a set of controlled vocabulary terms, usually hierarchical, with assurance that the entries being searched have been consistently tagged. While some VIVOs have better tagging than others (notably Griffith and Melbourne), the vivosearch linked index builder does not assume that tagging is complete, nor does it give links to controlled vocabularies any boost in relevance ranking.
    • It's also difficult to imagine that any single vocabulary could be adopted consistently across all VIVOs, much less across a range of research networking platforms
  • For purists, a semantic search must leverage specific ontology relationships and ideally be able to interpret natural language phrasing of queries
    • Such a query might be phrased as, "Find all people who have taught courses or received grants in gene therapy"
    • There are a number of problems in setting the bar this high, including the many challenges of interpreting natural language and translating it to the available classes and relationships in an ontology, e.g., mapping "received grants" to having realized a principal investigator or co-principal investigator role on some grant. Queries might also go beyond logical interpretation and imply computation, especially for ranking results, such as putting a person who has taught 10 courses above another who has taught only 2. These problems would then be compounded by inconsistently populated data across the many distributed sources.
  • VIVO search aims for the middle:
    • To reflect the structure of VIVO data in the structure of the search index by bringing a person's publications, affiliations, awards, grants and other related entities into that person's index entry. In the vivosearch.org prototype, only the organizational affiliation and type are currently used as facets, but additional facets on a person's collaborators, research areas, the funding source of grants, or geographic interest could fruitfully be added.
      • Note that the degree to which related information is separately indexed for faceting, rather than just included in an alltext field for text searching, will likely have a significant impact on indexing speed, and hence on cost and on how frequently the index can be updated
    • To also support text-based search across the whole corpus of collected data, since relevant terms are highly likely to be sprinkled throughout the data and not just present in explicit tags
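
As a rough illustration of the middle path described above, the following Python sketch flattens a person and their related entities into a single index document with two facet fields (affiliation and type) plus an alltext field for free-text matching. It is only a toy under stated assumptions: the record layout, the sample people, and the function names build_index_doc, search, and facet_counts are all hypothetical and are not taken from the vivosearch code, which would rely on a real search engine rather than in-memory substring matching.

```python
from collections import Counter

# Hypothetical person records, standing in for data harvested from VIVO sites.
PEOPLE = [
    {"name": "A. Example", "affiliation": "Genetics", "type": "Faculty",
     "publications": ["Gene therapy vectors for retinal disease"], "grants": []},
    {"name": "B. Example", "affiliation": "Genetics", "type": "Postdoc",
     "publications": [], "grants": ["Delivery methods for gene therapy"]},
    {"name": "C. Example", "affiliation": "Chemistry", "type": "Faculty",
     "publications": ["Catalysis in ionic liquids"], "grants": []},
]


def build_index_doc(person):
    """Flatten a person and their related entities into one index document."""
    related = person.get("publications", []) + person.get("grants", [])
    return {
        "name": person["name"],
        # Only these two fields are indexed separately for faceting.
        "facets": {"affiliation": person["affiliation"], "type": person["type"]},
        # Everything else is folded into a single free-text field.
        "alltext": " ".join([person["name"], *related]).lower(),
    }


def search(docs, term, **facet_filters):
    """Return docs whose alltext contains term and whose facets match the filters."""
    term = term.lower()
    return [
        doc for doc in docs
        if term in doc["alltext"]
        and all(doc["facets"].get(k) == v for k, v in facet_filters.items())
    ]


def facet_counts(hits, facet):
    """Count how many hits fall under each value of the given facet."""
    return Counter(doc["facets"][facet] for doc in hits)


docs = [build_index_doc(p) for p in PEOPLE]
hits = search(docs, "gene therapy")
print([doc["name"] for doc in hits])
print(facet_counts(hits, "type"))
print([doc["name"] for doc in search(docs, "gene therapy", type="Faculty")])
```

Note that the second person matches only because the grant title was folded into alltext; no explicit tagging was needed. Each additional facet would mean another separately maintained field per document, which is where the indexing cost noted above comes from.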

The bottom line is that we have to be able to describe how the vivosearch approach will add value over what might be possible by setting up a Google appliance to crawl the 60 CTSA center websites.

Data quality issues will limit effectiveness, especially for directly linking across sites

...