Deprecated. This material represents early efforts and may be of interest to historians. It does not describe current VIVO efforts.


Business model issues


Basic elements to determine operational (not development) cost
  • Amount of data from the institution
    • this affects the processing time that needs to be allocated as well as the increment to the size of the index
  • Frequency of update (again based on the processing and oversight/validation required for indexing)
  • Support
    • providing feedback on bad data, especially to people new to ontologies and RDF
    • addressing performance issues at the distributed data sources (especially if harvesting degrades the function of their production VIVO app)
    • there will have to be a startup fee that includes some number of hours of support, with the ability to redirect further support requests to a list of consultants or companies willing to provide help
Relationship of services to sponsorship

It will be much cleaner to separate sponsorship from participation in production services.

  • An institution sponsors VIVO to support the effort, as well as influence and hasten its development
  • An institution signs up for a service if it wants its data to be included in that service
  • Question: what will our policy be around in-kind development/testing/requirements gathering?
    • Simplest answer: the institutions contributing in-kind will have the most influence, but not exclusive influence or veto power.  They should contribute based on the importance of search to them, whether in general or for specific features
    • Question: how does the Kuali community handle this?  Based on what Penn representatives said, I believe they have two distinct aspects to sponsorship:
      • join the Kuali Foundation, making an investment in all the shared infrastructure and paying for the legal entity
        • when they do that, they participate in the foundation governance -- e.g., on the Kuali Rice board
      • then contribute to project costs for the product you want to use
        • e.g., participate on the governance, technical, and functional councils of the OLE project
        • as well as on the strategic governing board

What questions does this leave unanswered?

  • If an institution has no interest in sponsoring VIVO as software – say, if they run Profiles or another tool – but they want to sponsor development and ongoing improvement of the VIVO search tools, do we have a special category of sponsorship for that?
    • Answer: we already offer the standard DuraSpace bronze ($2,500), silver ($5,000), and gold ($10,000) levels for VIVO, but these do not include participation or voting rights on the VIVO Sponsors Committee (see prospectus)
    • Follow-up question: how will non-voting sponsors affect direction/priorities for any aspect of VIVO?
  • Open question: will there be a forum for sponsors to address priorities for search?
  • Open question: will VIVO search be governed differently than VIVO?
Areas that may get messy
  • Founding sponsors may expect not to be nickel & dimed
  • Balancing in-kind support, sponsorship, and service fees
  • VIVO multi-institutional search is not entirely separable from internal search at one VIVO institution 
    • There were pilot efforts to extend local search to the 8-institution index in spring 2011
    • This will likely come up again on wish lists
    • There may be interest in doing this from other platforms
      • Is the VIVO searchlight relevant here?
      • Would the OpenSocial platform-neutral approach be relevant?
Limits to what we can charge
  • The code is all open – universities may prefer to run their own (in which case we would ask that they sponsor to help keep the code updated)
  • Service providers may decide they could host competitive search services, with their own value added in tweaks to relevance ranking, etc.
  • As the price goes up, a cost-benefit analysis will steer people to other options, including custom Google search appliances, etc.

Technical Risks


Indexing is too slow

This could be a problem for two reasons

  • Indexing consumes too many resources and is costly to support
  • Updates cannot happen with sufficient frequency to satisfy local requirements or incentivize updates
    • People have more confidence in a system they feel they can control themselves, for example by seeing their corrections appear promptly

What will contribute to slowness?

  • Indexing more detail, especially detail that is more remotely connected to the individual entity being indexed
    • e.g., if you want to get the names of all co-authors for a researcher, not just the titles of their papers (see the sketch after this list)
  • Doing more to align or disambiguate data at indexing time
    • e.g., including in the index a list of possible alternative organizations, people, journals, etc. to facilitate corrections during use
  • Doing more queries, computation or analysis to improve relevance ranking
    • e.g., boosting relevance based on the number of papers or grants
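
To make the cost of this extra work concrete, here is a minimal sketch (not the actual index builder code) of fetching co-author names for one researcher over SPARQL and computing a simple boost from publication count; the endpoint URL and the exact VIVO core property names are illustrative assumptions.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://vivo.example.edu/sparql"  # hypothetical source VIVO endpoint

COAUTHOR_QUERY = """
PREFIX core: <http://vivoweb.org/ontology/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?coauthorName WHERE {
  <%(person)s> core:authorInAuthorship ?authorship .
  ?authorship core:linkedInformationResource ?pub .
  ?pub core:informationResourceInAuthorship ?otherAuthorship .
  ?otherAuthorship core:linkedAuthor ?coauthor .
  ?coauthor rdfs:label ?coauthorName .
  FILTER (?coauthor != <%(person)s>)
}
"""

def coauthor_names(person_uri):
    # One extra round trip like this per person indexed adds up quickly.
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(COAUTHOR_QUERY % {"person": person_uri})
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return sorted({row["coauthorName"]["value"] for row in rows})

def publication_boost(publication_count):
    # Toy boost for relevance ranking: grows with paper count, levels off at 2x.
    return 1.0 + min(publication_count, 50) / 50.0
```
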
Indexing interferes with performance on the site being indexed

This may become the countervailing pressure from distributed sites -- they will want frequent updates, but not if harvesting degrades the performance of their production VIVO applications.

Functional risks


Relevance ranking proves intractably messy

Relevance ranking is notoriously hard to get right

  • Optimizing for one priority (e.g., finding people) may de-optimize another (e.g., finding by topic or geography)
  • Some people will be hard to please

This will best be addressed by developing a set of test data and associated tests to verify that consistent results are being produced – and that other features, such as support for nicknames or diacritics, are not broken.
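
A minimal sketch of what such tests could look like, assuming a Solr index reachable at a known URL with a stored URI field; the Solr URL, field name, queries, and expected URIs below are placeholders to be filled in from the agreed test data set.

```python
import requests

SOLR_URL = "http://localhost:8983/solr/vivosearch/select"  # hypothetical core name

# (query string, URI expected somewhere in the top results)
EXPECTATIONS = [
    ("gene therapy", "http://vivo.example.edu/individual/n1234"),
    ("Muller", "http://vivo.example.edu/individual/n5678"),    # diacritics: should also match Müller
    ("Bob Smith", "http://vivo.example.edu/individual/n9012"),  # nickname for Robert Smith
]

def top_uris(query, rows=20):
    params = {"q": query, "rows": rows, "fl": "URI", "wt": "json"}
    docs = requests.get(SOLR_URL, params=params).json()["response"]["docs"]
    return [doc.get("URI") for doc in docs]

def test_expected_hits():
    for query, expected_uri in EXPECTATIONS:
        assert expected_uri in top_uris(query), \
            "expected %s in top results for %r" % (expected_uri, query)
```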

Confusion over what semantic search means

There appears to be little consensus about what semantic search means.

  • For some people this simply means being able to search based on a set of controlled vocabulary terms, usually hierarchical, with assurance that the entries being searched have been consistently tagged.  While some VIVOs have better tagging than others – notably Griffith and Melbourne – tagging has not been assumed to be complete by the vivosearch linked index builder, nor are links to controlled vocabularies given any boost in relevance ranking
    • It's also difficult to imagine that any single vocabulary could be adopted consistently across all VIVOs, much less across a range of research networking platforms
  • For purists, a semantic search must leverage specific ontology relationships and ideally be able to interpret natural language phrasing of queries
    • Such a query might be phrased as, "Find all people who have taught courses or received grants in gene therapy"
    • There are a number of problems in setting the bar this high, including the many challenges of interpreting natural language and translating to the available classes and relationships in an ontology – e.g., mapping "received grants" to having realized a principal investigator or co-principal investigator role on some grant. Queries might also go beyond logical interpretation to require computational elements, especially for ranking results – putting the person who has taught 10 courses over another who has taught only 2.  These problems would then be compounded by inconsistent population of the many distributed sources of data.
  • VIVO search aims in the middle –
    • To leverage the structure of VIVO data in populating the search index – bringing in a person's publications, affiliations, awards, grants and other related entities to their "document" (entry) in the search index, as is apparent on that person's VIVO page, and pulling appropriate data into the search document records for other types of entities as well.  In the vivosearch.org prototype, only the source organization and type are used as facets now, but additional facets on a person's collaborators, research areas, the funding source of grants, or geographic interest could fruitfully be added.
      • Note that the degree to which related information is separately indexed for faceting vs. just included in an alltext field for text searching will likely have a significant impact on indexing speed and hence cost and available frequency of update 
    • To also support text-based search across the corpus of data collected, reflecting the high likelihood that data will be sprinkled with relevant terms throughout and not just through explicit tagging
      • Content gets added to an "all text" field for each entry
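
As a concrete, though hypothetical, picture of this middle ground, the sketch below assembles one person's search document with a few structured facet fields plus a single alltext field; the field names are assumptions chosen for illustration rather than the actual vivosearch schema.

```python
def build_person_doc(uri, name, site_url, person_type,
                     publication_titles, grant_titles, research_areas):
    doc = {
        "URI": uri,
        "nameRaw": name,
        # structured fields that can drive facets; collaborators, funding
        # sources, or geographic interest could be added the same way
        "siteURL": site_url,
        "type": person_type,
        "researchArea": list(research_areas),
    }
    # everything searchable is also concatenated into a single catch-all field
    doc["alltext"] = " ".join(
        [name, person_type]
        + list(research_areas) + list(publication_titles) + list(grant_titles))
    return doc

doc = build_person_doc(
    uri="http://vivo.example.edu/individual/n1234",
    name="Jane Example",
    site_url="http://vivo.example.edu/",
    person_type="Faculty Member",
    publication_titles=["Gene therapy vectors"],
    grant_titles=["R01 gene therapy trial"],
    research_areas=["Gene therapy"],
)
```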

Bottom line, this means we have to be able to describe how the vivosearch approach will add value over what might be possible by setting up a Google appliance to crawl the 60 CTSA center websites.

Data quality issues will limit effectiveness, especially for directly linking across sites

Inconsistent coding of data

  • Here our hybrid of structured indexing plus text search can be very helpful for improving recall, and the ability to facet results by type and organization can assist in limiting the number of results to be processed.  Relevance ranking will be challenging, however, and efforts to add additional facets will increase the complexity and decrease the frequency with which updates can be processed in production
  • The ontology-driven approach can also be very helpful by supporting roll-up from more specific local extensions to the level where data are more complete and consistent, even if the volume of results may be large at the more general level. If one institution categorizes people at a very detailed level while most do not, vivosearch will only provide granular results down to the level of classes in the VIVO core ontology.
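
A minimal sketch of this roll-up, assuming subclass relationships have already been read from the ontologies (the local class URI below is invented): each locally extended class is mapped to its nearest ancestor in the VIVO core namespace before faceting.

```python
CORE = "http://vivoweb.org/ontology/core#"

# subclass -> direct superclass, as would be read from rdfs:subClassOf triples;
# the local extension URI is invented for illustration
SUPERCLASS = {
    "http://vivo.school.example/ontology/local#ClinicalFellow": CORE + "FacultyMember",
    CORE + "FacultyMember": "http://xmlns.com/foaf/0.1/Person",
}

def roll_up_to_core(cls):
    # walk up the subclass chain until a class in the VIVO core namespace is found
    seen = set()
    while cls and cls not in seen:
        if cls.startswith(CORE):
            return cls
        seen.add(cls)
        cls = SUPERCLASS.get(cls)
    return None  # no core ancestor known; facet at a generic level instead

assert roll_up_to_core(
    "http://vivo.school.example/ontology/local#ClinicalFellow") == CORE + "FacultyMember"
```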

Different URIs for the same entities in different source data sets

  • This is likely the biggest stumbling block once the immediate benefits of searching multiple distributed databases have been realized and users start to expect the logical next steps – being able to search for data in common across the different source institutions, and especially for connections linking researchers at one institution with colleagues at another
    • This broader and more ambitious goal is a compelling one for CTSAs (NIH Clinical and Translational Science Award sites) due to the explicit mandates from NIH and expectations from Congress that CTSAs will be able to show evidence of increased collaboration across as well as within CTSAs.
  • The fundamental challenge is that any person, funding or research organization, conference, publisher, or journal appearing in the data from more than one institution will have multiple different URIs, and the data will not likely carry enough information to support disambiguation without further analysis and processing.
    • The fact that data about a person at Harvard harvested from the UF VIVO will have a UF namespace is good for provenance and will help with disambiguation but may be confusing to users, especially when it's not obvious which URI is most authoritative, as with events, organizations, journals, or funding agencies.
  • It remains to be seen what the most effective way to approach the disambiguation task will be – very likely this will depend on priorities, with the disambiguation of researchers at subscribing institutions likely the highest priority but also potentially the most difficult.
  • Third-party information such as ORCIDs, Scopus or ResearcherID records, and VIAF will be very relevant but not uniformly populated
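
One simple tactic this suggests, sketched below on the assumption that some harvested person records carry an ORCID: group URIs by shared identifier and emit owl:sameAs statements linking them. The record contents here are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

# records as they might look after harvesting from two sites (contents invented)
harvested_people = [
    {"uri": "http://vivo.ufl.example/individual/n1", "orcid": "0000-0002-1825-0097"},
    {"uri": "http://vivo.cornell.example/individual/n2", "orcid": "0000-0002-1825-0097"},
    {"uri": "http://vivo.ufl.example/individual/n3", "orcid": None},
]

def same_as_statements(people):
    by_orcid = defaultdict(list)
    for person in people:
        if person["orcid"]:
            by_orcid[person["orcid"]].append(person["uri"])
    triples = []
    for uris in by_orcid.values():
        for a, b in combinations(sorted(uris), 2):
            triples.append((a, "http://www.w3.org/2002/07/owl#sameAs", b))
    return triples

for s, p, o in same_as_statements(harvested_people):
    print("<%s> <%s> <%s> ." % (s, p, o))
```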

Strategy?

  • With a finite body of data for a fixed set of institutions, it may be possible to develop a disambiguation approach based on the entire corpus, including dealing with incremental changes as new people arrive or leave the consortium. For an open-ended VIVO search with new institutions joining on a regular basis, other strategies might come into play
    • The linked data index builder currently discards the RDF it harvests as each Solr document is created.  While the resulting triple store would get large, one could keep all the RDF and run queries and analysis against the whole body of data to find duplicates and create sameAs statements where statistical evidence seems to warrant it.  This strategy could perhaps optimally deal with 75-80% of duplication where enough information can be collected to support an analysis to a reasonable level of confidence, but it should be noted that several large and very well funded commercial organizations devote considerable resources to the same task, albeit at larger scale. The remaining unresolvable data could be very problematic for confidence in this more ambitious service.
    • The AgriVIVO project will use the information harvested from distributed research profiling systems into a common VIVO to offer services back to participating organizations supporting disambiguation at the time of interactive data entry and editing. These lookup/suggest services will start with geographic names and Agrovoc terms, where web services are already available, but are planned to be extended to organizations, people, journals, events, funding agencies, and projects in the future. Note this again is predicated on having a central triple store.
    • A less comprehensive effort could perform some analysis during the processing of RDF to "accumulate" entities thought to be distinct in an independent database targeted solely at entity disambiguation (see the sketch after this list). This approach could enable performing the more tractable disambiguation without requiring the resources to hold and analyze the larger body of RDF, thereby also avoiding concerns about holding full copies of institutional data; it could similarly support services offering suggestion and/or pick lists from the central disambiguation database to participating research networking systems.
  • It will be important to look ahead to the disambiguation issues and anticipate what aspects of indexing for VIVO search could either help or hinder future disambiguation processes, but it's also important to move ahead with the search, not only to produce the immediate benefits it will provide, but to support further analysis based on real data rather than speculation.
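
A minimal sketch of the accumulator approach mentioned above, using invented table and column names: entities seen while processing RDF are recorded in a small standalone SQLite database keyed on a normalized label, which could then back suggestion or pick-list services for participating systems.

```python
import sqlite3
import unicodedata

def normalize(label):
    # crude key: lowercase, strip accents, and collapse whitespace
    text = unicodedata.normalize("NFKD", label).encode("ascii", "ignore").decode()
    return " ".join(text.lower().split())

conn = sqlite3.connect("disambiguation.db")
conn.execute("""CREATE TABLE IF NOT EXISTS entity (
    norm_label TEXT, uri TEXT, source_site TEXT,
    PRIMARY KEY (norm_label, uri))""")

def accumulate(label, uri, source_site):
    # record each sighting of an entity as RDF is processed; repeats are ignored
    conn.execute("INSERT OR IGNORE INTO entity VALUES (?, ?, ?)",
                 (normalize(label), uri, source_site))
    conn.commit()

def suggest(label):
    # pick-list service: URIs already seen under the same normalized label
    rows = conn.execute("SELECT uri, source_site FROM entity WHERE norm_label = ?",
                        (normalize(label),))
    return list(rows)
```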

 
