Deprecated. This material represents early efforts and may be of interest to historians. It does not describe current VIVO efforts.


Business model issues


Basic elements to determine operational (not development) cost
  • Amount of data from the institution
    • this affects the processing time that needs to be allocated as well as the increment to the size of the index
  • Frequency of update (again based on the processing and oversight/validation required for indexing)
  • Support
    • providing feedback on bad data, especially to people new to ontologies and RDF
    • addressing performance issues at the distributed data sources (especially if harvesting degrades the function of their production VIVO app)
    • there will have to be a startup fee with some number of hours of support included, and then the ability to redirect further support to a list of consultants or companies willing to provide help
Relationship of services to sponsorship

It will be much cleaner to separate sponsorship from participation in production services.

  • An institution sponsors VIVO to support the effort, as well as influence and hasten its development
  • An institution signs up for a service if it wants its data to be included in that service
  • Question: what will our policy be around in-kind development/testing/requirements gathering?
    • Simplest answer: the institutions contributing in-kind will have the most influence, but not exclusive influence or veto power.  They should be contributing based on the importance of search to them, both in general and for specific features
    • Question: how does the Kuali community handle this?  Based on what Penn representatives have said, I believe Kuali sponsorship has two distinct aspects:
      • join the Kuali Foundation, which represents an investment in the shared infrastructure and pays for the legal entity
        • institutions that join participate in the foundation governance -- e.g., on the Kuali Rice board
      • then contribute to project costs for the product they want to use
        • e.g., participate on the governance, technical, and functional councils of the OLE project
        • as well as on the strategic governing board

What questions does this leave unanswered?

  • If an institution has no interest in sponsoring VIVO as software – say, if it runs Profiles or another tool – but it wants to sponsor development and ongoing improvement of the VIVO search tools, do we have a special category of sponsorship for that?
    • Answer: we already offer the standard DuraSpace bronze ($2,500), silver ($5,000), and gold ($10,000) levels for VIVO, but these do not include participation or voting rights on the VIVO Sponsors Committee (see prospectus)
    • Follow-up question: how will non-voting sponsors affect direction/priorities for any aspect of VIVO?
  • Open question: will there be a forum for sponsors to address priorities for search?
  • Open question: will VIVO search be governed differently than VIVO?
Areas that may get messy
  • Founding sponsors may expect not to be nickel & dimed
  • Balancing in-kind support, sponsorship, and service fees
  • VIVO multi-institutional search is not entirely separable from internal search at one VIVO institution 
    • There were pilot efforts to extend local search to the 8-institution index in spring 2011
    • This will likely come up again on wish lists
    • There may be interest in doing this from other platforms
      • Is the VIVO searchlight relevant here?
      • Would the OpenSocial platform-neutral approach be relevant?
Limits to what we can charge
  • The code is all open – universities may prefer to run their own instance (in which case we would ask that they sponsor to help keep the code updated)
  • Service providers may decide they could host competitive search services, with their own value added in tweaks to relevance ranking, etc.
  • As the price goes up, a cost benefit analysis will steer people to other options including custom Google search appliances, etc.

Technical risks


Indexing is too slow

This could be a problem for two reasons:

  • Indexing consumes too many resources and is costly to support
  • Updates cannot happen frequently enough to satisfy local requirements or to incentivize local corrections
    • People have more confidence in a system they can influence directly, by seeing their corrections take effect

What will contribute to slowness?

  • Indexing more detail, especially detail that is more remotely connected to the individual entity being indexed
    • e.g., if you want to get the names of all co-authors for a researcher, not just the titles of their papers (see the sketch after this list)
  • Doing more to align or disambiguate data at indexing time
    • e.g., including in the index a list of possible alternative organizations, people, journals, etc. to facilitate corrections during use
  • Doing more queries, computation or analysis to improve relevance ranking
    • e.g., boosting relevance based on the number of papers or grants
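To make the first point above concrete, here is a minimal sketch (Python, querying a SPARQL endpoint over HTTP) of the extra traversal involved in collecting co-author names rather than just paper titles. The endpoint URL and property names (ex:authorOf, ex:hasAuthor) are illustrative placeholders rather than the actual VIVO core ontology terms; in a real VIVO graph, authorship is mediated by context nodes, which adds still more hops per paper.

```python
# Hedged sketch: one extra hop per paper to gather co-author names.
# The endpoint and property names below are hypothetical, not the real VIVO terms.
import requests

ENDPOINT = "https://vivo.example.edu/sparql"  # hypothetical SPARQL endpoint

COAUTHOR_QUERY = """
PREFIX ex:   <http://example.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?coauthorName WHERE {
  <http://vivo.example.edu/individual/n123> ex:authorOf ?paper .   # researcher -> paper
  ?paper    ex:hasAuthor ?coauthor .                               # paper -> every author
  ?coauthor rdfs:label   ?coauthorName .
}
"""

def fetch_coauthor_names():
    """Run one such query per researcher; across thousands of profiles the load multiplies."""
    resp = requests.get(
        ENDPOINT,
        params={"query": COAUTHOR_QUERY},
        headers={"Accept": "application/sparql-results+json"},
        timeout=60,
    )
    resp.raise_for_status()
    return [b["coauthorName"]["value"] for b in resp.json()["results"]["bindings"]]
```
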
Indexing interferes with performance on the site being indexed

This may become the countervailing pressure from distributed sites, especially if harvesting degrades the performance of their production VIVO applications.

Functional risks


Relevance ranking proves intractably messy

Relevance ranking is notoriously hard to get right

  • Optimizing for one priority (e.g., prioritize finding people) may de-optimize another (e.g., prioritize finding by topic or geography) 
  • Some people will be hard to please

This will best be addressed by developing a set of test data and associated tests to verify that consistent results are being produced – and that other features, such as support for nicknames or diacritics, are not broken.
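A minimal sketch of what such tests might look like, assuming a fixed set of curated test records loaded into a Solr-style index; the select URL, query strings, and expected record identifiers are all placeholders.

```python
# Hedged sketch of pytest-style regression tests over a curated test index.
# URL, fields, and expected ids are placeholders, not the real vivosearch schema.
import requests

SOLR_SELECT = "http://localhost:8983/solr/vivosearch/select"  # hypothetical core

def search(q, rows=10):
    """Return the ids of the top results for a query, in ranked order."""
    resp = requests.get(SOLR_SELECT, params={"q": q, "rows": rows, "wt": "json"}, timeout=30)
    resp.raise_for_status()
    return [doc["id"] for doc in resp.json()["response"]["docs"]]

def test_nickname_matches_formal_name():
    # "Bill" should still surface the test record indexed under "William"
    assert "test:person-william-smith" in search("Bill Smith")

def test_diacritics_are_folded():
    # searching without the accent should still find the accented record
    assert "test:person-munoz" in search("Munoz")

def test_topic_ranking_stays_stable():
    # the top results for a known topic query should not silently reshuffle
    assert search("gene therapy", rows=3) == [
        "test:person-a", "test:grant-b", "test:paper-c",
    ]
```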

Confusion over what semantic search means

There appears to be little consensus about what semantic search means.

  • For some people this simply means being able to search based on a set of controlled vocabulary terms, usually hierarchical, with assurance that the entries being searched have been consistently tagged.  While some VIVOs have better tagging than others – notably Griffith and Melbourne – tagging has not been assumed to be complete by the vivosearch linked index builder, nor are links to controlled vocabularies given any boost in relevance ranking.
    • It's also difficult to imagine that any single vocabulary could be adopted consistently across all VIVOs, much less across a range of research networking platforms
  • For purists, a semantic search must leverage specific ontology relationships and ideally be able to interpret natural language phrasing of queries
    • Such a query might be phrased as, "Find all people who have taught courses or received grants in gene therapy"
    • There are a number of problems in setting the bar this high, including the many challenges of interpreting natural language and translating it to the available classes and relationships in an ontology – e.g., mapping "received grants" to having realized a principal investigator or co-principal investigator role on some grant (a sketch of such a translation appears after this list). Queries might also go beyond logical interpretation to assume computational elements, especially for ranking results – putting the person who has taught 10 courses above another who has taught only 2.  These problems would then be compounded by inconsistent population across the many distributed sources of data.
  • VIVO search aims in the middle –
    • To leverage the structure of VIVO data in the structure of the search index – bringing in a person's publications, affiliations, awards, grants and other related entities to the search index.  In the vivosearch.org prototype, only the organizational affiliation and type are used as facets now, but additional facets on a person's collaborators, research areas, the funding source of grants, or geographic interest could fruitfully be added.
      • Note that the degree to which related information is separately indexed for faceting vs. just included in an alltext field for text searching will likely have a significant impact on indexing speed, and hence on cost and the available frequency of update (a sketch of this trade-off appears at the end of this section)
    • To also support text-based search across the corpus of data collected, reflecting the high likelihood that data will be sprinkled with relevant terms throughout and not just through explicit tagging
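To illustrate the kind of translation the "purist" interpretation demands, here is a hand-written sketch of how the example query above might be expressed in SPARQL against a VIVO-style graph. The class and property names (ex:hasTeacherRole, ex:PrincipalInvestigatorRole, ex:hasSubjectArea, and so on) are illustrative stand-ins rather than the exact VIVO core ontology terms – finding and applying the right terms automatically is exactly the mapping problem described above.

```python
# Hedged sketch: a hand translation of "Find all people who have taught courses
# or received grants in gene therapy".  Ontology terms below are placeholders.
PURIST_QUERY = """
PREFIX ex:   <http://example.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?person ?name WHERE {
  ?person rdfs:label ?name .
  {
    # "taught courses ... in gene therapy"
    ?person ex:hasTeacherRole ?trole .
    ?trole  ex:roleIn ?course .
    ?course ex:hasSubjectArea ?area .
  } UNION {
    # "received grants" interpreted as realizing a PI or co-PI role on a grant
    ?person ex:hasRole ?grole .
    ?grole  a ?roleType .
    FILTER (?roleType IN (ex:PrincipalInvestigatorRole, ex:CoPrincipalInvestigatorRole))
    ?grole  ex:roleIn ?grant .
    ?grant  ex:hasSubjectArea ?area .
  }
  ?area rdfs:label ?areaLabel .
  FILTER (CONTAINS(LCASE(STR(?areaLabel)), "gene therapy"))
}
"""
```

Note that nothing in this query expresses the ranking a user would probably expect (the person with ten relevant courses ahead of the person with two); that computational element sits outside what the ontology alone can answer.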

Bottom line, this means we have to be able to describe how the vivosearch approach will add value over what might be possible by setting up a Google appliance to crawl the 60 CTSA center websites.
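As an illustration of the faceting-versus-alltext trade-off noted above – and of the added value a structured index offers over a plain web crawl – here is a minimal sketch of the kind of document an index builder might post to Solr for one person. The core name, field names, and values are placeholders, not the actual vivosearch.org schema.

```python
# Hedged sketch: a structured Solr document for one person.  Facet fields carry
# the structure a crawler cannot easily recover; alltext keeps plain keyword search.
import json
import requests

SOLR_UPDATE = "http://localhost:8983/solr/vivosearch/update?commit=true"  # hypothetical core

person_doc = {
    "id": "http://vivo.example.edu/individual/n123",
    "name": "Jane Doe",
    "type_facet": "Person",
    "affiliation_facet": ["Example University", "Department of Genetics"],
    "research_area_facet": ["gene therapy", "genomics"],
    "grant_funder_facet": ["NIH"],
    # everything else -- paper titles, award names, collaborator names --
    # is flattened into one free-text field for ordinary keyword search
    "alltext": "Jane Doe gene therapy genomics Advances in Vector Design NIH",
}

resp = requests.post(
    SOLR_UPDATE,
    data=json.dumps([person_doc]),
    headers={"Content-Type": "application/json"},
    timeout=30,
)
resp.raise_for_status()
```

Every additional facet field implies additional queries or joins at indexing time, which is where the cost and update-frequency pressures discussed above come from.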

Data quality issues will limit effectiveness, especially for directly linking across sites
