Goals

Ability to search across multiple VIVO installations. This means:

Search
  1. What features are desired for the search?
  2. What type of search? 
  3. What is the goal of the search?
    1. Full text?  - yes
    2. "semantic"?  - future – the indexing takes advantage of the semantic structure of the VIVO ontology to include relevant text in the Solr document for each entry, but the search interface does not support queries that depend directly on the semantic relationships (e.g., find all principal investigators of grants investigating cancer who have collaborations with researchers on depression) 
    3. faceted?  - yes, though this could benefit from expansion (see the example query after this list)
    4. Complex queries? - future
    5. For people? - yes
    6. For publications, organizations, etc? - yes, but needs further refinement
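
To make the desired features concrete, here is a minimal sketch of the kind of query the search site might issue against a shared Solr index, combining full-text search, a facet, and a restriction to people. The core name and the field names (text, type, institution) are assumptions, not a settled schema.

    # Sketch: full-text, faceted search for people (Python, requests).
    # Field and core names are hypothetical placeholders.
    import requests

    params = {
        "q": "text:cancer",            # full-text search
        "fq": "type:Person",           # restrict to people
        "facet": "true",               # enable faceting
        "facet.field": "institution",  # e.g., facet by source site
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/vivosearch/select",
                        params=params)
    print(resp.json()["response"]["numFound"])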

Approaches

Build an index to support the desired types of search, provide a web site that lets users query that index, and keep that index up to date.

Approach to building the index
  1. For each institution
    1. Get a list of all URIs of interest for that institution
  2. For each URI
    1. Get the linked data RDF for the URI
    2. Build a Solr index "document" from the RDF statements for which that URI is the subject. Follow-on requests obtain additional data for related objects, based on VIVO's linked data harvesting patterns; this adds to the index a person's title(s) from their position(s) and other data from their VIVO page (real or virtual, if from another system) that would normally be indexed with that person in VIVO's internal search. (A sketch follows this list.)
      • TODO: what governs the follow-on linked data requests, and do the results differ from what is harvested into a local VIVO search?
    3. Add the document to the Solr index
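
A minimal sketch of this build loop, assuming each institution exposes a simple listing of its URIs of interest (the listing endpoint, Solr URL, and document fields are hypothetical; VIVO serves RDF for an individual's URI via linked data content negotiation):

    # Sketch: build the index from each institution's linked data.
    import requests
    from rdflib import Graph, URIRef

    SOLR_UPDATE = "http://localhost:8983/solr/vivosearch/update"

    def fetch_uris(listing_url):
        # Hypothetical: each institution publishes its URIs of interest,
        # one per line.
        return requests.get(listing_url).text.split()

    def build_document(uri):
        # Fetch the linked data RDF for the URI and flatten the statements
        # with that URI as subject into a Solr document.
        g = Graph()
        g.parse(uri)  # content negotiation returns RDF for a VIVO URI
        doc = {"id": uri, "text": []}
        for _, _, obj in g.triples((URIRef(uri), None, None)):
            doc["text"].append(str(obj))
        # Follow-on requests for related objects (positions, titles, etc.)
        # would be added here, mirroring VIVO's harvesting patterns.
        return doc

    def index_institution(listing_url):
        for uri in fetch_uris(listing_url):
            requests.post(SOLR_UPDATE, json=[build_document(uri)])
        requests.get(SOLR_UPDATE, params={"commit": "true"})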

Notes

Approach to keeping the index up-to-date
  1. For each institution
    1. Get a list of URIs that have been modified, based on the last modified date for that URI in the source system's internal VIVO search index
  2. For each URI
    1. Calculate what individuals are affected by this modification (see the sketch after this list)
    2. Add to update list
  3. For each URI in update list
    1. Get the linked data RDF for the URI
    2. Build a document using that data
    3. Add the document to the Solr index
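
A minimal sketch of this update loop, reusing build_document() and SOLR_UPDATE from the build sketch above. How affected individuals are calculated is still an open question, so a hypothetical affected_uris() stands in for that logic:

    # Sketch: incremental update of the shared index.
    import requests

    def affected_uris(modified_uri):
        # Hypothetical: return every individual whose document must change,
        # e.g., a person whose position or publication was modified. At
        # minimum, the modified individual itself is affected.
        return [modified_uri]

    def update_index(modified):
        to_update = set()
        for uri in modified:                  # expand to affected URIs
            to_update.update(affected_uris(uri))
        for uri in to_update:                 # re-fetch and re-index
            requests.post(SOLR_UPDATE, json=[build_document(uri)])
        requests.get(SOLR_UPDATE, params={"commit": "true"})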

Notes

Alternative Approaches

 TODO: what other approaches are there?

Technology Choices
  1. Some parts of the technology stack are suggested by the goal of indexing data from VIVO.
  2. In general we would go with Solr for the search index because we have experience with it, because of its documentation, because of its distributed features, and because it is mature.
  3. As of 2012, vivosearch.org uses Drupal and the ajax-solr JavaScript library ( https://github.com/evolvingweb/ajax-solr ). The library allows the search UI to be developed with only client-side interaction with Solr.
  4. To scale out the indexing process, we were planning to use Hadoop to manage parallel tasks and to run the indexing jobs on a set of VMs set up as Hadoop nodes.
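
With Hadoop Streaming, the indexing work could be expressed as a map-only job: each mapper task reads its share of the URI list from stdin and indexes it, so parallelism comes from how Hadoop splits the input. A minimal sketch of such a mapper, assuming build_document() and SOLR_UPDATE from the build sketch above:

    # Sketch: Hadoop Streaming mapper; input is one URI per line.
    import sys
    import requests

    for line in sys.stdin:
        uri = line.strip()
        if uri:
            requests.post(SOLR_UPDATE, json=[build_document(uri)])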

Notes

Technology Alternatives

  1. We could use indexing software other than Solr.
  2. What are the alternatives to Hadoop?
  3. Serving the web site could be done with just about any system that allows interaction with Solr.  

Index Updates
  1. Once the index has been created, how would it be updated?
    1. Rebuild the whole index?
    2. Get a list of modified individuals from each site and only reindex them?
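
A minimal sketch of option 2, asking one site's internal VIVO Solr index for individuals modified since the last harvest. The core URL and the indexedTime and URI field names are assumptions about the source VIVO's search schema:

    # Sketch: list URIs modified since a given time at one site.
    import requests

    def modified_since(site_solr_url, since):
        params = {
            "q": "indexedTime:[%s TO *]" % since,  # hypothetical field
            "fl": "URI",                           # hypothetical field
            "rows": "10000",
            "wt": "json",
        }
        resp = requests.get(site_solr_url + "/select", params=params)
        return [d["URI"] for d in resp.json()["response"]["docs"]]

    # e.g. modified_since("http://vivo.example.edu/solr/vivocore",
    #                     "2012-06-01T00:00:00Z")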