...

Ability to search across multiple VIVO installations. This means:

  • harvesting information from several independent installations of VIVO or other software that can export RDF compatible with the VIVO ontology
  • indexing the information harvested, including the original URI in the source system and a subset of the content associated with that URI in the source system, to facilitate text-based searching
  • providing a simple, Google-like search with options to limit in advance by type of result (e.g., people, organizations, publications, events)
  • providing results that are relevance-ranked across all the sources being searched, in contrast to federated search, where each source ranks its own results separately
  • providing short snippets of text for each result to aid interpretation
  • providing faceted display to aid users in filtering results; the two current facets are source institution and the type of result (see the query sketch after this list)
  • linking back from each result to the source so that the full scope of the result can be seen in its original context
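
To make the search behavior concrete, here is a sketch of the kind of query the UI would issue, using SolrJ. This is an illustration only: the field names (text, type, institution, uri) and the core name are assumptions, not the actual vivosearch.org schema.

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class SearchSketch {
      public static void main(String[] args) throws Exception {
          HttpSolrClient solr = new HttpSolrClient.Builder(
                  "http://localhost:8983/solr/vivosearch").build();

          SolrQuery q = new SolrQuery("cancer");   // simple, Google-like keyword query
          q.addFilterQuery("type:person");         // limit in advance by type of result
          q.setFacet(true);                        // faceted display, with the
          q.addFacetField("institution", "type");  // two current facets
          q.setHighlight(true);                    // short snippets for each result
          q.addHighlightField("text");

          QueryResponse resp = solr.query(q);
          resp.getResults().forEach(doc ->
                  System.out.println(doc.getFieldValue("uri"))); // link back to source
          solr.close();
      }
  }
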
  1. What features are desired for the search?
  2. What type of search? 
  3. What is the goal of the search?
    1. Full text?  - yes
    2. "semantic"?  - future future – the indexing takes advantage of the semantic structure of the VIVO ontology to include relevant text in the Solr document for each entry, but the search interface does not support queries that depend directly on the semantic relationships (e.g., find all principal investigators of grants investigating cancer who have collaborations with researchers on depression) 
    3. faceted?  - yes, though this could benefit from expansion
    4. Complex queries? - future
    5. For people? - yes
    6. For publications, organizations, etc.? - yes, but needs further refinement

...

  1. For each institution
    1. Get a list of all URIs of interest for that institution
  2. For each URI
    1. Get the linked data RDF for the URI
    2. Build a Solr index "document" using the RDF statements for which that URI is the subject; subsequent requests obtain additional data for related objects, following VIVO's linked data harvesting patterns. These requests add to the index a person's title(s) from their position(s) and other data from their VIVO page (real or virtual, if from another system) that would normally be indexed with that person in VIVO's internal search
      • TODO: what governs the follow-on linked data requests, and how do the results differ from what is harvested into a local VIVO search index?
    3. Add the document to the Solr index
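
A minimal sketch of this loop, assuming Apache Jena for the linked data requests and SolrJ for indexing. The schema fields and the listUrisForInstitution() helper are illustrative assumptions, and the follow-on requests for related objects are only indicated by a comment.

  import java.util.List;
  import org.apache.jena.rdf.model.*;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class IndexBuilder {
      public static void main(String[] args) throws Exception {
          HttpSolrClient solr = new HttpSolrClient.Builder(
                  "http://localhost:8983/solr/vivosearch").build();

          for (String uri : listUrisForInstitution("http://vivo.example.edu")) {
              // 1. Get the linked data RDF for the URI via content negotiation.
              Model model = ModelFactory.createDefaultModel();
              model.read(uri);

              // 2. Build a Solr "document" from the statements with this URI
              //    as subject, indexing the literal values for text search.
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("uri", uri);
              doc.addField("institution", "http://vivo.example.edu");
              StmtIterator it = model.listStatements(
                      model.createResource(uri), null, (RDFNode) null);
              while (it.hasNext()) {
                  RDFNode obj = it.next().getObject();
                  if (obj.isLiteral()) {
                      doc.addField("text", obj.asLiteral().getLexicalForm());
                  }
                  // Follow-on linked data requests for related objects
                  // (positions, etc.) would be issued here.
              }

              // 3. Add the document to the Solr index.
              solr.add(doc);
          }
          solr.commit();
          solr.close();
      }

      // Placeholder; a possible SPARQL-based implementation is sketched later.
      static List<String> listUrisForInstitution(String site) {
          return List.of("http://vivo.example.edu/individual/n123"); // example URI
      }
  }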

...

  • Same as what is currently in place
  • Current VIVO does not have a direct way to get institutional URIs; VIVO can differentiate internal from external URIs for any type of entity, and this could be useful in harvesting only the institutional URIs pertaining to the source system (see the SPARQL sketch after this list).
  • VIVO used to get RDF for each URI, then make subsequent requests as needed
    • Can investigate new approaches
  • Policy questions
    • How much data do we want to get for each resource (e.g., people)?
    • This is the kind of thing that needs to be asked of the institutions
    • Suggestion to collect these tasks in a spreadsheet
      • Include time estimates, and outstanding questions
    • How to determine when external resources have changed
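
One way the "list of all URIs of interest" step (and the listUrisForInstitution() placeholder above) could be implemented, assuming each site exposes a public SPARQL endpoint. The endpoint and the restriction to foaf:Person are assumptions for illustration; the internal/external distinction discussed above would refine the query.

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.jena.query.*;

  public class UriLister {
      // Lists the URIs of one type of interest (here, people) from one site.
      static List<String> listUrisForInstitution(String sparqlEndpoint) {
          String query =
                  "PREFIX foaf: <http://xmlns.com/foaf/0.1/> " +
                  "SELECT ?uri WHERE { ?uri a foaf:Person }";
          List<String> uris = new ArrayList<>();
          try (QueryExecution qe =
                       QueryExecutionFactory.sparqlService(sparqlEndpoint, query)) {
              ResultSet results = qe.execSelect();
              while (results.hasNext()) {
                  uris.add(results.next().getResource("uri").getURI());
              }
          }
          return uris;
      }
  }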

...

  1. For each institution
    1. Get a list of URIs that have been modified, based on the last modified date for that URI in the source system's internal VIVO search index
  2. For each URI
    1. Calculate what individuals are affected by this modification
    2. Add to update list
  3. For each URI in update list
    1. Get the linked data RDF for the URI
    2. Build a document using that data
    3. Add the document to the Solr index
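
A sketch of this update pass under the same assumptions as the index-building sketch. fetchModifiedUris() and affectedIndividuals() are hypothetical, since how a source system reports modifications and how a change propagates to related individuals are still open questions.

  import java.util.HashSet;
  import java.util.List;
  import java.util.Set;
  import org.apache.jena.rdf.model.Model;
  import org.apache.jena.rdf.model.ModelFactory;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class IncrementalUpdater {
      public static void main(String[] args) throws Exception {
          HttpSolrClient solr = new HttpSolrClient.Builder(
                  "http://localhost:8983/solr/vivosearch").build();

          // 1-2. Expand each modified URI to the set of affected individuals.
          Set<String> updateList = new HashSet<>();
          for (String uri : fetchModifiedUris("http://vivo.example.edu",
                                              "2012-06-01T00:00:00Z")) {
              updateList.addAll(affectedIndividuals(uri));
          }

          // 3. Re-fetch and rebuild each affected document. Assuming "uri" is
          //    the schema's unique key, re-adding replaces the old document.
          for (String uri : updateList) {
              Model model = ModelFactory.createDefaultModel();
              model.read(uri);
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("uri", uri);
              // ... populate the remaining fields as in the index-building sketch ...
              solr.add(doc);
          }
          solr.commit();
          solr.close();
      }

      // Hypothetical: how a site reports modifications is an open question.
      static List<String> fetchModifiedUris(String site, String since) {
          return List.of();
      }

      // Hypothetical: e.g., a modified position also affects the person holding it.
      static Set<String> affectedIndividuals(String uri) {
          return Set.of(uri);
      }
  }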

...

  • The hope is that the approach is the same as building the index, with different input
Alternative Approaches

 TODO: what other approaches are there?

  • What use should be made of the institutional internal class, if populated, to limit the data harvested to what is part of the institution being harvested? (Note that we can't rely on this being populated, especially for data not produced by the VIVO software.)
    • This may not always have the intended effect – e.g., it may be desirable to harvest funding agencies but not the names of the institutions listed with educational training.
  • Should the data harvested align with what is included in a VIVO internal search, or be much more limited (both by harvesting only certain types and by doing fewer follow-on queries for data closely related to the individual being harvested)?
Technology Choices
  1. There are some parts of the technology stack that are suggested by the goal of indexing data from VIVO. 
    • Using HTTP requests for RDF to gather data from the sites is the most direct approach (see the fetch sketch after this list).
    • Most other options for gathering data from the VIVO sites would need additional coding.
  2. In general we would go with Solr for the search index because we have experience with it, because of its documentation, because of its distributed features, and because it is mature.
  3. As of 2012 vivosearch.org uses Drupal and the ajax-solr JavaScript library ( https://github.com/evolvingweb/ajax-solr ). The library allows the search UI to be developed with only client-side interaction.
    • This choice could be revisited for the multi-site VIVO search project. 
  4. In order to scale the process out we were planning to use Hadoop to manage parallel tasks.
    • Many approaches to the problem of indexing linked data from VIVO sites would be embarrassingly parallel.
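
To make item 1 concrete: a plain GET with an Accept header is all the "HTTP requests for RDF" approach requires (Jena's Model.read(), used in the sketches above, performs this negotiation for you). The URI here is an example.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class LinkedDataFetch {
      public static void main(String[] args) throws Exception {
          URL url = new URL("http://vivo.example.edu/individual/n123"); // example URI
          HttpURLConnection conn = (HttpURLConnection) url.openConnection();
          conn.setRequestProperty("Accept", "application/rdf+xml"); // ask for RDF
          try (BufferedReader in = new BufferedReader(
                  new InputStreamReader(conn.getInputStream()))) {
              in.lines().forEach(System.out::println); // the RDF for the individual
          }
      }
  }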

...