...

  • harvesting information from several independent installations of VIVO, or of other software that can export RDF compatible with the VIVO ontology, in one of three (or possibly more) ways
    • responding to linked open data requests in one of several RDF serializations
      • note that this may be directly from a VIVO application or from Harvard Profiles
      • or from another application configured to return RDF
        • e.g., Iowa's Loki software does not store data natively in RDF but can return it in response to linked data requests
        • or using D2R (http://d2rq.org)
        • or using tools such as John Fereira's semantic services, although these were designed to deliver data from VIVO to other applications not configured to consume RDF directly
    • returning an entire file of RDF from a web-accessible directory (a file with only the statements about the URI requested; it may also be possible to return one big file containing that URI)
    • responding to SPARQL query requests from a public SPARQL endpoint
      • or, if the harvesting tool is provided with credentials, from a private SPARQL endpoint
  • indexing the information harvested, including the original URI in the source system and a subset of the content associated with that URI in the source system, to facilitate text-based searching
  • providing a simple, Google-like search with options to limit in advance by type of result (e.g., people, organizations, publications, events)
  • providing results that have been relevance ranked across all the sources being searched, in contrast to federated searches, where each source ranks its own results separately
  • providing short snippets of text for each result to aid interpretation
  • providing faceted display to aid users in filtering results; the two current facets are source institution and the type of result
  • linking back from each result to the source so that the full scope of the result can be seen in its original context
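
As an illustration, both of the first two harvesting routes above are plain HTTP, so a harvester needs very little machinery. A minimal sketch in Python; the host, example URI, and endpoint path are hypothetical placeholders, and real installations will differ:

```python
import requests

# Hypothetical example URI and endpoint; real hosts and paths will vary.
PROFILE_URI = "http://vivo.example.edu/individual/n1234"
SPARQL_ENDPOINT = "http://vivo.example.edu/sparql"

# 1. Linked open data request: content negotiation asks the application
#    for an RDF serialization instead of the human-readable profile page.
resp = requests.get(PROFILE_URI, headers={"Accept": "application/rdf+xml"})
resp.raise_for_status()
rdf_about_uri = resp.text  # statements about the requested URI

# 2. SPARQL endpoint request (public, or private if the harvester is
#    given credentials), using the standard SPARQL protocol over HTTP.
query = "CONSTRUCT { <%s> ?p ?o } WHERE { <%s> ?p ?o }" % (PROFILE_URI, PROFILE_URI)
resp = requests.get(
    SPARQL_ENDPOINT,
    params={"query": query},
    headers={"Accept": "text/turtle"},
    # auth=("harvester", "secret"),  # hypothetical credentials, private endpoint only
)
resp.raise_for_status()
rdf_from_sparql = resp.text
```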

...

  1. What features are desired for the search?
  2. What type of search? 
  3. What is the goal of the search?
    1. Full text?  - yes
    2. "semantic"?  - future – the indexing takes advantage of the semantic structure of the VIVO ontology to include relevant text in the Solr document for each entry, but the search interface does not support queries that depend directly on the semantic relationships (e.g., find all principal investigators of grants investigating cancer who have collaborations with researchers on depression) 
    3. faceted?  - yes, though this could benefit from expansion
    4. Complex queries? - future
    5. For people? - yes
    6. For publications, organizations, etc.? - yes, but needs further refinement
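
The "yes" answers above (full text, limits by type, facets) map directly onto standard Solr query parameters. A minimal sketch of such a search request; the core name and the field names (type, institution, URI, name) are assumptions for illustration, not the actual vivosearch.org schema:

```python
import requests

SOLR_SELECT = "http://localhost:8983/solr/vivosearch/select"  # hypothetical core

params = {
    "q": "cancer",                           # Google-like full-text query
    "fq": "type:person",                     # limit in advance by result type
    "facet": "true",
    "facet.field": ["institution", "type"],  # the two current facets
    "hl": "true",                            # snippets to aid interpretation
    "wt": "json",
}
results = requests.get(SOLR_SELECT, params=params).json()
for doc in results["response"]["docs"]:
    print(doc.get("URI"), doc.get("name"))   # field names are assumptions
```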

Approaches

Make an index to support the desired types of search, and provide a web site that helps users query that index. Keep the index up to date.

Approach to building the index
  1. For each institution
    1. Get a list of all URIs of interest for that institution
  2. For each URI
    1. Get the linked data RDF for the URI
    2. Build a Solr index "document" using the RDF statements for which that URI is the subject; subsequent requests obtain additional data for related objects, following VIVO's linked data harvesting patterns. These requests add to the index a person's title(s) from their position(s) and other data from their VIVO page (real or virtual, if from another system) that would normally be indexed with that person in VIVO's internal search
      • TODO: what governs the follow-on linked data requests, and do the results differ from what is harvested into a local VIVO search?
    3. Add the document to the Solr index
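
A minimal sketch of the three per-URI steps above, assuming rdflib for the linked data fetch and Solr's JSON update handler; the Solr URL and field names are hypothetical, and the follow-on requests for related objects (the TODO above) are omitted:

```python
import requests
from rdflib import Graph, URIRef

SOLR_UPDATE = "http://localhost:8983/solr/vivosearch/update"  # hypothetical core

def build_document(institution, uri):
    """Steps 2.1-2.2: fetch linked data for a URI, flatten it into a Solr doc."""
    g = Graph()
    g.parse(uri)  # rdflib content-negotiates an RDF serialization for the URI
    # Index the statements for which this URI is the subject. (The follow-on
    # requests for related objects -- positions, titles, etc. -- are omitted.)
    text = " ".join(str(o) for _, o in g.predicate_objects(subject=URIRef(uri)))
    return {"URI": uri, "institution": institution, "alltext": text}

def index_institution(institution, uris):
    """Step 2.3: add the documents to the Solr index and commit."""
    docs = [build_document(institution, uri) for uri in uris]
    resp = requests.post(SOLR_UPDATE, json=docs, params={"commit": "true"})
    resp.raise_for_status()
```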

...

  1. There are some parts of the technology stack that are suggested by the goal of indexing data from VIVO. 
    • Using HTTP requests for RDF to gather data from the sites is the most direct approach.
    • Most other options for gathering data from the VIVO sites would need additional coding.
  2. In general we would go with Solr for the search index because we have experience with it, because of its documentation, because of its distributed features, and because it is mature.
  3. As of 2012 vivosearch.org uses Drupal and the ajax-solr JavaScript libraries. The JS libraries allow the search UI to be developed with only client-side interaction ( https://github.com/evolvingweb/ajax-solr ).
    • This choice could be revisited for the multi-site VIVO search project. 
  4. In order to scale the process out we were planning to use Hadoop to manage parallel tasks and to run the indexing jobs on a set of VMs set up as Hadoop nodes.
    • Many approaches to the problem of indexing linked data from VIVO sites would be embarrassingly parallel; see the sketch after this list.
    • Brian Caruso (Cornell) has worked with RDF indexing to Solr on Hadoop clusters on Eucalyptus clouds.
    • Consider using an IaaS abstraction layer such as jclouds, Apache libcloud, or Overmind. These allow developing against an interface which can then target many different cloud service providers. The primary goal of this would be to avoid lock-in to one cloud provider.
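
To make the embarrassingly parallel point concrete: each (institution, URI) task is independent of every other, so the per-URI work can be fanned out to any pool of workers. A local sketch using Python's multiprocessing as a stand-in for the Hadoop map tasks that would do this at scale, reusing build_document and SOLR_UPDATE from the indexing sketch above:

```python
from multiprocessing import Pool

import requests

# Assumes build_document() and SOLR_UPDATE from the indexing sketch above.

def harvest_and_index(task):
    """Fetch and index one URI; no shared state, so tasks parallelize freely."""
    institution, uri = task
    doc = build_document(institution, uri)
    requests.post(SOLR_UPDATE, json=[doc])  # commit once, separately, after the run

if __name__ == "__main__":
    # Hypothetical task list; a real run would enumerate every institution's URIs.
    tasks = [("example-institution", "http://vivo.example.edu/individual/n1234")]
    with Pool(processes=8) as pool:  # Hadoop map tasks play this role at scale
        pool.map(harvest_and_index, tasks)
```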

Notes

  • HTTP for retrieving RDF, yes
  • What is the adoption of SPARQL in the community?
  • It may be nice to demonstrate that a SPARQL endpoint is not needed to enable interesting results
  • Solr, seems reasonable for now
    • Considering having Solr in one place versus distributed Solr (master/slaves)
  • Web interface: Drupal with ajax-solr JS
    • Most work is on the client side, in JS
    • This continues to be appealing
    • We have limited insight into this component
    • Suggestion to create list of default technologies, criteria, and alternatives
  • Hadoop is currently reasonable choice
  • Ruby (blacklight/hydra) or Drupal?
    • The JS pattern allows for minimal reliance on Drupal
  • Need a mock-up of the UI to inform design of solr index
  • Bootstrap is an interesting front-end framework to consider
  • Drupal upgrade cycle can be onerous

...