Deprecated. This material represents early efforts and may be of interest to historians. It does not describe current VIVO efforts.
Ability to search across multiple VIVO installations.
Full text? "semantic"? faceted? other? Complex queries? For people? For publications?
Make an index that supports the desired types of search, and have a web site that helps users query that index. Keep that index up to date.
for each institution:
    get a list of all URIs of interest for that institution
    for each URI:
        get the linked data RDF for the URI
        build a document using that data
        add the document to the Solr index
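The full-build loop above can be sketched in Python as follows. This is only a minimal illustration, not the project's implementation: the function names, the document fields (`id`, `site`, `text`), and the idea of passing the fetch and index steps in as callables are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable

# Hypothetical document shape: a flat dict of field name to value.
Document = Dict[str, str]


def build_document(uri: str, rdf: str, site: str) -> Document:
    """Build an index document from the RDF fetched for one URI.

    The field names are assumptions. A real build would parse the RDF
    and extract labels, types, and so on, instead of storing raw text.
    """
    return {"id": uri, "site": site, "text": rdf}


def build_full_index(
    sites: Iterable[str],
    list_uris: Callable[[str], Iterable[str]],
    fetch_rdf: Callable[[str], str],
    add_to_index: Callable[[Document], None],
) -> int:
    """Index every URI of interest at every site; return the document count."""
    count = 0
    for site in sites:                      # for each institution
        for uri in list_uris(site):         # all URIs of interest
            rdf = fetch_rdf(uri)            # get the linked data RDF
            add_to_index(build_document(uri, rdf, site))
            count += 1
    return count
```

Passing `fetch_rdf` and `add_to_index` as parameters keeps the loop independent of any particular HTTP client or Solr client, so it can be exercised with stand-in functions before a real index exists.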
for each institution:
    get a list of URIs that have been modified
    for each URI:
        calculate what individuals are affected by this modification
        add to update list
    for each URI in update list:
        get the linked data RDF for the URI
        build a document using that data
        add the document to the Solr index
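The incremental update above might look like the following sketch. How a site reports modified URIs and how the affected individuals are calculated are open questions in these notes, so both are represented here by hypothetical callables (`affected_by`, `fetch_rdf`, `add_to_index`); the document fields are likewise assumptions.

```python
from typing import Callable, Dict, Iterable, List

Document = Dict[str, str]


def collect_update_list(
    modified: Iterable[str],
    affected_by: Callable[[str], Iterable[str]],
) -> List[str]:
    """Expand modified URIs into the de-duplicated list of URIs to reindex."""
    seen = set()
    update_list: List[str] = []
    for uri in modified:
        for affected in affected_by(uri):
            if affected not in seen:        # reindex each individual only once
                seen.add(affected)
                update_list.append(affected)
    return update_list


def reindex_modified(
    modified: Iterable[str],
    affected_by: Callable[[str], Iterable[str]],
    fetch_rdf: Callable[[str], str],
    add_to_index: Callable[[Document], None],
) -> List[str]:
    """Refetch and reindex everything touched by the modifications."""
    update_list = collect_update_list(modified, affected_by)
    for uri in update_list:
        add_to_index({"id": uri, "text": fetch_rdf(uri)})
    return update_list
```

For example, if a change to a publication also affects its authors, `affected_by` would return the publication URI together with each author URI, and an author shared by two modified publications would still be reindexed only once.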
TODO: what other approaches are there?
There are some parts of the technology stack that are suggested by the goal of indexing data from VIVO. Using HTTP requests for RDF to gather data from the sites is the most direct approach. Most other options for gathering data from the VIVO sites would need additional coding.
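As a rough sketch of what "HTTP requests for RDF" could look like, the Python below asks for an RDF serialization through the `Accept` header. The example URI and the choice of `application/rdf+xml` are assumptions, and each site may differ in which serializations it returns.

```python
import urllib.request


def rdf_request(uri: str, mime_type: str = "application/rdf+xml") -> urllib.request.Request:
    """Build a GET request that asks the server for RDF instead of HTML."""
    return urllib.request.Request(uri, headers={"Accept": mime_type})


def fetch_rdf(uri: str, timeout: float = 30.0) -> str:
    """Fetch the linked data RDF for one URI (performs a network call)."""
    with urllib.request.urlopen(rdf_request(uri), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Hypothetical individual URI, for illustration only.
request = rdf_request("http://vivo.example.edu/individual/n123")
```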
In general we would go with Solr for the search index because we have experience with it, because it is well documented, because of its distributed features, and because it is mature.
As of 2012, vivosearch.org uses Drupal and the solrsearch JavaScript libraries. This choice could be revisited for the multi-site VIVO search project. If the solrsearch JavaScript can provide almost all of the interactivity on the client side, it might be desirable to have the server side be as simple as possible. It may even be possible to use static HTML and .js files served by any old web server.
In order to scale the process out, we were planning to use Hadoop to manage parallel tasks. Many approaches to the problem of indexing linked data from VIVO sites would be embarrassingly parallel.
We could use index software other than Solr. What would that be? A database server with full-text capabilities? What are the other options? Are there NoSQL options with full-text search?
What are the alternatives to Hadoop? What other ways would provide sufficient management of multiple tasks? Could we just do it as multiple Java processes or multiple Java threads? OSGi? Some of the Hadoop-related systems like Hadoop Streaming or Cascading?
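To make the "multiple threads" option concrete, here is a minimal sketch using a thread pool. It is written in Python for brevity, not Java, and `index_one` is a hypothetical function that fetches, builds, and indexes a single URI. Because each URI can be handled independently, the work is simply mapped over a pool of workers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def index_in_parallel(
    uris: Iterable[str],
    index_one: Callable[[str], T],
    max_workers: int = 8,
) -> List[T]:
    """Run index_one over every URI using a pool of worker threads.

    Results come back in the same order as the input URIs. An exception
    raised by index_one for any URI is re-raised here.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(index_one, uris))
```

A single-process pool like this does not give the failure recovery or cross-machine distribution that Hadoop offers, so it only answers the question for workloads that fit on one machine.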
Serving the web site could be done with just about any system that allows interaction with Solr. The solrsearch JavaScript libraries would allow any system that serves HTML and JavaScript to serve it. The options are expansive: httpd, WordPress, Movable Type, Drupal, ColdFusion.
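Whatever serves the pages, interaction with Solr comes down to an HTTP request against its select handler. The sketch below builds such a request URL with optional facets. The base URL, the core name `vivo`, and the facet field `site` are assumptions for illustration; `q`, `wt`, `rows`, `facet`, and `facet.field` are standard Solr query parameters.

```python
from typing import Sequence
from urllib.parse import urlencode


def solr_select_url(
    base_url: str,
    text: str,
    facet_fields: Sequence[str] = (),
    rows: int = 10,
) -> str:
    """Build a Solr select URL for a text query, optionally with facets."""
    params = [("q", text), ("wt", "json"), ("rows", str(rows))]
    if facet_fields:
        params.append(("facet", "true"))
        # One facet.field parameter per field to facet on.
        params.extend(("facet.field", field) for field in facet_fields)
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical Solr location and facet field.
url = solr_select_url("http://localhost:8983/solr/vivo", "climate change", ["site"])
```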
Rebuild the whole index?
Get a list of modified individuals from each site and only reindex them?