...

Ability to search across multiple VIVO installations. (Please refine this)

  1. What features are desired for the search?
  2. What type of search? 
  3. What is the goal of the search?
    1. Full text?  - yes
    2. "semantic"?  - future
    3. Faceted? - yes
    4. Complex queries? - future
    5. For people? - yes
    6. For publications, organizations, etc.? - yes, but needs further refinement

Approaches

...

Build an index that supports the desired types of search, and provide a web site that lets users query that index. Keep the index up to date.

...

Approach to building the index

...

  1. For each institution
    1. Get a list of all URIs of interest for that institution
    2. For each URI
      1. Get the linked data RDF for the URI
      2. Build a document using that data
      3. Add the document to the Solr index
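
The build loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and parameter names (build_index, list_uris, fetch_rdf, build_document, add_to_index) are hypothetical and not part of VIVO or Solr. Each step is passed in as a callable so that the open questions (how to list institutional URIs, how much data to gather per resource) stay pluggable.

```python
from typing import Callable, Dict, Iterable

# A search document is modeled as a plain dict of field name -> value (an assumption).
Document = Dict[str, object]


def build_index(
    institutions: Iterable[str],
    list_uris: Callable[[str], Iterable[str]],
    fetch_rdf: Callable[[str], str],
    build_document: Callable[[str, str, str], Document],
    add_to_index: Callable[[Document], None],
) -> int:
    """Index one document per URI of interest, institution by institution.

    All four callables are hypothetical stand-ins for the real steps.
    Returns the number of documents handed to the index.
    """
    indexed = 0
    for institution in institutions:
        for uri in list_uris(institution):
            rdf = fetch_rdf(uri)  # linked data RDF for the URI
            document = build_document(institution, uri, rdf)
            add_to_index(document)  # e.g. an update request to Solr
            indexed += 1
    return indexed
```

For a dry run, add_to_index can simply be the append method of a list, and fetch_rdf can return canned RDF, so the loop can be exercised without touching any VIVO site or Solr instance.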

Notes

  • Same as what is currently in place
  • Current VIVO does not have a direct way to get institutional URIs
  • VIVO used to get RDF for each URI, then make subsequent requests as needed
    • Can investigate new approaches
  • Policy questions
    • How much data do we want to get from each resource (e.g. people)?
    • This is the kind of thing that needs to be asked of the institutions
    • Suggestion to collect these tasks in a spreadsheet
      • Include time estimates and outstanding questions
    • How to determine when external resources have changed?

Approach to keeping the index up-to-date

          ...

          ...

  1. For each institution
    1. Get a list of URIs that have been modified
    2. For each URI
      1. Calculate what individuals are affected by this modification
      2. Add to update list
    3. For each URI in update list
      1. Get the linked data RDF for the URI
      2. Build a document using that data
      3. Add the document to the Solr index
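
The "calculate affected individuals, add to update list" step could look roughly like the sketch below. The names collect_update_list and affected_by are hypothetical, and de-duplicating the update list is an assumption of this sketch rather than a decision recorded in these notes. How affected_by works (for example, a changed publication affecting its authors) is exactly the part that still needs design.

```python
from typing import Callable, Iterable, List, Set


def collect_update_list(
    modified_uris: Iterable[str],
    affected_by: Callable[[str], Iterable[str]],
) -> List[str]:
    """Expand modified URIs into the list of individuals to re-index.

    affected_by is a hypothetical callable returning every individual whose
    search document depends on the given URI. Each individual appears once,
    in first-seen order (de-duplication is an assumption of this sketch).
    """
    seen: Set[str] = set()
    update_list: List[str] = []
    for uri in modified_uris:
        for individual in affected_by(uri):
            if individual not in seen:
                seen.add(individual)
                update_list.append(individual)
    return update_list


if __name__ == "__main__":
    # Made-up dependency data: a publication affects itself and its authors.
    links = {
        "pub1": ["pub1", "person1", "person2"],
        "person2": ["person2"],
    }
    print(collect_update_list(["pub1", "person2"], lambda uri: links.get(uri, [uri])))
```

The resulting update list can then be fed through the same fetch, build, and add steps used for the initial build, which matches the hope that updating is the same process with different input.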

Notes

  • Hope is that the approach is the same as building the index, with different input

Alternatives

TODO: what other approaches are there?

Technology Choices

  1. Some parts of the technology stack are suggested by the goal of indexing data from VIVO.
    • Using HTTP requests for RDF to gather data from the sites is the most direct approach.
    • Most other options for gathering data from the VIVO sites would need additional coding.
  2. In general we would go with Solr for the search index because we have experience with it, and because of its documentation, its distributed features, and its maturity.
  3. As of 2012 vivosearch.org uses Drupal and solrsearch JavaScript libraries.
    • This choice could be revisited for the multi-site VIVO search project.
  4. In order to scale the process out, we were planning to use Hadoop to manage parallel tasks.
    • Many approaches to the problem of indexing linked data from VIVO sites would be embarrassingly parallel.
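
As a rough illustration of the first and last points, the sketch below requests RDF over plain HTTP and fetches many URIs independently of each other. It assumes the VIVO site honors content negotiation for an Accept header of application/rdf+xml. The helper names (rdf_request, fetch_rdf, fetch_all) are hypothetical, and a local thread pool is used only as a small-scale stand-in for the parallelism that Hadoop would manage. It is not a Hadoop job.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable
from urllib.request import Request, urlopen


def rdf_request(uri: str) -> Request:
    """Build a GET request that asks the server for RDF/XML for this URI."""
    return Request(uri, headers={"Accept": "application/rdf+xml"})


def fetch_rdf(uri: str, timeout: float = 30.0) -> bytes:
    """Fetch the linked data RDF for one URI (performs a network call)."""
    with urlopen(rdf_request(uri), timeout=timeout) as response:
        return response.read()


def fetch_all(
    uris: Iterable[str],
    fetch: Callable[[str], bytes] = fetch_rdf,
    workers: int = 8,
) -> Dict[str, bytes]:
    """Fetch every URI independently; no fetch depends on another's result.

    That independence is what makes the task embarrassingly parallel. The
    thread pool here is a stand-in, not the Hadoop setup under discussion.
    """
    uri_list = list(uris)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as uri_list.
        return dict(zip(uri_list, pool.map(fetch, uri_list)))
```

Because fetch is a parameter, the parallel part can be tried with a fake fetch function before pointing it at any real site.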

Notes

  • HTTP for retrieving RDF, yes
  • What is the adoption of SPARQL in the community?
  • It may be nice to demonstrate that a SPARQL endpoint is not needed to enable interesting results
  • Solr seems reasonable for now
    • Considering having Solr in one place versus distributed Solr (master/slaves)
  • Web interface: Drupal with solrsearch.js
    • Most work is on the client side with JavaScript
    • This continues to be appealing
    • We have limited insight into this component
    • Suggestion to create a list of default technologies, criteria, and alternatives
  • Hadoop is currently a reasonable choice
  • Ruby (Blacklight/Hydra) or Drupal?
    • The JavaScript pattern allows for minimal reliance on Drupal
  • Need a mock-up of the UI to inform the design of the Solr index
  • Bootstrap is an interesting front-end framework to consider
  • Drupal upgrade cycle can be onerous


Technology Alternatives

We could use index software other than Solr. What would that be? A database server with full-text capabilities? What are other options? Are there NoSQL options with full-text search?

                      ...