Deprecated. This material represents early efforts and may be of interest to historians. It does not describe current VIVO efforts.

Attendees

Jonathan Markow
Andrew Woods
Jonathan Corson-Rikert
Brian Caruso
Brian Lowe
Jim Blake

Agenda

  1. Understand how we move forward
    • Important for new sponsorship
  2. Options
    • Are we interested in hosting?
    • Are we interested in development?

Discussion

Existing tooling

  1. LinkedIndexBuilder
    • Crawls linked data across VIVO instances

DuraSpace

  1. We see cool opportunities for VIVOSearch
  2. Division of labor
    • Existing VIVOSearch team does development on the VIVO side
    • DuraSpace will host and do system administration
  3. Longer term
    • Interested in greater inter-operation
  4. View collaboration as team effort

VIVOSearch technology

  1. Built software to create a Solr index from distributed data
  2. Built webapp over the index
  3. Harvesting was never brought to production state
  4. Initial plans for Hadoop
    • They did not then have the Hadoop experience
  5. Went with Scala
    • Have some regrets with Scala implementation, especially the Actor framework
    • When resuming work, would go in the direction of Hadoop; the team is now working with 7 million library catalog records at Cornell on Cornell Red Cloud virtual machines, using Eucalyptus
  6. Development options
    • Try to continue with the Scala-Actor code (not recommended)
    • Or, take the useful parts (Java code) and move towards Hadoop

Approach

  1. Take a high-level, broad view
  2. Goal: have a VIVOSearch app for range of institutions
  3. Come up with comprehensive list of questions
    • What is the optimal platform?
    • What should the production app do?
    • How many institutions?
    • What do we need to run it?
    • What do we need to support it?
  4. Once we start to answer questions
    • We can start to come up with tasks

System parts

  1. Drupal website
  2. Solr index
    • Vanilla Tomcat with Solr
  3. Solr is currently populated with Scala-Actor code
  4. Components interact via HTTP
  5. Drupal frontend is a bit unknown
    • Others (Nick Cappadona, Miles Worthington) worked on it; not on the project but still available to consult
    • Was considered beta
    • Probably in Drupal 6
    • Cannot stretch the UI to gracefully handle 200 schools
    • Does follow responsive design principles, but predates ready-made JavaScript/CSS libraries such as Bootstrap/Sass
  6. Earlier conversations around using a js-solr library
    • Do not recall the name of the library
  7. Question: is current state in production state or beta?
    • Mostly beta, except Solr index
    • No reason to stick with Drupal, except that the current site and the Drupal Solr module (and some local extensions for this purpose) exist
    • The current harvesting/indexing backend is considered a dead end
  8. Indexing approach will be similar to that used for a Cornell Library project building a new catalog search interface
    • Take MARC records from the catalog, run them through a conversion process to RDF, and then create a Solr index for Blacklight
  9. Should we be considering Nutch?
    • We need to consider how much load we put on institution VIVOs in indexing
  10. Should we do complete harvests every time, or develop an incremental harvesting capability?
    • It is not trivial to identify what has changed, since one VIVO "page" brings in data from sometimes hundreds of related entities, including other people, publications, events, etc.
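The harvesting/indexing approach above can be sketched minimally: a linked-data crawl of a profile page yields RDF triples, which are flattened into a single Solr document. This is an illustrative sketch only; the predicate URIs and Solr field names in `FIELD_MAP` are assumptions, not the actual VIVOSearch schema.

```python
# Hypothetical sketch: flattening linked-data triples for one person into a
# Solr-style JSON document. Predicate-to-field mapping is an illustrative
# assumption, not the actual VIVOSearch schema.
import json

FOAF = "http://xmlns.com/foaf/0.1/"
VIVO = "http://vivoweb.org/ontology/core#"

# Assumed mapping from RDF predicates to flat Solr field names.
FIELD_MAP = {
    FOAF + "name": "name_text",
    VIVO + "preferredTitle": "title_text",
    VIVO + "overview": "overview_text",
}

def triples_to_solr_doc(uri, triples):
    """Collect the literal values attached to `uri` into one flat document.

    `triples` is an iterable of (subject, predicate, object) strings, the
    shape a linked-data harvest of a single VIVO profile page might yield.
    Multi-valued predicates accumulate into lists, matching Solr
    multiValued fields.
    """
    doc = {"id": uri}
    for s, p, o in triples:
        field = FIELD_MAP.get(p)
        if s != uri or field is None:
            continue  # skip unmapped predicates and other subjects
        doc.setdefault(field, []).append(o)
    return doc

triples = [
    ("http://vivo.example.edu/individual/n123", FOAF + "name", "Jane Doe"),
    ("http://vivo.example.edu/individual/n123", VIVO + "preferredTitle", "Professor"),
    ("http://vivo.example.edu/individual/n999", FOAF + "name", "Someone Else"),
]
doc = triples_to_solr_doc("http://vivo.example.edu/individual/n123", triples)
print(json.dumps(doc, indent=2))
```

In practice the crawl would also have to follow links to related entities (publications, events), which is what makes incremental harvesting hard: a change in any related entity can invalidate a previously built document.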

Scenarios

  1. Replicate VIVOSearch.org in DuraSpace infrastructure
  2. Some institutions would like to pick up app and run locally, or might run their own Amazon instances, or might want to take advantage of DuraSpace services for hosting a private index and search landing site
  3. Key selling point: agnostic to the software that produces the RDF
    1. Currently proven to work with VIVO and Harvard Profiles
    2. Elsevier Scival also claims compatibility – Northwestern University will be the test for that
    3. Iowa has a home-grown conversion to VIVO and a capable researcher active in the CTSA researcher networking community (Dave Eichmann)
    4. But it's important that any site can participate simply by putting VIVO-compatible RDF in a web-accessible directory – a low barrier to entry
  4. Would like to validate user RDF in the future
    1. likely an on-boarding process – once we have indexed a site once, it's likely to go smoothly thereafter

Demo

  1. Performed brief demo
  2. The UI has gone through a fair amount of rigorous review

Pilot group

  1. CTSAs (the ~60 NIH-funded Clinical and Translational Science Awards)
    • They are committed by majority vote of the principal investigators to doing researcher networking and search using the VIVO ontology
    • They (or a subset ready to act soon) could serve as a good pilot group
  2. University of Colorado wants to use for VIVO on 2 campuses (Boulder and Colorado Springs) plus the medical campus in Denver, which has Harvard Profiles
    1. They are DuraSpace VIVO sponsors (Alex Viggio is the VIVO implementation lead) and are willing to contribute a developer (Stephen Williams)
    2. They have ties to several federal labs in the Denver/Golden/Boulder area (NIST, NREL, NCAR) and to UCAR, a consortium of ~100 universities in atmospheric research and administrator of NCAR
      1. UCAR is adopting VIVO to track virtual organizations developed around scientific campaigns
  3. Existing partners may be willing to participate without extensive preconditions or delay
    • VIVO
      • UF, Cornell, Weill Cornell, Colorado, Duke, Brown, Stony Brook, Indiana, Scripps, USDA, APA, and likely several others
    • Harvard Profiles
      • Harvard, UCSF, and likely a couple others (Wake Forest?)
    • Scival Experts
      • Northwestern, Oregon Health & Science University
    • Loki
      • Iowa
    • Digital Vita (not sure how far Titus Schleyer has progressed with exporting VIVO-ready data)
      • Pittsburgh
    • Others
      • Stanford, Toronto, UCLA
  4. CTSAs will want a process to specify requirements/needs
    • Could turn into a bureaucratic/somewhat political process
  5. It may be more practical working with institutions with which we have existing relationships

Key to success: no client effort

  1. How much work needs to happen on the client's side?
    • Indexing is via linked-data requests to VIVO sites
    • There is no client work required if they can produce valid RDF conforming to the VIVO ontology
  2. Valid RDF?
    • VIVO sites do not always provide valid RDF – may misunderstand the structure, be missing key relationships, or substitute local extensions in ways that don't roll up to the core VIVO ontology
    • Potentially some parts of the graph can be ignored
    • Client may or may not need to clean up their RDF
  3. Goal: require no work on client side
  4. Some schools do not have VIVO, but want to create an "RDF export tool"
    • 1 or 2 schools fall in this category so far
    • Toronto, UCLA
  5. Running "locally"
    • May mean running in AWS for a specific institution
    • May mean running on local servers

Business proposition

  1. Bringing in more institutions to increase researcher visibility
  2. Make app/search internationally available – e.g., for Melbourne it would be a plus for their researchers to be searchable alongside those from major North American and European universities

Scala

  1. Only one developer who knows it
  2. Actor library is no fun
  3. Code that does RDF to Solr document is reused Java code
  4. Scala was supposed to help with multi-threaded processing
    • Not a lot of cooperation between processes
  5. Errors tend to be cryptic
  6. Existing code was developed quickly, not for quality

Next Steps

  1. Several moving pieces
  2. Document questions that need to be answered
  3. Document issues
  4. Document activities
  5. Look ahead to some of what we can offer with the data beyond search
    1. This will be more speculative – setting up a separate VIVOPLAN wiki
  6. Together, this will give a clearer idea of the scope of the project
  7. Need to also determine how to turn it into a viable "business"
  8. Some discussions are already happening in the wiki
    • Suggestion to put documents in wiki
  9. Leverage JIRA?
  10. Create wiki space for VIVOSearch
    • Use VIVO crowd permissions
    • Grant admin rights to: j2blake
  11. Next call: week after next
  12. Need to get a rough notion of timeframe and institution cost
  13. Would like to have something at VIVO conference (Aug)
  14. 12-1pm Tues Jan 29th
    • Jon to send out Web-Ex invite