Attendees
Jonathan Markow
Andrew Woods
Jonathan Corson-Rikert
Brian Caruso
Brian Lowe
Jim Blake
Agenda
- Understand how we move forward
- Important for new sponsorship
- Options
- Are we interested in hosting?
- Are we interested in development?
Discussion
- LinkedIndexBuilder
- Crawls linked data across VIVO instances
DuraSpace
- We see cool opportunities for VIVOSearch
- Division of labor
- Existing VIVOSearch team does development on the VIVO side
- DuraSpace will host and do system administration
- Longer term
- Interested in greater inter-operation
- View collaboration as team effort
VIVOSearch technology
- Built software to create a Solr index from distributed data
- Built webapp over the index
- Harvesting was never brought to production state
- Initial plans for Hadoop
- The team did not have Hadoop experience at the time
- Went with Scala
- Have some regrets with Scala implementation, especially the Actor framework
- When resuming work, would go in the direction of Hadoop; the team is now working with Hadoop on 7 million library catalog records at Cornell, using Cornell Red Cloud virtual machines running Eucalyptus
- Development options
- Try to continue with the Scala-Actor code (not recommended)
- Or, take the useful parts (Java code) and move towards Hadoop
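The reusable core mentioned above is the code that turns an entity's RDF into a Solr document. As an illustration only (the project's actual code is Java; the predicate URIs and Solr field names below are simplified assumptions, not the real mapping), the flattening step might look like:

```python
# Illustrative sketch, NOT the project's Java code: flatten a VIVO-style
# RDF description of one individual into a flat dict suitable for
# posting to Solr's JSON update handler. Predicates and field names
# are assumptions for the example.

def rdf_to_solr_doc(uri, triples):
    """Collect triples whose subject is `uri` into one Solr document."""
    field_map = {  # assumed predicate -> Solr field mapping
        "rdfs:label": "name_t",
        "vivo:overview": "overview_t",
        "vivo:preferredTitle": "title_t",
    }
    doc = {"id": uri}
    for subject, predicate, obj in triples:
        if subject != uri:
            continue  # a real harvester would follow related entities too
        field = field_map.get(predicate)
        if field:
            doc.setdefault(field, []).append(obj)
    return doc

triples = [
    ("http://vivo.example.edu/individual/n123", "rdfs:label", "Jane Smith"),
    ("http://vivo.example.edu/individual/n123", "vivo:preferredTitle", "Professor"),
    ("http://vivo.example.edu/individual/n999", "rdfs:label", "Someone Else"),
]
doc = rdf_to_solr_doc("http://vivo.example.edu/individual/n123", triples)
```

Because this step is a pure record-by-record transformation, it ports naturally to a Hadoop map task, which is part of the appeal of the Hadoop direction.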
Approach
- Take a high-level, broad view
- Goal: have a VIVOSearch app for range of institutions
- Come up with comprehensive list of questions
- What is the optimal platform?
- What should the production app do?
- How many institutions?
- What do we need to run it?
- What do we need to support it?
- Once we start to answer questions
- We can start to come up with tasks
System parts
- Drupal website
- Solr index
- Solr is currently populated with Scala-Actor code
- Components interact via HTTP
- Drupal frontend is a bit unknown
- Others (Nick Cappadona, Miles Worthington) worked on it; not on the project but still available to consult
- Was considered beta
- Probably in Drupal 6
- Cannot stretch the UI to gracefully handle 200 schools
- Does follow responsive design principles, but predates ready-made JavaScript/CSS libraries such as Bootstrap/Sass
- Earlier conversations around using a js-solr library
- Do not recall the name of the library
- Question: is current state in production state or beta?
- Mostly beta, except Solr index
- No reason to stick with Drupal, except that the current site and the Drupal Solr module (and some local extensions for this purpose) exist
- The current harvesting/indexing backend is considered a dead end
- Indexing approach will be similar to that used for a Cornell Library project building a new catalog search interface
- Take MARC records from the catalog, run them through a conversion process to RDF, and then create a Solr index for Blacklight
- Should we be considering Nutch?
- We need to consider how much load indexing puts on institutions' VIVO instances
- Should we do complete harvests every time, or develop an incremental harvesting capability?
- It is not trivial to identify what has changed, since one VIVO "page" brings in data from sometimes hundreds of related entities, including other people, publications, events, etc.
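One possible bookkeeping scheme for incremental harvesting (an assumption for discussion, not a decided design): store a digest of each entity's fully expanded RDF serialization from the previous harvest, and re-index only URIs whose digest changed. The caveat from the discussion applies: the serialization being hashed must already include the related entities that feed into a page, or their changes will be missed.

```python
# Sketch of incremental-harvest change detection (assumed approach):
# hash each entity's expanded RDF text and re-index only new or
# changed URIs.
import hashlib

def digest(rdf_text):
    return hashlib.sha256(rdf_text.encode("utf-8")).hexdigest()

def changed_uris(previous_digests, current_serializations):
    """Return URIs that are new or whose expanded RDF changed."""
    changed = []
    for uri, rdf_text in current_serializations.items():
        if previous_digests.get(uri) != digest(rdf_text):
            changed.append(uri)
    return changed

previous = {"ex:n1": digest("<ex:n1> a vivo:Person ."),
            "ex:n2": digest("<ex:n2> a vivo:Person .")}
current = {"ex:n1": "<ex:n1> a vivo:Person .",         # unchanged
           "ex:n2": "<ex:n2> a vivo:FacultyMember .",  # changed
           "ex:n3": "<ex:n3> a vivo:Person ."}         # new
to_reindex = changed_uris(previous, current)
```

This trades storage (one digest per URI) for harvest bandwidth, which also helps with the load concern above: unchanged entities are fetched but never re-indexed, and with HTTP conditional requests some fetches could be skipped entirely.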
Scenarios
- Replicate VIVOSearch.org in DuraSpace infrastructure
- Some institutions would like to pick up app and run locally, or might run their own Amazon instances, or might want to take advantage of DuraSpace services for hosting a private index and search landing site
- Key selling point: agnostic to the software that produces the RDF
- Currently proven to work with VIVO and Harvard Profiles
- Elsevier SciVal also claims compatibility – Northwestern University will be the test for that
- Iowa has a home-grown conversion to VIVO and a capable researcher active in the CTSA researcher networking community (Dave Eichmann)
- It's important that any site can participate simply by putting VIVO-compatible RDF in a web-accessible directory – a low barrier to entry
- Would like to validate user RDF in the future
- Likely an on-boarding process – once a site has been indexed successfully, subsequent harvests are likely to go smoothly
Demo
- Performed brief demo
- The UI has been through a fair amount of rigorous design work
Pilot group
- CTSAs (the ~60 NIH-funded Clinical and Translational Science Awards)
- They are committed by majority vote of the principal investigators to doing researcher networking and search using the VIVO ontology
- They (or a subset ready to act soon) could serve as a good pilot group
- University of Colorado wants to use it for VIVO on two campuses (Boulder and Colorado Springs), plus the medical campus in Denver, which runs Harvard Profiles
- They are DuraSpace VIVO sponsors (Alex Viggio is the VIVO implementation lead) and are willing to contribute a developer (Stephen Williams)
- They have ties to several federal labs in the Denver/Golden/Boulder area (NIST, NREL, NCAR) and to UCAR, a consortium of ~100 universities in atmospheric research and administrator of NCAR
- UCAR is adopting VIVO to track virtual organizations developed around scientific campaigns
- Existing partners may be willing to participate without extensive preconditions or delay
- VIVO
- UF, Cornell, Weill Cornell, Colorado, Duke, Brown, Stony Brook, Indiana, Scripps, USDA, APA, and likely several others
- Harvard Profiles
- Harvard, UCSF, and likely a couple others (Wake Forest?)
- SciVal Experts
- Northwestern, Oregon Health & Science University
- Loki
- Digital Vita (not sure how far Titus Schleyer has progressed with exporting VIVO-ready data)
- Others
- CTSAs will want a process to specify requirements/needs
- Could turn into a bureaucratic/somewhat political process
- It may be more practical to work with institutions with which we have existing relationships
Key to success: no client effort
- How much work needs to happen on the client's side
- Indexing is via linked-data requests to VIVO sites
- There is no client work required if they can produce valid RDF conforming to the VIVO ontology
- Valid RDF?
- VIVO sites do not always provide valid RDF – may misunderstand the structure, be missing key relationships, or substitute local extensions in ways that don't roll up to the core VIVO ontology
- Potentially some parts of the graph can be ignored
- Client may or may not need to clean up their RDF
- Goal: require no work on client side
- Some schools do not have VIVO, but want to create an "RDF export tool"
- 1 or 2 schools fall in this category so far
- Toronto, UCLA
- Running "locally"
- May mean running in AWS for a specific institution
- May mean running on local servers
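The "valid RDF" concern above suggests a lightweight on-boarding check. A minimal sketch (the required predicates below are an assumed minimum for illustration, not an agreed specification): before accepting a site's data, verify that each harvested individual carries the properties the index depends on, and report what is missing.

```python
# Hedged sketch of a minimal on-boarding validation (assumed checks):
# flag individuals that lack the predicates the search index needs.

REQUIRED_PREDICATES = {"rdf:type", "rdfs:label"}  # assumed minimum

def validate_individuals(triples):
    """Return {uri: set of missing predicates} for failing individuals."""
    seen = {}
    for subject, predicate, obj in triples:
        seen.setdefault(subject, set()).add(predicate)
    return {uri: REQUIRED_PREDICATES - preds
            for uri, preds in seen.items()
            if REQUIRED_PREDICATES - preds}

triples = [
    ("ex:n1", "rdf:type", "vivo:FacultyMember"),
    ("ex:n1", "rdfs:label", "Jane Smith"),
    ("ex:n2", "rdf:type", "vivo:FacultyMember"),  # no label -> flagged
]
problems = validate_individuals(triples)
```

A report like this would support the "no client effort" goal: rather than silently dropping malformed data, the harvester could tell a site exactly which individuals need attention, and parts of the graph that fail could still be ignored rather than blocking the whole site.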
Business proposition
- Bringing in more institutions to increase researcher visibility
- Make app/search internationally available – e.g., for Melbourne it would be a plus for their researchers to be searchable alongside those from major North American and European universities
Scala
- Only one developer who knows it
- Actor library is no fun
- The code that converts RDF to Solr documents is reused Java code
- Scala was supposed to help with multi-threaded processing
- Not a lot of cooperation between processes
- Errors tend to be cryptic
- Existing code was developed quickly, not for quality
Next Steps
- Several moving pieces
- Document questions that need to be answered
- Document issues
- Document activities
- Look ahead to some of what we can offer with the data beyond search
- This will be more speculative – setting up a separate VIVOPLAN wiki
- Together, this will give a clearer idea of the scope of the project
- Need to also determine how to turn it into a viable "business"
- Some discussions are already happening in the wiki
- Suggestion to put documents in wiki
- Leverage JIRA?
- Create wiki space for VIVOSearch
- Use VIVO crowd permissions
- Grant admin rights to j2blake?
- Next call: week after next
- Need to get a rough notion of timeframe and cost per institution
- Would like to have something at VIVO conference (Aug)
- 12-1pm Tues Jan 29th
- Jon to send out Web-Ex invite