2013-01-16 - VIVO Search Strategy Mtg

Attendees

Jonathan Markow
Andrew Woods
Jonathan Corson-Rikert
Brian Caruso
Brian Lowe
Jim Blake

We see cool opportunities for VIVOSearch
Division of labor
- Existing VIVO Search team does development on the VIVO side
- Duraspace will host and do system administration
Longer term
- Interested in greater inter-operation
View collaboration as team effort

Built software to create a Solr index
Built webapp over the index
Harvesting was never brought to production state
Initial plans for Hadoop
- They did not have the Hadoop experience
Went with Scala
- Have some regrets with Scala implementation
- If they could do it again, would go in the direction of Hadoop
Development options
- Try to continue with the Scala-Actor code (not recommended)
- Or, take the useful parts and move towards Hadoop

Drupal website
Solr index
- Vanilla Tomcat with Solr
Solr is populated with Scala-Actor code
Components interact via HTTP
Drupal frontend is a bit unknown
- Others worked on it
- Was considered beta
- Probably in Drupal 6
- Cannot stretch it to 200 schools
Earlier conversations around using a js-solr library
- Do not recall the name of the library
Question: is current state in production state or beta?
- Mostly beta, except Solr index
- No reason to stick with Drupal, except that it exists
- Indexing backend is considered a dead end
RDF cataloging project
- Take MARC records in the catalog and convert them to RDF
Should we be considering Nutch?
We need to consider how much load we put on institution VIVOs in indexing
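One way to bound the load placed on each institution's VIVO is to enforce a minimum delay between requests to the same host. The following is a minimal sketch under stated assumptions, not the approach the team chose: the class name `PerHostThrottle`, the interval value, and the example hostnames are all hypothetical.

```python
import time
from urllib.parse import urlparse


class PerHostThrottle:
    """Enforce a minimum interval between requests to the same host.

    Hypothetical helper for illustration; clock and sleep are injectable
    so the behavior can be checked without real waiting.
    """

    def __init__(self, min_interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval_seconds
        self.clock = clock
        self.sleep = sleep
        self.last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Block until the host in `url` may be contacted; return seconds waited."""
        host = urlparse(url).netloc
        waited = 0.0
        last = self.last_request.get(host)
        if last is not None:
            remaining = self.min_interval - (self.clock() - last)
            if remaining > 0:
                self.sleep(remaining)
                waited = remaining
        self.last_request[host] = self.clock()
        return waited
```

A harvester would call `throttle.wait(url)` immediately before each request, so that requests to different institutions proceed independently while each single site sees a bounded request rate.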

CPSA
- They could serve as a good pilot group
Existing partners
- UF
- Cornell
- Colorado
- Duke
- Brown
CPSA could specify requirements/needs
- Could turn into a bureaucratic/political log-jam
It may be more practical working with institutions with which we have existing relationships

How much work needs to happen on the client's side?
- Indexing is via linked-data requests to VIVO sites
- There is no client work required
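As an illustration of why no client work is needed, a linked-data request is an ordinary HTTP GET on an individual's URI that asks for RDF through content negotiation. The sketch below only builds the request and does not send it; the function name, the example URI, and the choice of `application/rdf+xml` as the requested format are assumptions, not details from the meeting.

```python
import urllib.request


def build_linked_data_request(individual_uri, rdf_format="application/rdf+xml"):
    """Build (but do not send) a GET request asking for RDF instead of HTML."""
    return urllib.request.Request(
        individual_uri,
        headers={"Accept": rdf_format},  # content negotiation selects the RDF view
        method="GET",
    )


# Hypothetical individual URI on a hypothetical VIVO site.
request = build_linked_data_request("http://vivo.example.edu/individual/n1234")
```

Passing `request` to `urllib.request.urlopen` would perform the actual fetch against a live site.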
Valid RDF?
- VIVO sites do not always provide valid RDF
- Potentially some parts of the graph can be ignored
- Client may or may not need to clean up their RDF
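The idea of ignoring some parts of the graph can be sketched as lenient, line-by-line parsing that keeps well-formed statements and records the rest for later review. This sketch assumes the RDF is available as N-Triples and uses a deliberately simplified pattern (URIs and plain string literals only; literals with language tags or datatypes are rejected), so it is not a conforming N-Triples parser. The function name is hypothetical.

```python
import re

# Simplified statement pattern: <subject> <predicate> (<object> | "literal") .
TRIPLE = re.compile(
    r'^<([^<>\s]+)>\s+<([^<>\s]+)>\s+'
    r'(?:<([^<>\s]+)>|"((?:[^"\\]|\\.)*)")\s*\.\s*$'
)


def parse_leniently(lines):
    """Return (triples, rejected_line_numbers), skipping malformed lines."""
    triples, rejected = [], []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text or text.startswith("#"):
            continue  # blank lines and comments carry no statements
        match = TRIPLE.match(text)
        if match is None:
            rejected.append(number)  # remember the bad line, keep going
            continue
        subject, predicate, uri_object, literal = match.groups()
        value = uri_object if uri_object is not None else literal
        triples.append((subject, predicate, value))
    return triples, rejected
```

Reporting the rejected line numbers back to a site would let it decide whether cleanup is worthwhile, without blocking indexing of the valid statements.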
Goal: require no work on client side
Some schools do not have VIVO, but want to create "RDF export tool"
- 1 or 2 schools fall in this category
- Toronto
Running "locally"
- May mean running in AWS for a specific institution
- May mean running on local servers

Only one developer who knows it
Actor library is no fun
Code that does RDF to Solr document is reused Java code
Scala was supposed to help with multi-threaded processing
- Not a lot of cooperation between processes
Errors tend to be cryptic
Existing code was developed quickly, not for quality
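For readers unfamiliar with the RDF-to-Solr-document step mentioned above, the following minimal sketch shows the general shape of such a mapping: statements about one subject are grouped into a single document with multi-valued fields. It is not the reused Java code. The field names `name` and `type` and the predicate-to-field table are hypothetical; only the two W3C predicate URIs are standard.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical mapping from RDF predicates to Solr field names.
FIELD_FOR_PREDICATE = {RDFS_LABEL: "name", RDF_TYPE: "type"}


def triples_to_solr_docs(triples):
    """Group (subject, predicate, object) triples into one dict per subject."""
    docs = {}
    for subject, predicate, obj in triples:
        field = FIELD_FOR_PREDICATE.get(predicate)
        if field is None:
            continue  # predicates without a mapping are not indexed
        doc = docs.setdefault(subject, {"id": subject})
        doc.setdefault(field, []).append(obj)  # fields are multi-valued lists
    return list(docs.values())
```

Each resulting dictionary has the subject URI as its `id`, which is the shape a Solr update request expects for a document, although the actual schema would determine the real field names.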

Several moving pieces
Document questions that need to be answered
Document issues
Document activities
Together, this will give a clearer idea of the scope of the project
Need to also determine how to turn it into a viable "business"
Some discussions are already happening in the wiki
- Suggestion to put documents in wiki
Leverage JIRA?
Create wiki space for VIVOSearch
- Use VIVO crowd permissions
- Grant admin rights to: j2blake?
Next call: week after next
Need to get a rough notion of timeframe and institution cost
Would like to have something at VIVO conference (Aug)
12-1pm Tues Jan 29th
- Jon to send out WebEx invite