Deprecated. This material represents early efforts and may be of interest to historians. It does not describe current VIVO efforts.
Task Name | Time Est. (hours) | % Done | Assignee | Link to section |
---|---|---|---|---|
Setup Development Environment | 8 | 0 | | 1 |
Get URIs for Institution | 32 | 0 | | 2 |
Get data for an individual URI | 32 | 0 | | 3 |
Mockup of Search UI | 24 | 0 | | 4 |
Create Solr Doc from data for URI | 40 | 0 | | 5 |
Working search UI prototype | 40 | 0 | | 6 |
Basic multi-node Hadoop cluster on IaaS | 40 | 0 | | 7 |
Automated and scripted cluster on IaaS | 40 | 0 | | 8 |
Data validation code for Institution's data | 80 | 0 | | 9 |
Update system | 40 | 0 | | 10 |
Git repository. How will this relate to the old project at https://github.com/vivo-project/Linked-Data-Indexer ? Should it be a fork, a branch, or something else?
Document single node Hadoop setup.
Development Solr service setup. Must use Solr 4.x or greater; there have been huge improvements in Solr/Lucene going from 3.x to 4.x. I've encountered systems where setting up Solr is a chore because the instructions don't make clear which version of Solr to use and which additional libraries to add. I suggest one of the following: 1) make the instructions very clear about which version of Solr to use, OR 2) automate the setup by downloading Solr from a URL and copying the configuration files to the correct locations in the Solr home directory.
Ant/Ivy build script.
Wiki/git README documentation.
There is code to parse Catalyst pages to URIs (CatalystPageToURIs.java) and to parse the JSON from VIVO (ParseDataServiceJson.java). There is code to do the discovery of URIs for Catalyst and VIVO in LinkedDataIndexer/src/main/scala/edu/cornell/indexbuilder/discovery, in VivoUriDiscoveryWorker.scala and CatalystDiscoveryWorker.scala. These files could be used as examples, but they depend heavily on the Akka framework, which we'd like to move away from.
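The discovery step can be sketched without Akka using only the JDK. This is a minimal illustration, not the prototype's parser: the `{"uri": "..."}` JSON shape and the class name `UriDiscovery` are assumptions made here for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Minimal sketch of discovering individual URIs from a VIVO-style
 *  JSON listing. The {"uri": "..."} field shape is an assumption for
 *  illustration; the prototype's real parsing is in the classes named
 *  above and would replace this regex approach. */
public class UriDiscovery {
    private static final Pattern URI_FIELD =
        Pattern.compile("\"uri\"\\s*:\\s*\"([^\"]+)\"");

    /** Pull every "uri" field value out of a JSON listing string. */
    public static List<String> extractUris(String json) {
        List<String> uris = new ArrayList<>();
        Matcher m = URI_FIELD.matcher(json);
        while (m.find()) {
            uris.add(m.group(1));
        }
        return uris;
    }

    public static void main(String[] args) {
        String sample =
            "{\"individuals\":[{\"uri\":\"http://vivo.example.edu/individual/n123\"},"
          + "{\"uri\":\"http://vivo.example.edu/individual/n456\"}]}";
        System.out.println(extractUris(sample));
    }
}
```

A plain worker like this, fed page by page, is one way to keep the discovery logic testable independent of any actor framework.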
See UrisForDataExpansion.java for an example of how this was done in the prototype.
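Getting data for a single URI comes down to an HTTP request with content negotiation for RDF. A minimal sketch of just the request setup follows; the class and method names are invented for this example, and the real expansion logic remains in UrisForDataExpansion.java.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

/** Sketch of requesting linked data for a single URI via HTTP content
 *  negotiation (Accept: application/rdf+xml). Only the request setup is
 *  shown; reading and parsing the response body is omitted. */
public class LinkedDataFetcher {

    /** Configure (but do not yet send) a GET asking for RDF/XML. */
    public static HttpURLConnection rdfRequest(String uri) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) new URL(uri).openConnection();
            conn.setRequestMethod("GET");
            conn.setRequestProperty("Accept", "application/rdf+xml");
            return conn;
        } catch (IOException e) {
            throw new RuntimeException("could not prepare request for " + uri, e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection conn =
            rdfRequest("http://vivo.example.edu/individual/n123");
        // Prints the Accept header the request would carry.
        System.out.println(conn.getRequestProperty("Accept"));
    }
}
```

Keeping the fetch this small makes it easy to hand the returned connection to whatever RDF parser the project settles on.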
This depends on the Mockup of the Search UI task, which is needed in order to develop the schema for the Solr index.
SolrDocWorker.scala uses the DocumentModifier from the Vitro code to generate a Solr document from a model for a URI. We may want to reuse this approach. Much of this code is found in LinkedDataIndexer/src/main/java/edu/cornell/mannlib/vitro/webapp/search/solr. There you can find a new translator that works well without the webapp context, MultiSiteIndexToDoc.java, and the new DocumentModifiers that are needed for multi-site indexing.
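For orientation, the end product of this task is an XML add document posted to Solr's /update handler. The stdlib sketch below shows only that output shape; the class name and the `URI`/`nameRaw` field names are placeholders, and the real field population would come from the DocumentModifiers mentioned above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of turning extracted field values into a Solr XML add
 *  document. Field names here are placeholders; the actual schema
 *  depends on the search UI mockup and the DocumentModifiers. */
public class SolrDocSketch {

    /** Render a field map as an &lt;add&gt;&lt;doc&gt; update message. */
    public static String toAddXml(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<field name=\"").append(e.getKey()).append("\">")
              .append(escape(e.getValue()))
              .append("</field>");
        }
        return sb.append("</doc></add>").toString();
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("URI", "http://vivo.example.edu/individual/n123");
        fields.put("nameRaw", "Smith, Jane");
        System.out.println(toAddXml(fields));
    }
}
```

In practice SolrJ's SolrInputDocument would likely replace hand-built XML, but the document structure is the same.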
Make tech decisions about serving search UI and about how the UI client will communicate with the Solr service.
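Whatever serving approach is chosen, the UI client will ultimately issue HTTP queries against Solr's /select handler. A small sketch of building such a query URL, assuming the decision is direct HTTP with JSON responses; the base URL and class name are placeholders, not decided project values.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Sketch of a search UI client building a Solr /select query URL.
 *  Direct client-to-Solr HTTP is one option under discussion, not a
 *  settled decision; a proxy in front of Solr is another. */
public class SolrQueryUrl {

    /** Build a /select URL for a user query, asking for JSON back. */
    public static String selectUrl(String solrBase, String userQuery) {
        return solrBase + "/select?q="
             + URLEncoder.encode(userQuery, StandardCharsets.UTF_8)
             + "&wt=json";
    }

    public static void main(String[] args) {
        System.out.println(
            selectUrl("http://localhost:8983/solr/vivocore", "jane smith"));
    }
}
```

Note that exposing Solr directly to browsers has security implications (e.g. open access to the update handler), which is part of what this tech decision needs to weigh.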
Develop a system to allow updates. This is likely to involve some additional services as part of the VIVO webapp. The multi-site search index builder will need to query the VIVO webapp for a list of URIs that have been updated in a given time frame.
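The polling side of that flow could look like the sketch below. The `listUpdatedUris` endpoint and its `since` parameter are hypothetical: such a service does not exist yet and would have to be added to the VIVO webapp as described above.

```java
import java.time.Instant;

/** Sketch of the update flow: the index builder asks each VIVO site
 *  for URIs changed since a timestamp, then re-indexes only those.
 *  The listUpdatedUris endpoint and "since" parameter are hypothetical
 *  names for the service VIVO would need to expose. */
public class UpdatePoller {

    /** Build the URL for asking a site which URIs changed since a time. */
    public static String updatedUrisUrl(String vivoBase, Instant since) {
        return vivoBase + "/listUpdatedUris?since=" + since.getEpochSecond();
    }

    public static void main(String[] args) {
        System.out.println(
            updatedUrisUrl("http://vivo.example.edu",
                           Instant.parse("2013-01-01T00:00:00Z")));
    }
}
```

Each URI returned would then go back through the fetch-and-index steps above, so incremental updates reuse the full-build pipeline.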