...

What is the ideal deployment environment?

Stable Linux VM running Apache/Drupal or Apache/Ruby (this could be rewritten to use Hydra/Blacklight)

Stable Linux VM hosting the Solr index (could probably initially be on 1 VM)

More dynamically allocated Linux VMs for indexing – Brian C. has worked on dynamic spin-up of instances within the Hadoop framework, but that seems like a later improvement after starting out with a small number of stable VMs that are imaged but started and stopped manually. In time, an on-demand mode of virtual machine usage, as Hadoop can manage, could help control costs. Andrew confirms the cost-saving aspects, though they need to be balanced against high availability and redundancy. It is important to keep in mind the requirements this places on the applications in the VM: can everything that needs to be initialized come up without hand-holding?

After beta phase, move toward a staging server and/or load balancing – the web traffic is unlikely to be more than we could manage in a day-to-day sort of way; it is the indexing that takes a significant amount of CPU time.
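
One way to make the "no hand-holding" requirement concrete is a small readiness check baked into each VM image, so that an instance started manually (or, later, on demand) can verify itself. The sketch below is illustrative only; the hostnames, port, and Solr core name are placeholders, not the project's actual configuration.

    # Minimal readiness-check sketch (assumed hostnames, port, and core name).
    import sys
    import urllib.request

    CHECKS = {
        "front end": "http://search.example.org/",
        "solr ping": "http://solr.example.org:8983/solr/vivosearch/admin/ping",
    }

    def check(name, url, timeout=10):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                ok = resp.status == 200
        except OSError as err:
            print(f"{name}: FAILED ({err})")
            return False
        print(f"{name}: {'OK' if ok else 'unexpected status'}")
        return ok

    if __name__ == "__main__":
        sys.exit(0 if all(check(n, u) for n, u in CHECKS.items()) else 1)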

What is the set of features for the production application?
  • Index all RDF content from remote sites as frequently as every 24 hours, though the beta might update less often (weekly, monthly, or quarterly) – the business model could reflect both the size of the data and the frequency of update (see the scheduling sketch after this list)
    • include email addresses, local institutional identifiers, ORCID iDs, ResearcherIDs, Scopus IDs, and other identifiers that might be useful
    • Jim points out that sites may not want to be hammered by frequent linked data requests
  • Faceted search interface with some ability to deal with scaling in the number of participating sites
  • Ability to adjust relevance ranking
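
As a rough illustration of how the per-site update frequency might be handled, the sketch below decides which sites are due for re-harvest under configurable intervals (24 hours for production, a weekly beta-style cadence). The site URLs, intervals, and timestamps are assumptions, not real registrations.

    # Scheduling sketch: which sites are due for re-harvest? (illustrative values only)
    from datetime import datetime, timedelta

    SITES = {
        "http://vivo.example-a.edu":     {"every": timedelta(hours=24),
                                          "last_indexed": datetime.now() - timedelta(days=2)},
        "http://profiles.example-b.edu": {"every": timedelta(weeks=1),
                                          "last_indexed": datetime.now() - timedelta(days=3)},
    }

    def due_for_harvest(now=None):
        """Return the sites whose configured update interval has elapsed."""
        now = now or datetime.now()
        return [site for site, cfg in SITES.items()
                if now - cfg["last_indexed"] >= cfg["every"]]

    print(due_for_harvest())   # only the 24-hour site is due in this example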

Features more related to the über project:

  • After beta phase, ability to analyze the Solr index (or to retain the harvested RDF for analysis) for duplicate names, in order to provide disambiguated results correlated against ORCID and services such as http://viaf.org (a first-pass duplicate scan is sketched after this list)
  • After beta phase, ability to provide web services back to distributed sites allowing them to choose people/places/journals/organizations from a central index to improve data quality and lower the cost of ongoing disambiguation
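
A first-pass duplicate scan could be as simple as grouping indexed person records by a normalized form of the name and flagging groups that do not share an identifier. The sketch below uses made-up records and field names; a real pass would read from the Solr index or the retained RDF, and unresolved groups would become candidates for ORCID or VIAF lookup.

    # Duplicate-name scan sketch (hypothetical records and field names).
    import unicodedata
    from collections import defaultdict

    records = [
        {"id": "site-a/n1", "name": "María Pérez", "orcid": "0000-0001-2345-6789"},
        {"id": "site-b/n7", "name": "Maria Perez", "orcid": "0000-0001-2345-6789"},
        {"id": "site-c/n3", "name": "Maria Perez", "orcid": None},
    ]

    def normalize(name):
        """Fold case and strip accents so 'María Pérez' and 'Maria Perez' collide."""
        folded = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
        return " ".join(folded.lower().split())

    groups = defaultdict(list)
    for rec in records:
        groups[normalize(rec["name"])].append(rec)

    for name, recs in groups.items():
        if len(recs) > 1:
            orcids = {r["orcid"] for r in recs}
            status = ("same ORCID iD" if len(orcids) == 1 and None not in orcids
                      else "needs review / VIAF lookup")
            print(name, [r["id"] for r in recs], status)
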
What hardware resources are needed to run the application? What level of effort in what roles? (note that this will be different from the development period)

Three parts to the code: the front end, the index, and the index builder

Need to spec out each of the components and the effort to bring each up to what is required, including the ability to swap out each part.

  • Very simple Drupal site that could probably alternatively use WordPress or Ruby; JavaScript libraries to enhance interaction and support responsive design for mobile devices
  • Tomcat and Solr, plus the ability to fine-tune the Solr index, query configuration(s), and relevance ranking (see the query sketch after this list)
  • Data manager to work with participating sites, educate new sites on how to prepare data, do quality control on data at first index, respond to inquiries about relevance ranking, and organize disambiguation efforts.
  • After beta phase, a programmer to work on disambiguation initiatives and on the development of services to offer disambiguated data back to participating sites (and potentially others) on a fee basis
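
To illustrate the kind of query configuration and relevance tuning this involves, here is a sketch of a Solr request using the edismax parser with field boosts and facets. The host, core name, and field names are assumptions about the index schema, not the production configuration.

    # Solr query sketch: edismax relevance boosts plus facets (assumed schema).
    import json
    import urllib.parse
    import urllib.request

    params = {
        "q": "climate change",
        "defType": "edismax",
        "qf": "name^5 title^3 keywords^2 text",    # weight name/title matches over full text
        "facet": "true",
        "facet.field": ["type", "institution"],    # facets that must scale with participating sites
        "rows": 10,
        "wt": "json",
    }
    url = "http://localhost:8983/solr/vivosearch/select?" + urllib.parse.urlencode(params, doseq=True)
    with urllib.request.urlopen(url, timeout=30) as resp:
        results = json.load(resp)
    print(results["response"]["numFound"])
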
What resources are needed for on-going maintenance?
  • Periodic review and updating of UI and device/browser independence
  • Ongoing improvement to the indexing code to run more efficiently, permit more frequent updates, detect duplicates dynamically, improve faceting of results, support network analysis and derivative services out of the corpus of data gathered
  • Ongoing hand-holding for existing and new users, including managing transitions in the VIVO ontology over time
What is required for application initialization?
  • The current http://vivosearch.org site is a simple Drupal site on a cloud VM with a minimal internal database, access to a Solr index housed on a server at Cornell; it has been very stable, needing attention perhaps 3 or 4 times in two years.
Who will provide post-implementation tech support to users?  How much will be needed?
  • This should be somebody familiar with RDF and the VIVO ontology, with Solr configuration, and with the idiosyncrasies of university data sources; an expert programmer would not be essential except at times of introducing new features, increasing efficiency, or significantly scaling the number of institutions participating
  • Over time one goal would be to size VIVO search at a scale that can support itself plus provide some ongoing funding for DuraSpace and VIVO efforts as a whole; a dedicated, technically qualified support person would help assure that stability

How does that data need to be made available to the linked data indexer?

Via Linked Data/HTTP
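
In practice this means the indexer issues ordinary HTTP requests with content negotiation for RDF. A minimal sketch, assuming a hypothetical profile URI:

    # Linked-data request sketch: plain HTTP GET asking for RDF/XML.
    import urllib.request

    uri = "http://vivo.example.edu/individual/n1234"   # placeholder, not a real profile
    req = urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        rdf_xml = resp.read()
        print(resp.headers.get("Content-Type"), len(rdf_xml))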

What is compatible data?

The VIVO multi-institutional search is predicated on harvesting and indexing RDF compatible with the VIVO ontology beginning with version 1.3. We do not believe that subsequent changes to the ontology since version 1.3 have been significant enough to require changes to the indexer, but this will need to be confirmed with every new VIVO release.

The vivosearch.org site demonstrates that Harvard Profiles produces linked data compatible with VIVO.
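
A rough compatibility check can be sketched with the rdflib library: parse the linked data for a profile and confirm that it declares foaf:Person resources and uses terms from the VIVO core namespace. This is only an illustration of the idea, not the project's actual validation code.

    # Compatibility-check sketch (assumes the rdflib library is installed).
    from rdflib import Graph, RDF, Namespace
    from rdflib.namespace import FOAF

    VIVO = Namespace("http://vivoweb.org/ontology/core#")

    def looks_vivo_compatible(url):
        g = Graph()
        g.parse(url)  # rdflib fetches and parses the RDF behind the linked-data URI
        has_people = any(True for _ in g.subjects(RDF.type, FOAF.Person))
        uses_vivo_terms = any(str(p).startswith(str(VIVO)) for p in g.predicates())
        return has_people and uses_vivo_terms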

 

Jonathan – Don't conflate services with sponsorship.

...