Deprecated. This material represents early efforts and may be of interest to historians. It does not describe current VIVO efforts.

The responses below are a first cut – please feel free to question, comment, or replace (Jon).

Organizational

How many institutions does the application currently support?

8 – 7 VIVO sites plus 1 Harvard Profiles site
How many institutions will be targeted?
  • initially 12-20; in 2 years, 100; ultimately (say, 5 years out) hard to say, because if that many universities want to participate, a commercial search engine company will jump in and do it better
    • if we're talking about the CTSAs as an initial group, who would that be?
      • a couple CTSAs plus some of our existing VIVO adopters
      • jon – might want to address the needs of our non-CTSA partners, as well as a small CTSA expansion that would get us 2 more platforms – Iowa and Northwestern
      • jonathan – a need for a CTSA-only network, and a need for an open network
    • then subsequently could offer participation to a larger group of CTSAs
    • have to balance the staging against external events like the VIVO conference and the CTSA fall PI meeting (October)
  • we are trying to support anyone who will provide VIVO ontology RDF in response to linked open data requests (to be most compelling, the next phase must demonstrate harvesting data from VIVO, Harvard Profiles, SciVal Experts, and Iowa's Loki. If the promised export work happens, Pittsburgh's Digital Vita and Stanford's CAP systems may also be able to provide data; ideally an institution could also participate through linked open data requests to static RDF files in a web-accessible directory)
  • will target a range of institutions including government (USDA), commercial (American Psychological Association), international (Melbourne, Griffith, Bournemouth, Eindhoven, ColPos, Memorial University of Newfoundland, etc), small (Scripps) and large (Florida, Colorado, Brown, Nebraska, Duke), and on diverse platforms (Harvard, UCSF, and Minnesota for Profiles, Northwestern for SciVal Experts, Iowa for Loki)

What are the roles? Need to both

  • define roles for a project, and
  • define roles for a production service

And to know the resources (including level of effort) required for both, working our way backward from a list of tasks.

We interpret this question as asking what skills and level of effort are needed for a development project; some roles might be taken on by the same person and/or for limited periods of time.

  1. To rewrite the indexing program: an indexing programmer familiar with Java and Hadoop
  2. To update/refresh the Drupal site, including its reliance on Solr and other modules: a web developer experienced with Drupal or an alternative open-source content management system with an established infrastructure to support Solr
  3. To adapt the UI for a larger number of institutions: UI designer experienced with developing search interfaces and comfortable working in Drupal (or the chosen platform if different)
  4. To coordinate distributed development: A technical lead
  5. A project manager to coordinate the execution, document requirements and projected deliverables, coordinate the rest of the team efforts, and develop the business model for an ongoing service
  6. A fundraising/business manager to coordinate the acquisition and allocation of resources and subscription fees, and to enroll service providers who could consult with sites to help them prepare and/or troubleshoot their data
  7. A marketing person to write up public-facing documentation, recruit new organizations, collect and collate feedback and feature requests
  8. Ontologist/data curator to consult on issues that come up with people's data in indexing and to plan for version changes in the VIVO ontology with each new release
  9. A system manager to provision, monitor, and troubleshoot the virtual machines needed to support indexing and the search website, as required in an AWS environment; this becomes more complex as the service starts scaling up
What will be the division of labor among the partners?

Initially, DuraSpace might continue to focus primarily on mentoring, as it does for VIVO activities in general. It could alternatively participate more actively in a search project, but that would require additional funds – to address marketing, system administration/support, and customer service support.

We expect that VIVO ontologists and developers would rewrite the indexing code, update the UI and web front end, and review candidate data. Project management might need to be hired.

Colorado has expressed willingness to contribute to development efforts, but the project is big enough that at least some developer time will need to be budgeted.

What are the primary "keys to success"? Not just features, but factors such as ease of participation by clients, the present

The need is to go beyond providing the obvious first win (integrated search) to addressing relevance ranking and frequency of update.

As Mike Conlon has pointed out, the search will not fully meet the underlying expectations of facilitating CTSA network analysis unless some of the immediately visible disambiguation problems are addressed. Data will include duplicate entries for the same persons, organizations, events, funding agencies, journals, etc., indexed under multiple different URIs from the different source systems; addressing this also means offering services to support linking from one university's research networking system to another.

Is there a market to attract service providers for readying data at campuses/organizations?

Yes, including Symplectic, SciVal Experts, Recombinant, potentially others, especially smaller consulting firms.

 

Technical

What is the ideal deployment environment?

Stable Linux VM running Apache/Drupal or Apache/Ruby (this could be rewritten to use Hydra/Blacklight)

Stable Linux VM hosting the Solr index (could probably initially be on 1 VM)

More dynamically allocated Linux VMs for indexing – Brian C. has worked on the dynamic spin-up of instances within the Hadoop framework, but that seems like an improvement to pursue after starting out with a small number of stable VMs that are imaged but started and stopped manually. In time, an on-demand mode of virtual machine usage, as Hadoop can manage, could help control costs; Andrew confirms the cost savings, which need to be balanced against high availability and redundancy. It is important to keep in mind the requirements on the applications in the VM – can everything that needs to be initialized be brought up without handholding?

After beta phase, move toward a staging server and/or load balancing – the web traffic will not likely be so large that we could not manage it in a day-to-day sort of way. It's the indexing that takes a significant amount of CPU time.
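As a rough sketch of the indexing work those VMs would run (class, field, and site names here are hypothetical assumptions, not the existing indexer's code), a Hadoop mapper could take one profile URI per input line, dereference it as linked data with Jena, and emit the harvested RDF for a downstream Solr-indexing step:

    import java.io.IOException;
    import java.io.StringWriter;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;

    /**
     * Hypothetical mapper: each input line holds the URI of one individual at a
     * participating site. The mapper dereferences the URI as linked data and
     * emits (URI, serialized RDF) pairs for a downstream Solr-indexing step.
     */
    public class LinkedDataHarvestMapper
            extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String uri = value.toString().trim();
            if (uri.isEmpty()) {
                return;
            }

            // Jena issues the HTTP GET with RDF content negotiation and parses
            // the response into an in-memory model.
            Model model = ModelFactory.createDefaultModel();
            model.read(uri);

            StringWriter rdf = new StringWriter();
            model.write(rdf, "N-TRIPLE");
            context.write(new Text(uri), new Text(rdf.toString()));
        }
    }

Running such mappers inside Hadoop is also what would make the on-demand spin-up and tear-down of indexing VMs straightforward.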

What is the set of features for the production application?
  • Index all RDF content from remote sites as frequently as on a 24 hr basis, though the beta might be less often (weekly, monthly, or quarterly) – the business model could reflect both size of data and frequency of update
    • include email addresses, local institutional identifiers, ORCiDs, ResearcherId, ScopusId, and other identifiers that might be useful
    • Jim points out that sites may not want to be hammered by frequent linked data requests
  • Faceted search interface with some ability to deal with scaling in the number of participating sites (a query sketch follows this list)
  • Ability to adjust relevance ranking
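A minimal sketch of the faceted search feature above, assuming a SolrJ client; the Solr URL, core name, and facet field names (institution, type) are placeholders rather than the actual vivosearch.org schema:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FacetedSearchSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder Solr URL and core name.
            SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/vivosearch").build();

            SolrQuery query = new SolrQuery("cancer");
            query.setFacet(true);
            query.addFacetField("institution");  // assumed facet fields
            query.addFacetField("type");         // e.g. person, organization, grant
            query.setRows(20);

            QueryResponse response = solr.query(query);
            for (FacetField facet : response.getFacetFields()) {
                System.out.println(facet.getName());
                facet.getValues().forEach(count ->
                        System.out.println("  " + count.getName() + " (" + count.getCount() + ")"));
            }
            solr.close();
        }
    }

Faceting by institution is one way the interface could stay usable as more sites join: results can be narrowed by site rather than listed flat.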

Features more related to the über project:

  • After beta phase, ability to analyze the Solr index (or keep the RDF and analyze that) for duplicate names, to be able to provide disambiguated results correlated against ORCID and services such as http://viaf.org
  • After beta phase, ability to provide web services back to distributed sites allowing them to choose people/places/journals/organizations from a central index to improve data quality and lower the cost of ongoing disambiguation
What hardware resources are needed to run the application? What level of effort in what roles? (note that this will be different from the development period)

There are 3 parts to the code: the front end, the index, and the index builder.

Need to spec out each of the components and the effort to bring each up to what is required, including the ability to swap out each part.

  • Very simple Drupal site that could probably alternatively use WordPress or Ruby; JavaScript libraries to enhance interaction and support responsive design for mobile devices
  • Tomcat and Solr and the ability to fine-tune a Solr index, query configuration(s), and relevance ranking (a tuning sketch follows this list)
  • Data manager to work with participating sites, educate new sites on how to prepare data, do quality control on data at first index, respond to inquiries about relevance ranking, and organize disambiguation efforts
  • After beta phase, a programmer to work on disambiguation initiatives and the development of services to offer disambiguated data back to participating sites (and potentially others) on a fee basis
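As a sketch of the relevance tuning mentioned above (the field names and boost values are illustrative assumptions, not the current configuration), the eDisMax query parser lets name matches outrank matches in free text:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class RelevanceTuningSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder core and field names; the real schema would differ.
            SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/vivosearch").build();

            SolrQuery query = new SolrQuery("semantic web");
            // Use the eDisMax parser and boost matches on names over matches in
            // free text, so people and organizations rank above publications
            // that merely mention the phrase.
            query.set("defType", "edismax");
            query.set("qf", "nameText^5 keywordText^2 allText");
            query.set("bq", "type:Person^1.5");   // mild boost for person records

            System.out.println(solr.query(query).getResults().getNumFound());
            solr.close();
        }
    }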
What resources are needed for on-going maintenance?
  • Periodic review and updating of UI and device/browser independence
  • Ongoing improvement to the indexing code to run more efficiently, permit more frequent updates, detect duplicates dynamically, improve faceting of results, support network analysis and derivative services out of the corpus of data gathered
  • Ongoing hand-holding for existing and new users, including managing transitions in the VIVO ontology over time
What is required for application initialization?
  • The current http://vivosearch.org site is a simple Drupal site on a cloud VM with a minimal internal database, access to a Solr index housed on a server at Cornell; it has been very stable, needing attention perhaps 3 or 4 times in two years.
Who will provide post-implementation tech support to users?  How much will be needed?
  • This should be somebody familiar with RDF and the VIVO ontology and with Solr configuration as well as the idiosyncrasies of university data sources; an expert programmer would not be essential except at times of introducing new features, increasing efficiency, or significantly scaling the number of institutions participating
  • Over time one goal would be to size VIVO search at a scale that can support itself plus provide some ongoing funding for DuraSpace and VIVO efforts as a whole; a dedicated technically-qualified support person would help assure the stability

How does the data need to be made available to the linked data indexer?

Via Linked Data/HTTP
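For example, the indexer – or a site checking its own output – can request RDF for a profile by plain HTTP content negotiation; the URI below is a placeholder:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class LinkedDataRequest {
        public static void main(String[] args) throws Exception {
            // Hypothetical profile URI at a participating site.
            URL url = new URL("http://vivo.example.edu/individual/n12345");

            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // Content negotiation: ask for RDF rather than the HTML profile page.
            conn.setRequestProperty("Accept", "application/rdf+xml");
            conn.setInstanceFollowRedirects(true);

            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }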

What is compatible data?

The VIVO multi-institutional search is predicated on harvesting and indexing RDF compatible with the VIVO ontology beginning with version 1.3. We do not believe that subsequent changes to the ontology since version 1.3 have been significant enough to require changes to the indexer, but this will need to be confirmed with every new VIVO release.

The vivosearch.org site demonstrates that Harvard Profiles produces linked data compatible with VIVO
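As a rough illustration of what a minimally compatible record might look like (the property choices here are a simplified assumption; the authoritative requirements are the VIVO ontology itself), a person can be published as a typed foaf:Person with a label, e.g. built with Jena:

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Resource;
    import org.apache.jena.vocabulary.RDF;
    import org.apache.jena.vocabulary.RDFS;

    public class MinimalVivoRecord {
        public static void main(String[] args) {
            String foaf = "http://xmlns.com/foaf/0.1/";
            String vivo = "http://vivoweb.org/ontology/core#";

            Model m = ModelFactory.createDefaultModel();
            m.setNsPrefix("foaf", foaf);
            m.setNsPrefix("vivo", vivo);

            // Hypothetical URI; each site publishes its own.
            Resource person = m.createResource("http://vivo.example.edu/individual/n12345");
            person.addProperty(RDF.type, m.createResource(foaf + "Person"));
            person.addProperty(RDFS.label, "Smith, Jane");
            person.addProperty(m.createProperty(vivo, "overview"),
                    "Research interests in translational informatics.");

            m.write(System.out, "TURTLE");
        }
    }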

 

Jonathan – Don't conflate services with sponsorship.

Have to spec out some high-level tasks – and make the choices:

  • swapping out the index builder
  • do we stay with Drupal or switch to Ruby?
  • how much change is necessary for the UI

 

 
