
Please note that these are proposed features for VIVO 1.6, not commitments by the VIVO development community.

Background

The VIVO 1.6 development time frame runs roughly from November through March. No release date has been set; the date will be heavily influenced by what is added to the list of features for the release and by the availability of development effort from the VIVO community, including the many VIVO installations outside the original NIH-funded VIVO grant, which ended in August 2012.

The list of candidate features and tasks below includes both major and minor items from earlier road maps and development planning documents. Where appropriate, stages are suggested for development and implementation, either to reflect necessary design time or to allow for refinement of an initial design based on user feedback.

Please also note that significant changes are being made to the VIVO ontology through the CTSAconnect project and collaborations with euroCRIS and CASRAI. These changes, when stabilized, will also very likely result in a new kind of release focused almost exclusively on ontology changes and the associated data migration required to convert existing VIVO installations to the modified ontology. Most of the changes will affect the modular structure of the ontology or the class and property names and/or labels.

The following proposed features are not in priority or temporal order.

Please add other ideas here, even tentative suggestions.

Performance

There are a number of possible routes to performance improvement for VIVO, and we seek input from the community on what the primary pain points are.

Page caching

If you need your VIVO to display all pages with more or less equivalent, sub-second rendering times, some form of page caching in front of Apache using a tool such as Squid will likely be in your future. Apache is very fast at serving static HTML pages, and Squid keeps a copy of every page rendered in a cache from which the page can be served directly rather than generated once again by VIVO. The good news:

  • VIVO's Solr (search) index has a separate field for each Solr document indicating the date and time the corresponding VIVO page was last indexed following a change; a Solr document corresponds to a page in VIVO.
  • If we make this last-update field available as part of the HTTP headers for a VIVO page, Apache and Squid can compare that datetime value to the datetime value in the Squid cache, and only re-generate the page from VIVO when the cached version has been superseded by changes in VIVO (see the sketch after this list).
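
A rough idea of that handshake, expressed as a servlet filter: compare the If-Modified-Since header on the incoming request against the Solr last-indexed time and answer 304 Not Modified when the cached copy is still current. This is a minimal sketch; SolrTimestampLookup is a hypothetical helper standing in for whatever lookup VIVO would actually provide.

    import java.io.IOException;
    import javax.servlet.*;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class LastModifiedFilter implements Filter {

        public void init(FilterConfig config) { }

        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            HttpServletRequest request = (HttpServletRequest) req;
            HttpServletResponse response = (HttpServletResponse) res;

            // Hypothetical helper: the Solr last-indexed time for the requested page.
            long lastIndexed = SolrTimestampLookup.lastIndexedMillis(request.getRequestURI());

            // HTTP date headers have one-second resolution, so compare in seconds.
            long ifModifiedSince = request.getDateHeader("If-Modified-Since"); // -1 if absent
            if (ifModifiedSince >= 0 && lastIndexed / 1000 <= ifModifiedSince / 1000) {
                // The copy in the Squid cache is still current; skip page generation.
                response.setStatus(HttpServletResponse.SC_NOT_MODIFIED);
                return;
            }

            // Expose the timestamp so Squid can validate its cached copy next time.
            response.setDateHeader("Last-Modified", lastIndexed);
            chain.doFilter(req, res);
        }

        public void destroy() { }
    }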

However, there are several issues to be resolved in making this work as anticipated:

  • We need to verify that the field holding the date and time of last update in Solr does indeed reflect all the changes that we think should be reflected. For example, if a 4th co-author on one of 150 publications in VIVO changes his or her middle name, should this trigger a re-index of every other co-author's entire VIVO page content? How far does the impact of an editing change get transmitted, and is that too far (bogging VIVO down) or not far enough (failing to reflect changes that do matter)?
  • Pages in VIVO may have many different incarnations depending on whether a user is logged in or not, and if logged in, depending on their level of editing privileges. Is there an easy (read: not computationally expensive) way to prevent use of the cache when a user is logged in and has editing privileges for the VIVO individual being requested?

Griffith University has implemented page caching and Arve Solland gave a talk on this and other aspects of the Griffith Research Hub at the 2012 VIVO conference.

Other approaches to performance improvement

There are also other ways to address performance that are arguably more effective in the long run:

  • improving server, Apache, Tomcat, and database configuration and tuning
  • implementing Memcached to cache the results of commonly used SPARQL queries (see the sketch after this list)
  • avoiding repeated SPARQL queries for the same data in the course of generating a single page
  • optimizing SPARQL queries
  • working around bugs in Jena's SDB that make queries against anything other than a single graph or the union of all graphs much less efficient
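
For the Memcached idea above, a minimal sketch using the spymemcached client, assuming a memcached daemon on the default local port; queryVivo() is a placeholder for VIVO's real query execution path.

    import java.net.InetSocketAddress;
    import java.security.MessageDigest;
    import net.spy.memcached.MemcachedClient;

    public class SparqlResultCache {
        private static final int TTL_SECONDS = 600; // cache entries for ten minutes

        private final MemcachedClient client;

        public SparqlResultCache() throws Exception {
            client = new MemcachedClient(new InetSocketAddress("localhost", 11211));
        }

        /** Return cached SPARQL results when present, otherwise query and cache. */
        public String resultsFor(String sparql) throws Exception {
            String key = "sparql:" + sha1(sparql); // memcached keys must be short, so hash the query
            String cached = (String) client.get(key);
            if (cached != null) {
                return cached;
            }
            String results = queryVivo(sparql);
            client.set(key, TTL_SECONDS, results);
            return results;
        }

        private String queryVivo(String sparql) {
            // Placeholder: in a real integration this would run the query
            // against the triple store and serialize the results.
            return "";
        }

        private static String sha1(String s) throws Exception {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(s.getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }
    }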

Installation and Testing

  • Adding a unit test for Solr that would build an index from a standard set of VIVO RDF, start Solr, and run standard searches (see the sketch after this list). This would help prevent re-introducing problems such as lack of support for diacritics, stop words, and capital letters in the middle of names.
  • Developing repeatable tests of loading one or more large datasets
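
A minimal sketch of what such a regression test could look like with JUnit and SolrJ, assuming a test Solr core named "vivocore" already populated from a standard set of VIVO RDF; SolrJ class names vary by version (older releases use HttpSolrServer).

    import static org.junit.Assert.assertTrue;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.junit.Test;

    public class SolrSearchRegressionTest {

        private final SolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/vivocore").build();

        @Test
        public void diacriticsAreSearchable() throws Exception {
            // "Müller" should match whether or not the query uses the umlaut.
            QueryResponse withDiacritic = solr.query(new SolrQuery("Müller"));
            QueryResponse withoutDiacritic = solr.query(new SolrQuery("Muller"));
            assertTrue(withDiacritic.getResults().getNumFound() > 0);
            assertTrue(withoutDiacritic.getResults().getNumFound() > 0);
        }

        @Test
        public void midNameCapitalsAreSearchable() throws Exception {
            // Names like "McEnerney" should not be broken by case handling.
            QueryResponse response = solr.query(new SolrQuery("McEnerney"));
            assertTrue(response.getResults().getNumFound() > 0);
        }
    }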

Site and Page Management

  • Making the About page and Home page HTML content editable through the admin interface
  • Offering improved options for content on the home page, including a set of SPARQL queries to highlight research areas, international focus, or the most recent publications (see the sketch after this list)
  • Offering additional individual page template options
  • Offering the ability to embed SPARQL query results in individual pages on a per-class basis – for example, to show all research areas represented in an academic department
  • Cornell is working on new individual page templates that include screen-captured versions of related websites for people and organizations, so that in addition to the link to the website we show either a small or large thumbnail of the page. This is done through a commercial image capture service that other sites may not want to use, so it will have to be configurable; another service might not provide the same API or resultant image size, however. In any case, the new individual page templates will have to be optional, since sites may have done a lot of customization work.
    • The service-specific aspects could be put in a sub-template that gets imported, and the default could be to not attempt to capture and cache images at all
    • There are free services out there, but they may not still exist in six months
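
For the home-page SPARQL query idea above, a minimal sketch using Jena's ARQ API; the file name is a placeholder, and the date-time properties follow the 1.5 core ontology (older VIVO releases bundle Jena under com.hp.hpl.jena rather than org.apache.jena).

    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.QuerySolution;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.riot.RDFDataMgr;

    public class RecentPublications {
        public static void main(String[] args) {
            // A sample of VIVO RDF; in the application this would be the live model.
            Model model = RDFDataMgr.loadModel("vivo-sample.n3");

            // The five most recent publications, by their date-time values.
            String sparql =
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n" +
                "PREFIX bibo: <http://purl.org/ontology/bibo/>\n" +
                "PREFIX core: <http://vivoweb.org/ontology/core#>\n" +
                "SELECT ?pub ?title ?date WHERE {\n" +
                "  ?pub a bibo:Document ;\n" +
                "       rdfs:label ?title ;\n" +
                "       core:dateTimeValue ?dtv .\n" +
                "  ?dtv core:dateTime ?date .\n" +
                "}\n" +
                "ORDER BY DESC(?date)\n" +
                "LIMIT 5";

            try (QueryExecution qe = QueryExecutionFactory.create(sparql, model)) {
                ResultSet results = qe.execSelect();
                while (results.hasNext()) {
                    QuerySolution row = results.next();
                    System.out.println(row.getLiteral("date") + "  " + row.getLiteral("title"));
                }
            }
        }
    }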

Content Curation

Support for sameAs statements

When two URIs are declared to be the same in VIVO, all the statements about both will be displayed for either (e.g., two instances of Israel). Improvements are needed, however:

  • there is no way via the current UI to add or remove owl:sameAs assertions – they have to be added as tiny RDF files via the add/remove RDF command (see the sketch after this list)
  • when another VIVO individual is linked to one or the other of these Israels, the application is not yet smart enough to show only one object property statement pointing to a single instance of Israel, so users see what looks like a duplicate of both the country and the relationship
  • Colorado has a use case to assert sameAs relationships between people's profiles in their university-wide VIVO and the separate implementation at the Laboratory for Atmospheric and Space Physics, where additional information about research projects, equipment, and facilities will be stored behind a firewall. They would like to pull data from the CU VIVO into the LASP VIVO dynamically, and pull any publicly visible data from LASP VIVO to supplement the CU VIVO content about a person.
  • Colorado also has a need to pull data from the Harvard Profiles system used by the University of Colorado Medical School in Denver to the CU VIVO without replicating more than is necessary. This is a similar use case to the connections needed between VIVO on the Cornell Ithaca campus and the Weill Cornell VIVO.
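
Until UI support exists, the "tiny RDF file" mentioned above can be generated with a few lines of Jena; the two individual URIs here are placeholders.

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Resource;
    import org.apache.jena.vocabulary.OWL;

    public class SameAsFile {
        public static void main(String[] args) {
            Model m = ModelFactory.createDefaultModel();
            Resource a = m.createResource("http://vivo.example.edu/individual/n123");
            Resource b = m.createResource("http://vivo.example.edu/individual/n456");
            m.add(a, OWL.sameAs, b); // assert the two individuals are the same

            // Write N-Triples, which the add/remove RDF command accepts directly.
            m.write(System.out, "N-TRIPLE");
        }
    }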

URI Tool

The URI Tool is a simple application designed to facilitate data cleanup in VIVO following ingest, often from multiple sources. The tool can be configured to run one of four or five pre-defined queries to identify journals, people, organizations, or articles with very similar names. A bare-bones editing interface allows a relatively untrained user to step through lists of paired or grouped candidates for merging, identify which existing properties to keep, and confirm that the candidates should be merged. Links to the actual entry in VIVO facilitate verification. When the review process is complete, the URI Tool application writes out both retraction and addition files, which can then be removed from or added to VIVO using commands on the ingest menu.

This tool does not replace the need for author disambiguation and other cleanup work prior to ingest, for which the Google Refine extensions for VIVO and the Harvester tool have been developed. However, it does have the potential to become a considerable time saver for cleaning data once loaded into VIVO.

  • further generalization and documentation of Joe McEnerney's URITool, including support for finding and removing/correcting data in only one named graph
  • improvements to the interface
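
A minimal sketch of the retraction/addition step described above, using Jena; ResourceUtils.renameResource rewrites every statement mentioning the duplicate URI so the same facts can be re-stated against the URI being kept. Selecting which properties to keep is left out here.

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.RDFNode;
    import org.apache.jena.rdf.model.Resource;
    import org.apache.jena.util.ResourceUtils;

    public class MergeUris {
        /** Build retraction and addition models for merging dupeUri into keepUri. */
        public static Model[] merge(Model vivoData, String keepUri, String dupeUri) {
            Resource dupe = vivoData.getResource(dupeUri);

            // Retract every statement in which the duplicate appears as subject or object.
            Model retractions = ModelFactory.createDefaultModel();
            retractions.add(vivoData.listStatements(dupe, null, (RDFNode) null));
            retractions.add(vivoData.listStatements(null, null, dupe));

            // Re-state the same facts against the URI being kept.
            Model additions = ModelFactory.createDefaultModel();
            additions.add(retractions);
            ResourceUtils.renameResource(additions.getResource(dupeUri), keepUri);

            return new Model[] { retractions, additions };
        }
    }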

Editing

  • VIVOIMPL-15 improve the permissions scheme for editing and make its functions more transparent to users
  • implement round-trip editing of VIVO content from Drupal or another tool external to VIVO via the SPARQL update capability of the RDF API introduced in VIVO 1.5 (see the sketch after this list)
  • improve editing of roles from the organization, event, or other entity that the role is realized in or contributes to
  • assess implications for the application of moving to fewer classes and properties and shifting the current differentiation of property groups to be based on a combination of a property and its related range class(es)
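
For the round-trip editing item above, a minimal sketch of an external tool posting a SPARQL UPDATE to VIVO; the endpoint path, parameter names, and graph URI are assumptions about how the RDF API might be exposed over HTTP, not a settled interface.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class ExternalEdit {
        public static void main(String[] args) throws Exception {
            String update =
                "INSERT DATA { GRAPH <http://vitro.mannlib.cornell.edu/default/vitro-kb-2> {\n" +
                "  <http://vivo.example.edu/individual/n123>\n" +
                "    <http://www.w3.org/2000/01/rdf-schema#label> \"New Label\" . } }";

            // Assumed authentication: the email and password of a VIVO account.
            String body = "email=" + URLEncoder.encode("vivo_root@example.edu", "UTF-8")
                        + "&password=" + URLEncoder.encode("secret", "UTF-8")
                        + "&update=" + URLEncoder.encode(update, "UTF-8");

            HttpURLConnection conn = (HttpURLConnection)
                    new URL("http://localhost:8080/vivo/api/sparqlUpdate").openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(body.getBytes("UTF-8"));
            }
            System.out.println("HTTP " + conn.getResponseCode());
        }
    }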

Other candidate issues relating to content editing

  • NIHVIVO-1126 support for editing rdf:type of individuals via the primary editing interface, not just the admin editing interface
  • NIHVIVO-1125 class-specific forms for entering new individuals
  • HTTP PUT and DELETE of data via linked open data (LOD) requests
  • NIHVIVO-715 pick lists could do a better job of remembering a user's previous choices

Ingest Tools

  • Integrating Mummi Thorisson's Ruby-based CrossRef lookup tool for searching and loading publications into VIVO, which is available on GitHub along with OAuth work for retrieving information from a VIVO profile in another application
  • Improving and documenting the Harvester scoring and matching functions
  • Implementing a web service interface (with authentication) to the VIVO RDF API, to allow the Harvester and other tools to add/remove data from VIVO and trigger appropriate search indexing and recomputing of inferences.

Internationalization

  • Moving text strings from controllers and templates to Java resource bundles so that other languages can be substituted for English (see the sketch after this list)
  • Internationalization for ontology labels – important because much of the text on a VIVO page comes directly from the ontology
  • Improving the VIVO editing interface(s) to support specification of language tags. Note that VIVO 1.5 will respect a user's browser language preference setting and filter labels and data property text strings to only display values matching that language setting whenever versions in multiple languages are available – but there has not yet been a way to specify language tags on text strings.
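
A minimal sketch of the two pieces: looking up UI strings from a per-locale resource bundle, and attaching a language tag to a label when writing RDF. The bundle base name and property key are illustrative assumptions.

    import java.util.Locale;
    import java.util.ResourceBundle;

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.vocabulary.RDFS;

    public class I18nExample {
        public static void main(String[] args) {
            // UI text: one bundle per language, e.g. all.properties, all_es.properties.
            ResourceBundle strings = ResourceBundle.getBundle("all", new Locale("es"));
            System.out.println(strings.getString("home_page_title"));

            // Data text: the same label stored once per language, with a language tag.
            Model m = ModelFactory.createDefaultModel();
            m.createResource("http://vivo.example.edu/individual/n123")
             .addProperty(RDFS.label, m.createLiteral("Departamento de Biología", "es"));
            m.write(System.out, "N-TRIPLE");
        }
    }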

Provenance

Adding support for named graphs in the UI

  • Allowing the addition of statements about any named graph such as its source and date of last update
  • Making this information visible in the UI (e.g., on mousing over any statement) to inform users of the source and date of any statement, at least for data imported from systems of record (see the sketch after this list)
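
A minimal sketch of what the underlying data could look like: statements about the graph URI itself, kept in a dedicated provenance graph. The Dublin Core properties are one plausible choice; the exact vocabulary is an open design question, and the graph URIs are placeholders.

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.DatasetFactory;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.vocabulary.DCTerms;

    public class GraphProvenance {
        public static void main(String[] args) {
            Dataset dataset = DatasetFactory.create(); // stand-in for VIVO's real dataset
            String dataGraph = "http://vivo.example.edu/graph/hr-feed";

            // Keep the metadata in a dedicated provenance graph, with the data
            // graph's own URI as the subject of the statements.
            Model prov = dataset.getNamedModel("http://vivo.example.edu/graph/provenance");
            prov.createResource(dataGraph)
                .addProperty(DCTerms.source, prov.createResource("http://hr.example.edu/"))
                .addProperty(DCTerms.modified, "2013-02-01");

            prov.write(System.out, "N-TRIPLE");
        }
    }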

Visualization

  • improved caching of visualization data
  • improved scalability (e.g., for a Map of Science from the 32,000 plus publications in the University of Florida VIVO)
  • HTML5 (phasing out Flash)

Data Query and Reporting

  • Limiting SPARQL queries by named graph, either via inclusion or exclusion – this is allegedly supported by the Virtuoso triple store. This would help assure that private or semi-private data in a VIVO is not exposed via a SPARQL endpoint (see the sketch after this list)
  • There are other possible routes for extracting data from VIVO including linked data requests – if private data is included in a VIVO, all query and export paths would also have to be locked down. Linked data requests respect the visibility level settings set on properties to govern public display, but separate more restrictive controls may be required for linked data.
  • Enhancing the internal VIVO SPARQL interface to support add and delete functions, not just select and construct queries
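
A minimal sketch of graph-scoped querying with Jena: FROM clauses build the query's dataset by inclusion, so a public endpoint could be restricted to a whitelist of graphs. The graph URIs are placeholders.

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.DatasetFactory;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.ResultSetFormatter;

    public class GraphScopedQuery {
        public static void main(String[] args) {
            Dataset dataset = DatasetFactory.create(); // stand-in for VIVO's real dataset

            // Only the listed graphs are visible to the query; private graphs
            // are simply never part of the dataset it runs against.
            String sparql =
                "SELECT ?s ?label\n" +
                "FROM <http://vivo.example.edu/graph/public-people>\n" +
                "FROM <http://vivo.example.edu/graph/public-publications>\n" +
                "WHERE { ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label }\n" +
                "LIMIT 10";

            try (QueryExecution qe = QueryExecutionFactory.create(sparql, dataset)) {
                ResultSetFormatter.out(qe.execSelect());
            }
        }
    }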

CV generation

  • Improving the execution speed and formatting of the existing Digital Vita CV tool as implemented in the UF VIVO, perhaps changing it to email the generated CV as a rich text or PDF document asynchronously
  • Developing a UI for selecting publications or grants and adding required narrative elements, based on the specification developed in the Digital Vita VIVO mini-grant

Weill's VIVO Dashboard

Paul Albert has been working with a summer intern and others at Weill Cornell to develop a Drupal-based tool for visualizing semantic data. This project provides a number of candidate visualizations and reports that will likely be of interest to other VIVO adopters, and there may be enhancements to VIVO that would make this kind of reporting dashboard easier to implement.

Search Improvements

Indexing improvements

  • Providing a way to re-index by graph or for a list of URIs, to allow partial re-indexing following data ingest as opposed to requiring a complete re-index (see the sketch after this list)
  • Improving the efficiency and hence speed of search indexing in general
  • improved default boosting parameters for people, organizations, and other common priority items
  • an improved configuration tool for specifying parameters to VIVO's search indexing and query parsing
  • a concerted effort to explore what search improvements Apache Solr can support and recommendations on which to consider implementing in what order
  • implementation of additional facets on a per-classgroup basis – appropriate facets beyond rdf:type, varying based on the nature of the properties typically present in search results of a given type such as people, organizations, publications, research resources, or events
  • note the search unit test proposed above under Installation and Testing.
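
For the partial re-indexing item above, a minimal sketch with SolrJ: delete only the documents for the affected individuals and re-add fresh ones. buildDocumentFor() is a placeholder for VIVO's real document construction, and using the individual URI as the Solr document id is an assumption about the schema.

    import java.util.List;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class PartialReindex {

        /** Re-index just the given individuals after a targeted data ingest. */
        public static void reindex(SolrClient solr, List<String> uris) throws Exception {
            solr.deleteById(uris); // document ids assumed to be the individual URIs

            for (String uri : uris) {
                SolrInputDocument doc = buildDocumentFor(uri);
                solr.add(doc);
            }
            solr.commit();
        }

        private static SolrInputDocument buildDocumentFor(String uri) {
            // Placeholder: VIVO's document builder would populate name, type,
            // and text fields from the model.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("URI", uri);
            return doc;
        }
    }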

Modularity

Jim Blake did significant work during the 1.5 development cycle learning about the OSGi framework and exploring how it could be applied to VIVO, as documented at Modularity/extension prep - development component for v1.5.

  • Yin's alternate search approach at NYU, which indexes everything in the context of connections to people and displays results only for people, could be of interest to others, but it would require modularity in the search indexing code as well as in the other ways the search index integrates with VIVO