Calls are held every Thursday at 1 pm Eastern Time (GMT-4 during daylight saving time, GMT-5 standard time) – convert to your local time at http://www.thetimezoneconverter.com

View and edit this page permanently at https://wiki.duraspace.org/x/O-MQAgJfcQAg, or use the temporary Google Doc for collaborative note taking during the call.

VIVO is hiring!

DuraSpace is seeking a dynamic and entrepreneurial Project Director for the open source VIVO project (www.vivoweb.org), a world-wide community focused on creating software tools, ontologies, and services. The VIVO Project Director will have the opportunity to play a major role in a collaborative movement that will shape the future of research.

See the full posting – applications close on or about October 23. Note that there is no requirement to be a U.S. citizen.

Release update

Hoping to start testing next Monday when Jim returns. No release candidate has been created yet – progress is being made each day.

Apps and Tools Group

Notes from the Sept. 24 meeting were recorded as a webcast showing a Python data checker for VIVO developed at the University of Florida.

Next meeting is on Tuesday in two weeks (October 8) at 1 pm Eastern.

Demonstrated a set of Python tools developed at UF that run a set of SPARQL queries nightly to detect malformed or missing connections, duplicate identifiers, and data that should not be in VIVO due to privacy concerns. Reports come back as plain text that is emailed out, and they are structured to return a value of zero when there are no anomalies to address. So far it's just a notification tool; it doesn't do the cleanup.

The queries are fairly generic and are intended to be easily modified through a configuration file, or they could be run using a tool like curl.

The tools are available at http://github.com/nrejack/dchecker and, while still under development, are already usable. A demonstration video is available at http://www.youtube.com/watch?v=8Lz4V7HuETk.
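
For illustration, a minimal sketch of the nightly-check idea in Python (the endpoint URL, the query, and the use of the requests library are assumptions for this example, not taken from the dchecker code):

    # Minimal sketch of one nightly anomaly check (illustrative only).
    # Assumes a VIVO SPARQL endpoint at the hypothetical URL below.
    import requests

    ENDPOINT = "http://vivo.example.edu/vivo/api/sparqlQuery"  # hypothetical
    QUERY = """
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT (COUNT(?p) AS ?count)
    WHERE { ?p a foaf:Person . FILTER NOT EXISTS { ?p foaf:lastName ?ln } }
    """

    resp = requests.post(ENDPOINT, data={"query": QUERY},
                         headers={"Accept": "application/sparql-results+json"})
    resp.raise_for_status()
    count = int(resp.json()["results"]["bindings"][0]["count"]["value"])

    # Reports are structured so that zero means nothing needs attention.
    if count != 0:
        print("ANOMALY: %d people with no last name" % count)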

Chris put examples of the Apache rewrite rules in last week’s Implementation and Development call notes, along with code used to generate the list of mapping rules from UF Gator IDs to VIVO URIs.
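
As a rough illustration of that generation step (hypothetical file names and data; not UF's actual code), one could emit an Apache RewriteMap file from a CSV of ID-to-URI pairs:

    # Hypothetical sketch: build an Apache RewriteMap text file that maps
    # Gator IDs to VIVO URIs from a CSV of (gator_id, vivo_uri) rows.
    import csv

    with open("gatorid_to_uri.csv", newline="") as src, \
         open("gatorid.map", "w") as dest:
        for gator_id, vivo_uri in csv.reader(src):
            # RewriteMap plain-text format: one "key value" pair per line
            dest.write("%s %s\n" % (gator_id, vivo_uri))

Apache can then load the file with a RewriteMap directive and look keys up from a RewriteRule.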

Stephen Williams from the University of Colorado will host the next call; Stephen has posted an agenda to the vivo-dev-all mailing list.

Paul -- great that you are recording the sessions and posting them to YouTube.

Upcoming Events

  • 2nd Annual CASRAI International Conference, October 16-18 in Ottawa
    • Conference streams: Reconnect Big Data, Reconnect the Library, and Reconnect the Machine
    • http://reconnect.casrai.org
    • Jon will be presenting on VIVO, along with Memorial University
  • 1st Annual UCosmic Conference, October 31 in New York

Updates

  • Brown (Ted) - finalizing the public rollout schedule; finalizing the import of data from the existing profile system; working on scripts to find new publications for faculty from the Web of Science (using the Lite API), PubMed, and Springer metadata APIs.
  • Colorado (Alex) - No major updates on the VIVO implementation. Working on bringing publications data into Elements for the first 1,200 faculty/researchers; about one-third of the way through the first pass of curation. Finding DOIs very useful where present, including in some of the BibTeX or RIS data sources like EBSCO, ProQuest, and Google Scholar. Elements has a BibTeX importer, but it has to be filtered by the individual author and assumes the author is already claimed, rather than accepting a BibTeX file of publications for a large group of authors with the option of marking the publications as pending approval by each author. Bringing up a new server for VIVO 1.6.

  • Cornell (Brian, Tim, Jon, Huda) -- working on VIVO 1.6, merging in the changes for the VIVO-ISF ontology and grepping the rest of the code for examples of property names that have changed. Adding the ability to pull in Library of Congress Subject Headings (LCSH) with the LOC-assigned URIs.

  • Duke (Richard) - Mainly working on some data cleanup tasks with orphaned entities - like publications that aren’t linked to anyone.

  • EPA (Cristina, Zac) - No updates on going live; hoping there is no government shutdown next week to delay us. Currently working with the Freemarker SPARQL Data Getters for some custom reports (Alumni and Expertise) and having a few issues. Tried out the example, and we think it works, but we don't have any orgs classified as an academic department. Trying to find where people went to school so we can see all the people at EPA who went to a given school, as well as all the people associated with an EPA-defined vocabulary term defined in SKOS vs. terms brought in from UMLS or GEMET (this may be related to the external terms only being typed as owl:Thing). Also getting some duplicate values. The actual queries work, but when we try to make a custom template (.ftl) to better format the information, it fails.

  • Florida (Nicholas) -- see update above from the Apps & Tools meeting

  • RPI (Patrick West, Yu Chen) - See the ticket mentioned below. Also a question about authorization policies -- want users to be able to see only certain information from VIVO. Using Drupal for some authentication and creating groups in Drupal to manage this; it's not so much about the authorization piece for editing -- it's more about which data or visualizations to display given group membership. Will put together a use case or two or three and share them.

  • Stony Brook (Tammy) - Working on process flow for gathering and transforming information for the new vivo.stonybrookmedicine.edu website, starting with basic demographic information and grants, with publications to come later. Also changed the name from vivo.stonybrook to vivo.stonybrookmedicine to reflect institutional preference and engagement.

  • Weill Cornell (Paul) – Going live on January 7. Interested in hearing back about performance testing. Have a new, better-performing server set up, but still concerned about performance under multiple simultaneous editing sessions. Using a server-based performance monitoring tool.

  • Colorado (Alex & Stephen) - In the middle of Elements curation and behind on listserv responses

    • Working on Elements publications curation for 2013

    • Stephen will be catching up on VIVO emails after recovery from flooding

  • Cornell (Jon, Jim, et al.)

    • 1.6, 1.6, 1.6, 1.6, 1.6…

  • Duke (Richard)

    • Reloading grants data from our source system. We put grants into their own graph and then wipe/reload that graph during a full grant load; most days we just do an incremental load.

    • The search re-indexing process is taking a really long time, sometimes >5h, sometimes ~1.5h -- it doesn't seem to correspond to the number of new triples -- looking forward to incremental re-indexing

      • UF had an issue with bad characters taking a long time to fail

      • new version of Solr in latest VIVO repo

      • Jon -- any correlation to inferencing? Richard indicated no; they don't run the inferencer, since they ingest all the triples it would produce -- using a Ruby script that they could possibly extract and share

  • Florida (Chris)

    • Had a second successful run of the people ingest from PeopleSoft

      • Developing a weekly process

      • Deploying ingest from git repository

    • Working on visualizations with d3.js (http://d3js.org/) and JSON -- which can be generated from within VIVO -- the JavaScript visualizations seem fast! Will probably demo in the last Apps & Tools call of October.

  • NYU (Yin)

    • Talking to the production group about graduating the project -- it has been run as a dev/research project so far

    • Been using an intermediate data format for getting data into VIVO, but the production group wants to connect VIVO (?) to the enterprise data warehouse -- are there best practices for this? Jon clarified whether they want a realtime connection vs. ETL and suggested that the closer the transformation gets to RDF, the easier it is to bring the data into VIVO. Ted is happy with Python and RDFLib; UF has been using Python and is starting to use RDFLib (see the RDFLib sketch after these updates).

    • https://github.com/ufvivotech/ufDataQualityImprovement/tree/master/vivotools 

    • https://github.com/nrejack/dchecker 

    • Question about not using the front end, but rather the back-end RDF (via XML or URLs) and Solr search?

    • Also asked about VIVO hardware requirements; Chris suggested the AWS specs on the wiki:

      • AWS specs for UFL VIVO hosts:

        • X-Large-Memory (m2.xlarge): 17.1 GB RAM, 6.5 EC2 compute units, 420 GB storage, 64-bit, moderate I/O

        • 2X-Large-Memory (m2.2xlarge): 34.2 GB RAM, 13 EC2 compute units, 850 GB storage, 64-bit, high I/O

  • Scripps (Michaeleen)

    • Stella has a working version of the grants ingest from NIH RePORTER. The ingest program was written for VIVO 1.5.1 -- not sure if she should share it for that reason? Jon: it would be helpful to post it regardless!

    • Stella is also working on authorship representation.

    • Representing patents

  • Stony Brook (Tammy)

    • Using JSON as the data interface between the Java and Python development efforts

  • UCSF (Eric)

    • Bringing in grants from NIH ExPORTER -- Jon mentioned the concern of annual updates to long-running (25-year) NIH grants -- Stella has looked into how best to represent these in the VIVO ontology

    • Author registry idea; it would be compatible with ORCID and include ORCID IDs -- aiming for lower policy hurdles

    • Has anyone looked at the Project Honeypot tools to keep bad traffic away from a site? Its HTTP Blacklist catches around 10,000 HTTP requests per day. UF also blocked web spiders that don't honor robots.txt from CPU-heavy pages like the visualizations.

  • Weill Cornell (Paul)

    • Reconciling self-reported publication data with data from the VIVO instance -- very few pubs were rejected; many were duplicates already in VIVO -- Ted offered some good advice

    • template updates

      Ted -- the sysadmin at Brown did testing with JMeter and developed a suite of tests that included logging in and making edits to a publication; JMeter has lots of tools for simulating, say, 10 users at a time. At Brown the effort was to get a baseline measure of performance.
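
On the NYU data-warehouse question above, a minimal sketch of the "transform close to RDF" approach with Python and RDFLib (the local namespace and warehouse row are made up for the example; the FOAF terms match VIVO's use of FOAF for people):

    # Hypothetical ETL sketch: turn one warehouse row into VIVO-style RDF
    # with RDFLib, then serialize it for loading into VIVO.
    from rdflib import Graph, Literal, Namespace, RDF, RDFS

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")
    LOCAL = Namespace("http://vivo.example.edu/individual/")  # hypothetical

    row = {"id": "n1234", "first": "Ada", "last": "Lovelace"}  # fake row

    g = Graph()
    person = LOCAL[row["id"]]
    g.add((person, RDF.type, FOAF.Person))
    g.add((person, RDFS.label, Literal(row["last"] + ", " + row["first"])))
    g.add((person, FOAF.firstName, Literal(row["first"])))
    g.add((person, FOAF.lastName, Literal(row["last"])))

    # N-Triples output can be loaded through VIVO's Add/Remove RDF admin page.
    g.serialize(destination="person.nt", format="nt")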

Notable list traffic

See the vivo-dev-all archive and vivo-imp-issues archive for complete email threads

Trying to subclass VitroHttpServlet to create a customized view controller: which file should we change in order to register the new view controller? How do we wire the new view controller to other view controllers, and how should we process the VitroRequest instance to redirect to another view controller? (Yu@RPI) If the objective is to go back to a specific URL, there are methods in the Generator class that allow specifying where the user should be directed following a custom form (if this is what you are trying to do). Yu also wants to customize the page for editing a property of an instance; there's a post() request from the page, and he is trying to find the corresponding doPost() method that resolves it. (Huda) -- we normally use a Generator class to accomplish this. (Yu) -- wants to change to a multipart POST request so that images (documents, datasets, whatever) can be uploaded along with the submitted data.

For the RPI question, can we get a contact email for both people who answered the two possibilities (the Tomcat controller, and the method in the Generator class)? hjk54@cornell.edu

Is it okay to have two different VIVO instances running on the same server? Can both websites use the same Solr index, or what is the best way to do it? (Gawri@Queensland University of Technology) -- see Jon's answer under item 5 below.

1. PubMed Harvester doesn't like particular records (Lynda, Andy)

  • The Harvester is essentially unusable with PubMedFetch, due to bugs in code from NIH.  Some records in PubMed have data which is not correctly handled by the NIH code. It's possible to work around these bugs by using PubMedHTTPFetch instead of PubMedFetch.  However, you need to URL-encode your search request if using the HTTP version.
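
For example, a quick way to produce the encoded form of a search term (a sketch; the exact Harvester configuration details may differ):

    # Sketch: percent-encode a PubMed search term for PubMedHTTPFetch.
    from urllib.parse import quote

    term = 'Smith J[Author] AND "2013"[PDAT]'
    print(quote(term))  # Smith%20J%5BAuthor%5D%20AND%20%222013%22%5BPDAT%5D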

2. ExternalAuthId and named graphs – fixed by Jim and tested by Ted: VIVO-305 (matching property should work if the triple is in a named graph, not just kb-2) – RESOLVED

3. VIVO and TDB (Michel, Ted, JohnF) – have VIVO working with a TDB database, instead of SDB (so a relational database back end store is no longer needed)

  • Ted: Fuseki 1.0 was released last week, and I was able to get it to connect to an instance of VIVO 1.5 using the same endpoints you specified.

  • Michel: I now want to write a java program with Jena, where I insert data into the TDB. I want to use the Jena api, with model.createResource and resource.addProperty and so on.

  • Ted: I use Python and RDFLib [1] for VIVO data loading. RDFLib, as of version 4, supports SPARQL 1.1, so you could use that to write directly to Fuseki.

    • As for learning about the VIVO ontology, one technique that I've heard recommended and find useful is to use the VIVO admin to create the resources that you want to load (FacultyMember, Book, etc.) and then inspect the RDF that is generated to see how the data is modeled. VIVO will serve Turtle for a resource (e.g. n1234) by pointing your browser at http://localhost:8080/vivo/rdf/n1234/n1234.ttl (a sketch of this workflow follows below).

  • JohnF: Specifically, take a look at the org.vivoweb.harvester.util.repo.JenaConnect class. It's an abstract class that is extended by SDBJenaConnect and TDBJenaConnect. It should give you a good idea what you'll need to do to insert RDF into VIVO using the Jena API.
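
A short sketch combining Ted's two suggestions, in Python with RDFLib (the resource URI and Fuseki dataset paths are placeholders):

    # Sketch: inspect the Turtle VIVO serves for an existing resource,
    # then write triples to Fuseki via RDFLib's SPARQL 1.1 support.
    from rdflib import Graph, Namespace, RDF, URIRef
    from rdflib.plugins.stores.sparqlstore import SPARQLUpdateStore

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")

    # 1. Learn the modeling from a resource created in the VIVO admin UI.
    example = Graph()
    example.parse("http://localhost:8080/vivo/rdf/n1234/n1234.ttl",
                  format="turtle")
    print(example.serialize(format="turtle"))

    # 2. Write new data straight to a Fuseki dataset (placeholder paths).
    store = SPARQLUpdateStore()
    store.open(("http://localhost:3030/vivo/query",
                "http://localhost:3030/vivo/update"))
    remote = Graph(store)
    remote.add((URIRef("http://vivo.example.edu/individual/n5678"),
                RDF.type, FOAF.Person))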

4. Finding grants via investigator name (Michaeleen) – having an investigator relationship with a grant is not sufficient to make the grant show up in the results for a search on person name.

5. Uploading an image when editing a property of an individual (Yu, Huda, JohnE, PatrickW, BrianC)

Jon (on the Queensland question above): my instinct is one VIVO per Tomcat for anything that is going to production -- we've run multiple VIVOs on development machines but notice that performance degrades when we add significant amounts of data. The general approach, if you have to run more than one VIVO on the same server, is to have an individual Tomcat per VIVO, running on different ports. For example, Stony Brook runs 2 VIVOs on 2 separate virtual machines.

Call-in Information

Date: Every Thursday, no end date

Time: 1:00 pm, Eastern Daylight Time (New York, GMT-04:00)

Meeting Number: 641 825 891

To join the online meeting

Go to  https://cornell.webex.com/cornell/e.php?AT=WMI&EventID=167096322&RT=MiM2

If requested, enter your name and email address.

Click "Join".

To view in other time zones or languages, please click the link:  https://cornell.webex.com/cornell/globalcallin.php?serviceType=MC&ED=167096322&tollFree=1

If those links don't work, please visit the Cornell meeting page and look for a VIVO meeting.

To join the audio conference only

To receive a call back, provide your phone number when you join the meeting, or call the number below and enter the access code.

Call-in toll-free number (US/Canada): 1-855-244-8681

Call-in toll number (US/Canada): 1-650-479-3207

Global call-in numbers:  https://cornell.webex.com/cornelluniversity/globalcallin.php?serviceType=MC&ED=161711167&tollFree=1

Toll-free dialing restrictions:  http://www.webex.com/pdf/tollfree_restrictions.pdf

Access code: 645 873 290