Purpose

  • The project charter defines the scope, objectives, and overall approach for the work to be completed. It is a critical element for initiating, planning, executing, controlling, and assessing the project. It should be the single point of reference on the project for project goals and objectives, scope, organization, estimates, work plan, and budget. In addition, it serves as a contract between the Project Team and the Project Sponsors, stating what will be delivered according to the budget, time constraints, risks, resources, and standards agreed upon for the project.

Executive Summary

  • Currently, version 1.0 of the VIVO Harvester application has been successfully used to harvest grant information from the UF Division of Sponsored Research (DSR). However, there is no mechanism in place to allow harvesting of DSR data on a recurring schedule. The purpose of this project is to plan, design, and implement a recurring harvest of DSR data into VIVO at the University of Florida. Ultimately, this project will lay the groundwork for future implementations of recurring data harvesting from additional data sources. This project will be referred to as “DSR Reproducible Harvest”. It is likely that this will be an upgrade to the existing Harvester application rather than a standalone application. This project is considered to require a minimal time investment while delivering high value to VIVO @ UF upon successful completion. This project should define a reproducible process that supports harvesting any data source with the Harvester.
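
  To make the recurring schedule concrete, below is a minimal sketch of a cron entry that would launch a weekly harvest. The script name run-dsr.sh and the log path are hypothetical; the install and example-script paths follow the Addendum.

    # Hypothetical crontab entry: run the DSR harvest every Sunday at 02:00.
    # run-dsr.sh and the log path are assumptions; the install path is from the Addendum.
    0 2 * * 0 /usr/share/vivo/harvester/example-scripts/example-dsr/run-dsr.sh >> /var/log/dsr-harvest.log 2>&1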

Goals

  • The following are major goals of this project:
    • Assemble a project team and define roles
    • Analyze and document the current status of the Harvester as it pertains to a DSR harvest
    • Analyze the current status of the development, staging and production VIVO data as it pertains to DSR data
    • Design the DSR Harvest software
    • Write a functional specification
    • Build the DSR Harvest software
    • Build a logging and email notification system
    • Write a technical specification
    • Implement, test and refine the system in a development environment
    • Implement, test and refine the system in a staging environment
    • Implement and test the system in a production environment
    • Contribute the software to the community via SourceForge

Objectives

  • Team assembled
  • Roles defined and disseminated
  • Project plan defined
  • Timeline created
  • Harvester analyzed
  • VIVO development, staging, production environment analyzed
  • Harvester for DSR designed
  • Functional specification written
  • DSR Harvest application built
  • Technical specification written
  • Development version implemented, tested, refined
  • Staging version implemented, tested, refined, approved by sponsors
  • Production version implemented, tested, approved by sponsors
  • Notification system implemented
  • Server specification(s) updated to reflect implementation of the Harvester
  • Source code contributed to the community site at SourceForge

Scope

  • The scope of this project is limited to reproducible harvesting of DSR data at the University of Florida. No additional data will be harvested or tested as a part of this project. This project does not include support for the DSR Harvest beyond the production implementation. Support will be provided through the normal channels of communication via SourceForge. This project is not a new feature of the current Harvester. It is a separate application that will use the Harvester as a tool to complete its processing.

Assumptions

  • It is assumed that Harvester 1.0 is able to successfully harvest and map data into VIVO.
  • It is assumed that Harvester 1.0 is able to re-harvest and ignore existing or unchanged data.
  • It is assumed that all Grants are removed from VIVO prior to the first harvest of data from DSR.
  • It is assumed that each person working on the project ensures that he/she dedicates a reasonable amount of time to the project with regard to FTE on the VIVO grant.
  • It is assumed that DSR representatives will not charge a fee for meetings, data feeds, or data acquisition.
  • It is assumed that CTRIP will have Alex Rockwell as a resource in person for at least three days a week until completion of the project as defined by the project manager.

Risks

  • A medium risk exists that scope creep may occur. To mitigate this risk, the project charter will be reviewed on a weekly basis to ensure that any tasks out of scope are presented to the sponsors.
  • A high risk exists that the dates of milestones defined in the timeline may not be met for many reasons. To mitigate this risk, all actors in the project will have ongoing access to the timeline and will be expected to review it on a weekly basis.
  • A low risk exists that certain actors will not be available when needed. To mitigate this risk, meetings with the sponsor(s) will be held weekly so decisions about how to proceed can be agreed upon.
  • A low risk exists that the DSR data will not be clean. To mitigate this risk, a process will be defined that excludes data that does not fit the minimum requirements for the Harvester.
  • A medium risk exists that the specifications may not exist or may not accurately represent data mappings. To mitigate this risk, the functional specification designed for this project will clearly define data mapping.
  • A medium risk exists that logging and notifications may be difficult to represent accurately in an automated fashion.
  • A medium risk exists that the Harvester will have bugs or may not be able to reproduce a harvest successfully. To mitigate this risk, thorough discovery and testing will be conducted with the developers of the Harvester.
  • A risk exists that the systems and infrastructure may not support the required software or processes needed to implement the reproducible harvesting. To mitigate this risk, a thorough evaluation of the Development server will be conducted after the design phase of the project.

Organization

Person               Organization   Role
Mike Conlon          PI             Sponsor
Valrie Davis         MSL            Sponsor
Narayan Raum         CTRIP          Project Manager
Christopher Barnes   CTRIP          Harvester Project Manager
James Pence          CTRIP          Harvester Technical Contact
Alex Rockwell        MSL            Implementation Expert
Logan Clapp          MSL            Implementation Expert
Nick Dunham          DSR            Data source

Resources (Costs)

Person               FTE
Mike Conlon          .05
Valrie Davis         .1
Narayan Raum         .25
Christopher Barnes   .1
James Pence          .25
Alex Rockwell        .66

Timeline

  • The following is a general overview of the project timeline. For a more detailed timeline, refer to the DSR Harvest timeline spreadsheet, available as a Google Doc shared by the project manager.

Item                           Start Date   Finish Date
Project Start                  4/22/2011    4/22/2011
Project Plan                   4/25/2011    4/29/2011
Discovery                      5/2/2011     5/6/2011
Design                         5/9/2011     5/13/2011
Build                          5/16/2011    5/27/2011
Dev Implementation             5/30/2011    6/3/2011
Staging Implementation         6/6/2011     6/10/2011
Production Implementation      6/13/2011    6/17/2011
Contribute Source to SF Site   6/20/2011    6/24/2011
Monitor Harvesting             6/20/2011    6/24/2011
Project Complete               6/24/2011    6/24/2011

Addendum

Installation

  • .deb package, v1.1.1, from SourceForge
  • Install with dpkg
  • Installs to /usr/share/vivo/harvester/
  • A shell script exists, specific to DSR at UF
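
  A minimal sketch of the install step, assuming the downloaded package is named harvester_1.1.1.deb (the actual file name on SourceForge may differ):

    # Install the Harvester package (file name is an assumption)
    sudo dpkg -i harvester_1.1.1.deb
    # Confirm the install location noted above
    ls /usr/share/vivo/harvester/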

Configuration

  • vivo.model.xml: config file in /harvester/example-scripts/example-dsr/
  • Settings must comply with the deploy.properties settings for VIVO
  • /harvester/example-scripts/example-images/jdbcfetch.config.xml
  • A where clause in the query is required for confidential filtering
  • All other example where clauses are optional; they allow subsets of data for testing
  • The only changes necessary to the example file are the connection, username, and password
  • All other settings are part of the Harvester application
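
  A minimal sketch of checking that the two files agree, assuming the install path from the Installation notes. VitroConnection.DataSource.* are the standard VIVO deploy.properties keys; the location of deploy.properties is hypothetical.

    # Connection settings on the Harvester side (path per the notes above)
    grep -i -e "connection" -e "username" -e "password" \
        /usr/share/vivo/harvester/example-scripts/example-dsr/vivo.model.xml
    # Corresponding settings on the VIVO side (deploy.properties location is hypothetical)
    grep "VitroConnection.DataSource" /path/to/vivo/deploy.properties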

DSR Models

  • Views have been created by DSR
  • The views are contracts, project team, and view_vivo
  • Fields: the fields from the views that are to be harvested

Fetch

  • Full fetch takes approximately 3-4 hours
  • Logs are stored in the harvester directory
  • The Harvester does not report or log when data is omitted; data that does not meet the requirements of the query is silently ignored
  • “Dirty” data that meets the query requirements will be harvested; curation at the source is then required
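
  A minimal sketch of launching a full fetch and following its log, assuming the paths above; the script name run-dsr.sh and the logs directory name are hypothetical.

    cd /usr/share/vivo/harvester/example-scripts/example-dsr/
    # Launch the full fetch in the background; expect roughly 3-4 hours
    nohup ./run-dsr.sh > fetch-console.log 2>&1 &
    # Follow the newest log in the harvester directory (directory name is an assumption)
    tail -f "$(ls -t /usr/share/vivo/harvester/logs/*.log | head -n 1)"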

Scoring

  • Scored fields: UFID, Contract Number, Sponsors, Flow-through sponsors

Curation

  • It is assumed that data will be corrected at the source; the fields scored during harvest are those listed under Scoring above.

Update

  • Compares a backup of the previous fetch to the new fetch
  • Removes deleted fields and triples
  • Adds new ones
  • Updates are actually a delete and an add of triples (see the sketch after this list)
  • Duplicates are reduced nearly 100% and are now rare, due to the new Smush feature in v1.1.x
  • If data is curated in VIVO and not at the source, duplicates may occur
  • Rectify: if a duplicate occurs, curators are responsible for manually removing the edited triples
  • Note: in VIVO 1.2, if triples are identical, only one is displayed; if one has been edited, two will display
  • New Harvester feature suggested:
    • Ignore harvesting of records that have been curated in VIVO
    • Should also require curating at the source, not the target
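
  A minimal sketch of the delete-and-add comparison, assuming the previous and new fetches are serialized as N-Triples files. The file names are hypothetical; the Harvester performs the equivalent step internally.

    # Sort both dumps so they can be compared line by line
    sort -o previous-fetch.nt previous-fetch.nt
    sort -o new-fetch.nt new-fetch.nt
    # Triples present only in the previous fetch are deletions;
    # triples present only in the new fetch are additions
    comm -23 previous-fetch.nt new-fetch.nt > triples-to-delete.nt
    comm -13 previous-fetch.nt new-fetch.nt > triples-to-add.nt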

Apply

  • In VIVO 1.2, a re-index of the database is needed
  • New Harvester feature suggested:
    • Upgrade the Harvester to include a function that rebuilds the luceneIndex upon completion of harvest and ingest
  • James Pence is asking Cornell about VIVO functions for luceneIndex re-indexing
  • Without this, manual indexing via the admin interface is required, which is not acceptable

Logging

  • Log files include all details, including stack traces
  • The Harvester has a configuration in scripts.env that allows the console log level to be set to “info”, which captures actions throughout the process
  • This is well suited to emailing the reproducible harvest log to sysadmins
  • We do not want to parse message text in the logs, since wording can change across versions of Java, the Harvester, etc.
  • Suggestion: attach the full log file to each email notification to sysadmins
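
  A minimal sketch of the suggested notification step. The log path and recipient address are hypothetical, and the -a attachment flag assumes a mailx build (e.g. heirloom mailx or s-nail) that supports it.

    # Email the full harvest log as an attachment rather than parsing its text.
    # Log path and address are hypothetical; verify that your mailx supports -a for attachments.
    LOG=/usr/share/vivo/harvester/logs/latest-dsr-harvest.log
    echo "DSR harvest finished on $(hostname) at $(date)" \
        | mailx -s "DSR reproducible harvest log" -a "$LOG" sysadmins@example.ufl.edu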

Suggested Documentation

  • As part of this project, document the errors that can appear in the log file and what they mean, and include this reference in the notification email

Questions

  • Are grants editable in VIVO by users?
  • If we had a “last modified by” field, could we use it to determine whether data was curated at the target?