This was a meeting between the UF developers and UF implementors. This meeting served as training to the local team, and development of documentation for the harvester.
Date: 1/14/2010
Attendees: Alex Rockwell, Christopher Case, Christopher Haines, James Pence, Michael Barbieri, Nicholas Rejack, Nicholas Skaggs, Stephen Williams
Agenda
- Development Toolset
- Eclipse Setup
- Maven setup
- SVN checkout
- Project settings
- Project Layout
- Packages and Classes
- Configuration and Script Files
- Harvester Overview
- 1.0 Modified Features and affect on scripts
- Java Optimizations
- PubMed, PeopleSoft, & DSR Harvests
- Fetch Configuration
- XSLT Overview and Modification Overview
- Script Flow
- What the script is doing (flags explained, etc)
- How to backup in stages and restore from backups
- Logging provided and what to look for
Notes
under tasks
- driver, connection, username and password need to be set.
- The xml also includes which tables are being fetched from.
the project so far is maintained only Linux, though Windows is possible. MACOS
Each module can be ran as a java library or within command line scripts
src/main/java contains the various modules of the harvester
- Score - calculates the scores and puts them in a separate model
- When calling on the algorithms they are assigned weights
- Algorithm - contains the various scoring mechanisms for equality
- Equality Test - for just 100% matches.
- Match - matches with the score data.
- ChangeNamespace - takes all of one namespace and changes them to a new namespace. Suggested for use after scoring and matches, cleans up the remainder
- Jena is a RDF data store.
- JenaConnect – (rdb,sdb,mem versions) an interface to use/expose the jena models. SPARQL querys execution. Outputs to the command line
- Qualify - allows modification of the data. May develop a new way of approaching
- Diff – creates the differences of the jena models using the jena methods.
- Transfer – from Record Handler or model to a jena model (not from a model to record handler)
- the args folder – for dealing with the command line arguments.
- Merge – specify a regex style call to merge records into a single record. Allows cross table references during translation. (ie. Fetch → merge → translate)
- XSLTranslator – uses the Java library to translate the data
- the other translate options need work.
- config/datamaps contains the various XSL files for the data transforms.
- Including regex abilites, separated out into different Templates.
src/main.resources
- LogBack – the logging system which is configuarble in “logback.xml” (Fixing the Timezone? GMT right now.)
- XSLT templates
- creating links with one template and creating a stub for the uri.
- The namespaces used within the harvest is specific to the type of object fetched.
- They use their own regex formats. (Be sure to reference xsl version
Panel
- Pubmed – Parsing out the name is happening using the analyze string
- MESH heading - descriptor and qualify skipped.
- Types are forced to lowercase for identification. (fixes Types)
Commit standard steps:
- 1.update to head
- 2.inspect or resolve any merges.
- 3.Commit the changes.
- Harvester steps:
- 1.Fetch
- (a)the easiest way to inspect the data is a to use a text file record handler
- i.a non text file record handler suggested for production.
- (a)the easiest way to inspect the data is a to use a text file record handler
- 2.Merge
- (a)used in people soft to facilitate data relations
- (b)Not needed for JDBCFetched data
- 3.Translate
- (a)turns data into rdf
- 4.Transfer
- (a)moves data to a Jena model
- 5.(Data can now me mounted and viewed in VIVO but is not suggested)
- (a)not to be used in production
- 1.Fetch
- JenaModels
- Added a type parameter for defining the db format of the back-end jena db
- The default namespace is found here to ensure consistency.
- Drivers
- H2, HSQLDB,MYSQL are in the dependencies.
- Record Handlers
- XML is the version for raw XML
- RDF is the RDF/XML version
Scripts
the scripts env file sets many of the used parameters
- Date format, vivo connection,the various tools locations and optimization flags.
In the PubMed the the algorithms for the work email and name parts are implemented.
In the DSR
Match needs:
- -r to replace the found matches
- -lpred1=pred2 to form a link
- -o returns an output model of the matches.
- -t is the the threshold within which the matches will be considered true.