The PubMed Harvester retrieves data from PubMed through web services and translates it into RDF, which is then loaded into the VIVO model.
The Process
Extracting
Before extracting, decide which data should be pulled from PubMed.
The PubMed site (http://www.ncbi.nlm.nih.gov/pubmed) is useful both for making that decision and for constructing the search text that the Harvester will use.
Example:
1. Go to http://www.ncbi.nlm.nih.gov/pubmed
2. Click on the "Advanced search" link
3. Search for publications that are affiliated with the University of Florida and have a PubMed completion date of June 1, 2011.
- Under the "Search Builder", select "Affiliation" and enter "University of Florida" into the text box. Then click on the "Add to Search Box" button.
- Select "Completion Date", and enter "2011/06/01" into both text boxes, meaning from 2011/06/01 to 2011/06/01. Then click on the "Add to Search Box" button.
- Copy the text from the "Search Box" and paste it into a text editor for later use with the Harvester:
(University of Florida[Affiliation]) AND "2011/06/01"[Completion Date] : "2011/06/01"[Completion Date]
- Click on the "Search" button.
- One should see results displayed on the PubMed site.
4. Use the search text above for the Harvester by replacing the default "termSearch" value in the file /config/tasks/ufl.pubmedfetch.xml. In this example, the file then reads:
<?xml version="1.0" encoding="UTF-8"?>
<Task>
    <Param name="email">swilliams@ichp.ufl.edu</Param>
    <Param name="termSearch">(University of Florida[Affiliation]) AND "2011/06/01"[Completion Date] : "2011/06/01"[Completion Date]</Param>
    <Param name="numRecords">ALL</Param>
    <Param name="batchSize">1000</Param>
</Task>
5. Edit the file /scripts/run-pubmed.sh
Uncomment this line so that the script reads its extraction settings from the task file ufl.pubmedfetch.xml:
$PubmedFetch -X config/tasks/ufl.pubmedfetch.xml -o $H2RH -OdbUrl=$RAWRHDBURL
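The search text and task file from the steps above can also be generated programmatically, which avoids hand-editing the XML when the date range changes. The following is a minimal sketch, not part of the Harvester: the helper names `date_range_term` and `build_task_xml` and the address `user@example.org` are hypothetical, and the only parameters written are the four shown in the example task file.

```python
import xml.etree.ElementTree as ET


def date_range_term(affiliation: str, start: str, end: str) -> str:
    """Build a PubMed search term in the same form as the example above."""
    return (
        f'({affiliation}[Affiliation]) AND '
        f'"{start}"[Completion Date] : "{end}"[Completion Date]'
    )


def build_task_xml(email: str, term: str, num_records: str = "ALL",
                   batch_size: int = 1000) -> str:
    """Serialize a <Task> element holding the four parameters from the example."""
    task = ET.Element("Task")
    for name, value in (
        ("email", email),
        ("termSearch", term),
        ("numRecords", num_records),
        ("batchSize", str(batch_size)),
    ):
        param = ET.SubElement(task, "Param", {"name": name})
        param.text = value  # ElementTree escapes &, < and > in text for us
    body = ET.tostring(task, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical usage: same affiliation and dates as the walkthrough above.
term = date_range_term("University of Florida", "2011/06/01", "2011/06/01")
task_xml = build_task_xml("user@example.org", term)
print(task_xml)
```

The printed text could be saved as the task file referenced by the -X option. Building the XML through a library rather than string concatenation keeps the file well-formed if a search term ever contains characters such as & or <.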
Transforming
The harvested data need to be mapped to the VIVO ontology. Because both the harvested records and the desired RDF/XML are XML, the mapping is done with XSL transformations. Each transformation should state clearly and unambiguously which source element produces which target element.
Translation
PubMed uses the Medline schema for storing citations. Medline is the Medical Literature Analysis and Retrieval System Online (MEDLARS Online). This page shows in detail which attributes from PubMed are transformed into which elements in VIVO's schema.
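To make the idea of the mapping concrete, here is a small sketch of turning one Medline-style citation into RDF/XML. It uses Python's ElementTree in place of the XSL transformation the Harvester actually applies, and it is illustrative only: the sample record, the URI pattern `pmid<PMID>`, the base URI, and the choice of `rdfs:label` and `bibo:pmid` as target properties are all assumptions, not the Harvester's real mapping.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"
BIBO_NS = "http://purl.org/ontology/bibo/"

# Use readable prefixes when the tree is serialized.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("rdfs", RDFS_NS)
ET.register_namespace("bibo", BIBO_NS)

# A made-up record that follows the general shape of PubMed XML.
SAMPLE = """
<PubmedArticle>
  <MedlineCitation>
    <PMID>12345678</PMID>
    <Article>
      <ArticleTitle>An illustrative title</ArticleTitle>
    </Article>
  </MedlineCitation>
</PubmedArticle>
"""


def citation_to_rdf(xml_text: str, base_uri: str) -> ET.Element:
    """Map the PMID and title of one citation to an rdf:Description."""
    citation = ET.fromstring(xml_text).find("MedlineCitation")
    pmid = citation.findtext("PMID")
    title = citation.findtext("Article/ArticleTitle")

    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        rdf,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": f"{base_uri}pmid{pmid}"},  # hypothetical URI pattern
    )
    ET.SubElement(description, f"{{{RDFS_NS}}}label").text = title
    ET.SubElement(description, f"{{{BIBO_NS}}}pmid").text = pmid
    return rdf


rdf_tree = citation_to_rdf(SAMPLE, "http://vivo.example.org/individual/")
print(ET.tostring(rdf_tree, encoding="unicode"))
```

An XSL stylesheet expresses the same source-to-target pairs declaratively, one template per Medline element, which is why the attribute-by-attribute mapping page is the authoritative reference.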
Scoring
Visit this page for the general scoring methodology used by the Harvester. By default, the PubMed Harvester scores with two algorithms: EqualityTest and NormalizedLevenshteinDifference.
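The two kinds of comparison named above can be illustrated as follows. This is a sketch of the general techniques (exact equality and a Levenshtein edit distance scaled to the range 0 to 1), not the Harvester's own classes. In particular, dividing by the length of the longer string is an assumed normalization, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, substitutions cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def normalized_levenshtein_score(a: str, b: str) -> float:
    """Return 1.0 for identical strings, falling toward 0.0 as they diverge."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


def equality_score(a: str, b: str) -> float:
    """Return 1.0 only on an exact match, otherwise 0.0."""
    return 1.0 if a == b else 0.0


print(equality_score("Smith", "Smith"))
print(normalized_levenshtein_score("Jonathan", "Jonathon"))
```

An equality test suits identifiers and fields that must match exactly, while an edit-distance score tolerates small spelling differences such as those found in author names.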
Matching
Visit this page for general matching methodology used by the Harvester.
Changing namespace
To move unmatched authors into the current namespace, modify the file /scripts/run-pubmed.sh.
Uncomment this line so that ChangeNamespace is executed for the unmatched authors:
$ChangeNamespace $CNFLAGS -u ${BASEURI}author/
Comment out this line:
$Qualify $MATCHEDINPUT -n ${BASEURI}author/ -c
Executing
Edit the file /config/models/vivo.xml, and modify "dbUrl", "dbUser", "dbPass", and "namespace" for your specific database settings and VIVO namespace.
Run /scripts/run-pubmed.sh to execute the process.