For the previous version of this page, covering Harvester 1.1.1 and earlier, see Pubmed Example Script (1.1.1)

Your First Harvest

Since PubMed is a national data source, much of the work has already been done for your harvest. No translation file needs to be created, nor does the workflow need to be created; the harvester team has completed these steps for you. However, you will need to do some configuration so the harvest knows where your VIVO data is and which PubMed records you wish to ingest.

  • Change directory to example-scripts/bash-scripts/full-harvest-examples/example-pubmed
  • Edit the pubmedfetch.config.xml file (see the sketch after this list)
    • Set the email parameter to your email address
    • Set the termSearch parameter to your search. The search term uses the same syntax as a search on pubmed.org
    • For more information on these parameters and their use, please see PubmedFetch
  • Edit the vivo.model.xml file and set the connection parameters so they point to your VIVO database (see the sketch after this list)
  • Edit the changenamespace-authors.config.xml, changenamespace-authorship.config.xml, changenamespace-journal.config.xml, and changenamespace-publication.config.xml files and set the namespace parameter in each one to your VIVO namespace (see the sketch after this list)
    • For more information on these parameters and their use, please see ChangeNamespace
  • Edit the run-pubmed.sh file and set HARVESTER_INSTALL_DIR to the directory where you unpacked the harvester
  • Run bash run-pubmed.sh
  • Restart Tomcat and Apache. You may also need to force the search index to rebuild to see the new data. The index can be rebuilt by visiting the following URL in a browser: http://your.vivo.address/vivo/SearchIndex. This requires site admin permission, and you will be prompted to log in if you are not already.
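
As a reference, the relevant portion of pubmedfetch.config.xml looks roughly like the sketch below. The email and termSearch values are placeholders to replace with your own; the surrounding elements and any additional parameters in your copy of the file may differ by harvester version, so defer to the file as shipped.

    <Task>
            <Param name="email">your.email@your.institution.edu</Param>
            <Param name="termSearch">your search terms here</Param>
    </Task>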
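
The vivo.model.xml file is a Jena connection configuration that tells the harvester how to reach your VIVO database. The sketch below assumes a MySQL-backed VIVO; the parameter names and values shown are illustrative and may vary between harvester versions, so match them against the file as shipped and change only the values for your setup.

    <Model>
            <Param name="type">sdb</Param>
            <Param name="dbClass">com.mysql.jdbc.Driver</Param>
            <Param name="dbType">MySQL</Param>
            <Param name="dbUrl">jdbc:mysql://localhost/vitrodb</Param>
            <Param name="dbUser">vitrodbUsername</Param>
            <Param name="dbPass">vitrodbPassword</Param>
    </Model>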
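
In each of the changenamespace-*.config.xml files, the edit is the same single line: point the namespace parameter at your VIVO namespace. A one-line sketch, assuming a VIVO served at vivo.yourdomain.edu:

    <Param name="namespace">http://vivo.yourdomain.edu/individual/</Param>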

The First Run

Three folders will be created:

  • logs
  • data
  • previous-harvest

The logs folder contains the log from the run, the data folder contains the data from each run, and the previous-harvest folder contains the old harvest data used by the update process at the end of the script. While you're testing, I would recommend treating each run as the first run (so that no update logic occurs). You can do this by removing the previous-harvest folder before running again, as shown below.
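
For example, from the example-pubmed directory:

    # Discard the saved harvest state so the next run is treated as a first run
    rm -rf previous-harvest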

Inside the data folder, you will find the raw records used during the ingest. To see which RDF statements went into VIVO, view the vivo-additions.rdf.xml file. Conversely, to see what the harvester removed (because of updated data), view the vivo-subtractions.rdf.xml file. This file will be empty on your first run, since there is no previous harvest to compare the incoming data against.

Follow-up Runs and Queries

Please treat each separate query as a separate script. This ensures the update process performs proper comparisons and you won't get unexpected or undesirable results.

If you're running the script over and over but changing the termSearch parameter in pubmedfetch.config.xml, the update process can produce undesirable results. If you run the script with a previous-harvest model from an old query, the script will attempt to run an update for the old query using the new query's data. This will cause some data to be added or removed incorrectly, as the comparison should only happen within the same input set.

It is recommended to execute a single harvest and then run remove-last-pubmed-harvest.sh to remove the RDF each time you run a test harvest, until you're satisfied with the results. If you then want to change the query to harvest more data, make a duplicate copy of the example-pubmed folder and run the script from there (be sure to remove the previous-harvest folder before your first run), as sketched below.
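
A sketch of that duplication, assuming you start from the full-harvest-examples directory (the copy's name is arbitrary):

    cp -r example-pubmed example-pubmed-new-query
    cd example-pubmed-new-query
    rm -rf previous-harvest    # start the new query with a clean slate
    bash run-pubmed.sh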

Optimizing

Once you are ready to run a large dataset, it is advisable to switch the record storage from files to a database. Although this will make it harder to find individual records, speed and performance will be improved during the fetch and translate stages. To do so:

  • Edit the raw-records.config.xml file to use TDB, which is a semantic data store

    <RecordHandler>
            <Param name="rhClass">org.vivoweb.harvester.util.repo.JenaRecordHandler</Param>
            <Param name="type">tdb</Param>
            <Param name="dbDir">data/raw-records</Param>
    </RecordHandler>
    
  • Edit the translated-records.config.xml to use TDB, which is a semantic data store

    <RecordHandler>
            <Param name="rhClass">org.vivoweb.harvester.util.repo.JenaRecordHandler</Param>
            <Param name="type">tdb</Param>
            <Param name="dbDir">data/translated-records</Param>
    </RecordHandler>
    