  • Analysis of ISBNs aggregated under the same Opus
    • Our first pass at a single query took too long, so we broke the process up into separate steps.
      • First, we queried the PCC data using Dave's Fuseki server (or a copy of the data on our own Fuseki server) to retrieve a list of all Opera that have at least two works whose instances carry ISBNs.  The query we used is captured here.  Executing this query resulted in the following list of Opera URIs.  A sketch of this kind of query appears after this list.
      • This script takes the list of Opera URIs and executes SPARQL queries to retrieve the ISBNs of any instances that correspond to the different work URIs aggregated by each Opus.  Running the script produces a file where each line holds the sets of ISBNs corresponding to one opus.  (Note that the Fuseki SPARQL URL is not included in the script, so running it requires replacing that part of the code with the Fuseki SPARQL URL you wish to query.)  A sketch of this grouping step also appears after this list.
      • Another script then takes the file with the ISBN groups and checks which of these groups result in at least two catalog matches.  The script outputs the ISBN groups that result in matches along with a summary (i.e. total number of rows, total number of matches, etc.), the ISBN groups that resulted in only one match, and those that did not result in a match at all.  The output is captured here.  A sketch of this matching check likewise follows the list.
  • Analysis of LCCNs aggregated under the same Opus
    • Similar to our ISBN analysis, we first queried the PCC data to generate a list of all Opera that have at least two works whose instances carry LCCNs.  The query is captured here and the results here.
    • The same script used above to execute SPARQL queries is also used to query this list of Opera and retrieve the LCCNs grouped under each opus.  The line used for LCCNs is commented out at the bottom of the code.  For LCCNs, this script outputs the following file, where each line has a set of LCCNs grouped under the same Opus.
    • This script analyzes the LCCN groups to see which have more than one catalog match and lists the groups that resulted in a match along with a summary of the total rows processed and the number of matches.  The output file is here; it also contains the rows that resulted in only one catalog match and those that did not result in any matches.
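The sketch below illustrates the first step (retrieving Opera with at least two works whose instances have ISBNs). It is a minimal example, not the query captured above: it assumes standard BIBFRAME modeling (bf:hasInstance, bf:identifiedBy with bf:Isbn) and a bf:expressionOf link from Work to Opus, and the Fuseki endpoint URL is a placeholder. The predicates in the actual PCC data may differ.

```python
import requests

FUSEKI_SPARQL_URL = "http://localhost:3030/pcc/sparql"  # placeholder; point at the Fuseki endpoint you wish to query

# Assumed BIBFRAME predicates; the real PCC data model may use different links.
QUERY = """
PREFIX bf:  <http://id.loc.gov/ontologies/bibframe/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?opus (COUNT(DISTINCT ?work) AS ?workCount)
WHERE {
  ?work bf:expressionOf ?opus ;      # assumed Work -> Opus link
        bf:hasInstance  ?instance .
  ?instance bf:identifiedBy ?id .
  ?id rdf:type bf:Isbn .
}
GROUP BY ?opus
HAVING (COUNT(DISTINCT ?work) >= 2)
"""

def opera_with_isbn_instances(endpoint=FUSEKI_SPARQL_URL):
    """Return URIs of Opera aggregating at least two works whose instances carry ISBNs."""
    resp = requests.post(
        endpoint,
        data={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
        timeout=300,
    )
    resp.raise_for_status()
    return [b["opus"]["value"] for b in resp.json()["results"]["bindings"]]

if __name__ == "__main__":
    for uri in opera_with_isbn_instances():
        print(uri)
```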
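The grouping step might look like the following sketch, under the same assumed predicates. The real script's query, file names, and output format are not reproduced here, and the Fuseki SPARQL URL must again be supplied.

```python
import requests

FUSEKI_SPARQL_URL = "http://localhost:3030/pcc/sparql"  # placeholder endpoint

# Assumed predicates, as in the previous sketch.
ISBN_QUERY = """
PREFIX bf:  <http://id.loc.gov/ontologies/bibframe/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?work ?isbn WHERE {{
  ?work bf:expressionOf <{opus}> ;
        bf:hasInstance  ?instance .
  ?instance bf:identifiedBy ?id .
  ?id rdf:type bf:Isbn ;
      rdf:value ?isbn .
}}
"""

def isbns_by_work(opus_uri):
    """Group ISBNs by the work URI aggregated under this Opus."""
    resp = requests.post(
        FUSEKI_SPARQL_URL,
        data={"query": ISBN_QUERY.format(opus=opus_uri)},
        headers={"Accept": "application/sparql-results+json"},
        timeout=300,
    )
    resp.raise_for_status()
    groups = {}
    for b in resp.json()["results"]["bindings"]:
        groups.setdefault(b["work"]["value"], set()).add(b["isbn"]["value"])
    return groups

def main(opera_file="opera_uris.txt", out_file="opus_isbn_groups.txt"):
    """Write one line per Opus: the ISBN sets for each of its aggregated works."""
    with open(opera_file) as fh, open(out_file, "w") as out:
        for opus_uri in (line.strip() for line in fh if line.strip()):
            sets = [",".join(sorted(isbns)) for isbns in isbns_by_work(opus_uri).values()]
            out.write(f"{opus_uri}\t{' | '.join(sets)}\n")

if __name__ == "__main__":
    main()
```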
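The catalog-match check could be sketched as below, assuming the Cornell catalog is reachable through a Solr select endpoint with a hypothetical isbn_display field; the actual script's catalog access method, thresholds, and output layout are not shown in this page.

```python
import requests

CORNELL_SOLR_URL = "http://localhost:8983/solr/catalog/select"  # hypothetical Solr endpoint
ISBN_FIELD = "isbn_display"                                     # assumed field name

def catalog_matches(isbns):
    """Count the catalog records matching any ISBN in the group."""
    query = " OR ".join(f'{ISBN_FIELD}:"{isbn}"' for isbn in isbns)
    resp = requests.get(CORNELL_SOLR_URL, params={"q": query, "rows": 0, "wt": "json"}, timeout=60)
    resp.raise_for_status()
    return resp.json()["response"]["numFound"]

def main(groups_file="opus_isbn_groups.txt"):
    multi, single, none = [], [], []
    with open(groups_file) as fh:
        for line in fh:
            opus, _, sets = line.rstrip("\n").partition("\t")
            isbns = {i for part in sets.split(" | ") for i in part.split(",") if i}
            count = catalog_matches(isbns)
            (multi if count >= 2 else single if count == 1 else none).append((opus, count))
    # Summary comparable to the one described above: total rows and match counts.
    print(f"Total rows: {len(multi) + len(single) + len(none)}")
    print(f"Rows with 2+ matches: {len(multi)}, one match: {len(single)}, no match: {len(none)}")

if __name__ == "__main__":
    main()
```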

POD data analysis

  • We wanted to analyze the POD (Platform for Open Data) transformation provided by Jim Hahn (University of Pennsylvania) to see if we could retrieve matches for our set of ISBNs that fell under the same LOC Hubs and that only had a single match in the Cornell catalog.  This transformation provided sets of CSV files per institution, where the headers represented MARC fields and the rows contained values per MARC record mapped to those fields.
    • This file contains the list of ISBN sets that resulted in a single match in the Cornell catalog.
    • This script reads in this file and compiles the ISBNs that occur in it.  The script then reads in the transformed CSVs, which contain POD data mapped to MARC fields and values.  If any of the target ISBNs are represented within an institution's transformed data, the script outputs the transformed rows that match.  The results of this script are included here, with matching rows for Brown, Chicago, Columbia, Dartmouth, Duke, Harvard, Penn, and Stanford.  A sketch of this scan appears after this list.
      • A separate script retrieves the Cornell catalog record information for matching ISBNs, resulting in this file which lists the original set of ISBNs we were querying against, followed by the catalog id and title of the record, followed by the ISBNs for that item captured in the record itself.
    • Using the results from the previous step, this script reads in the information for MARC records matching the original set of ISBNs we are querying against and uploads it to a Solr index we set up specifically to store and search across these records.  We also add the institution to each record to specify where the data comes from.  If records are not added due to insufficient information, the output identifies them; in this case, three records from Brown did not have 001 fields and were not added to our Solr index.  A sketch of this indexing step appears after this list.
    • This script uses the original file with ISBN sets that result in a single Cornell catalog match and queries both the LD4P3 copy of the Cornell Solr index and the POD index set up for this analysis to find which catalog records across these institutions match these ISBN sets.  The output is an HTML page available here.  A sketch of this comparison also appears after this list.
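A minimal sketch of the POD scan, assuming one directory of transformed CSVs per institution and a column headed "020" holding ISBNs; the actual file layout, column names, and ISBN parsing used by the script may differ.

```python
import csv
import glob
import os
import re

SINGLE_MATCH_FILE = "single_match_isbn_sets.txt"   # hypothetical name for the single-match file
POD_CSV_GLOB = "pod_transformed/*/*.csv"           # hypothetical layout: one directory per institution
ISBN_COLUMN = "020"                                # assumed header for the MARC ISBN field

def load_target_isbns(path=SINGLE_MATCH_FILE):
    """Collect every ISBN appearing in the single-catalog-match file."""
    isbns = set()
    with open(path) as fh:
        for line in fh:
            isbns.update(re.findall(r"[0-9Xx]{10,13}", line))
    return isbns

def matching_rows(targets):
    """Yield (institution, row) for every POD row whose ISBN column hits a target ISBN."""
    for csv_path in glob.glob(POD_CSV_GLOB):
        institution = os.path.basename(os.path.dirname(csv_path))
        with open(csv_path, newline="") as fh:
            for row in csv.DictReader(fh):
                row_isbns = set(re.findall(r"[0-9Xx]{10,13}", row.get(ISBN_COLUMN, "") or ""))
                if row_isbns & targets:
                    yield institution, row

if __name__ == "__main__":
    targets = load_target_isbns()
    with open("pod_matching_rows.csv", "w", newline="") as out:
        writer = None
        for institution, row in matching_rows(targets):
            if writer is None:
                writer = csv.DictWriter(
                    out, fieldnames=["institution", *row.keys()], extrasaction="ignore"
                )
                writer.writeheader()
            writer.writerow({"institution": institution, **row})
```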
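The indexing step might be sketched as follows, assuming a Solr core set up for this analysis and reached through Solr's JSON update handler; the core name, document ids, and field handling are illustrative only.

```python
import requests

POD_SOLR_UPDATE_URL = "http://localhost:8983/solr/pod/update"  # hypothetical Solr core for this analysis

def add_records(rows, institution):
    """Index POD rows into Solr, tagging each with its institution; skip rows lacking an 001."""
    docs, skipped = [], []
    for row in rows:
        control_number = (row.get("001") or "").strip()
        if not control_number:
            skipped.append(row)  # e.g. the three Brown records that lacked 001 fields
            continue
        doc = {"id": f"{institution}:{control_number}", "institution": institution}
        # Keep the MARC-field columns as Solr fields (dynamic/permissive schema assumed).
        doc.update({k: v for k, v in row.items() if v})
        docs.append(doc)
    if docs:
        resp = requests.post(
            POD_SOLR_UPDATE_URL,
            params={"commit": "true"},
            json=docs,
            timeout=120,
        )
        resp.raise_for_status()
    return docs, skipped
```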
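Finally, a sketch of the cross-index comparison, assuming Solr select endpoints for both indexes and hypothetical field names; the real script's queries and HTML layout are not reproduced here.

```python
import html
import requests

CORNELL_SOLR_URL = "http://localhost:8983/solr/catalog/select"  # hypothetical LD4P3 copy of the Cornell index
POD_SOLR_URL = "http://localhost:8983/solr/pod/select"          # the index populated in the previous sketch
ISBN_FIELDS = {"cornell": "isbn_display", "pod": "020"}         # assumed field names

def search(url, field, isbns, rows=20):
    """Return Solr documents matching any ISBN in the set."""
    query = " OR ".join(f'{field}:"{isbn}"' for isbn in isbns)
    resp = requests.get(url, params={"q": query, "rows": rows, "wt": "json"}, timeout=60)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

def report(isbn_sets, out_path="pod_comparison.html"):
    """Write an HTML table pairing each ISBN set with its Cornell and POD matches."""
    rows = []
    for isbns in isbn_sets:
        cornell = search(CORNELL_SOLR_URL, ISBN_FIELDS["cornell"], isbns)
        pod = search(POD_SOLR_URL, ISBN_FIELDS["pod"], isbns)
        rows.append(
            "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(
                html.escape(", ".join(sorted(isbns))),
                "<br>".join(html.escape(str(d.get("title", d.get("id")))) for d in cornell),
                "<br>".join(
                    html.escape(f"{d.get('institution')}: {d.get('245', d.get('id'))}") for d in pod
                ),
            )
        )
    with open(out_path, "w") as out:
        out.write("<table><tr><th>ISBN set</th><th>Cornell</th><th>POD</th></tr>")
        out.writelines(rows)
        out.write("</table>")
```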

Fuseki UI