BANG! Scripts

We used various scripts to analyze different data sources.

LOC Hub Analysis

We used client-side AJAX queries to retrieve the first 10,000 hubs from LOC and then navigated to related works and instances to determine how many LOC Hubs provide two or more instances with ISBNs or LCCNs.
- https://github.com/LD4P/blacklight-cornell/blob/bang/app/assets/javascripts/bang/evalHub.js
  - This code looks at how many hub-to-hub relationships provide LCCNs or ISBNs. To avoid throttling issues, the code queried 500 hubs at a time. Each time we brought up the page that ran the code, we set the starting hub number, effectively paging through the first 10,000 hubs returned from LOC.
  - Hubs were retrieved using this call: "https://id.loc.gov/search/?q=cs:http://id.loc.gov/resources/hubs&count=" + this.sampleSize + start + "&format=json", where the sample size and starting hub number could be specified.
  - Related view: https://github.com/LD4P/blacklight-cornell/blob/bang/app/views/bang/eval_hubs/index.erb
- https://github.com/LD4P/blacklight-cornell/blob/bang/app/assets/javascripts/bang/evalHubAggregation.js
  - This code retrieves unique ISBNs for every hub that has more than one work. It uses the same sample size and paging approach as the code above.
  - Related view: https://github.com/LD4P/blacklight-cornell/blob/bang/app/views/bang/eval_hubs/same_hub.erb
- https://github.com/LD4P/blacklight-cornell/blob/bang/app/controllers/bang/eval_hubs_controller.rb
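
The paging approach described above can be sketched as follows. This is a minimal Ruby illustration, not the JavaScript in evalHub.js. It assumes that the `start` fragment in the original call expands to a `&start=N` query parameter and that offsets begin at 1; the helper names `hub_page_url` and `page_starts` are hypothetical.

```ruby
LOC_SEARCH = "https://id.loc.gov/search/"
HUB_QUERY  = "cs:http://id.loc.gov/resources/hubs"

# Build the search URL for one page of hubs.
# Assumption: the starting hub number is passed as a `start` query parameter.
def hub_page_url(sample_size, start)
  "#{LOC_SEARCH}?q=#{HUB_QUERY}&count=#{sample_size}&start=#{start}&format=json"
end

# List the starting offsets needed to cover `total` hubs in pages of `sample_size`.
# Assumption: offsets are 1-based.
def page_starts(total, sample_size)
  (1..total).step(sample_size).to_a
end

# Covering the first 10,000 hubs at 500 hubs per request.
urls = page_starts(10_000, 500).map { |start| hub_page_url(500, start) }
puts urls.length
puts urls.first
```

Each URL would be requested in a separate run, which is what setting the starting hub number by hand on the page accomplished.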
We wrote scripts to further analyze these groupings of hubs to see how many catalog matches we could get.
- LCCN analysis
  - Finding catalog matches for LCCN sets grouped under LOC Hubs
    - This file (HubSetsLccn.csv) lists an LOC Hub on each line followed by a list of LCCNs from instances that fall under that hub.
    - A script (processlccn.rb) reads in this file and generates the file (lccnhubonlyfirst), which lists the LCCN rows that matched at least two catalog items and ends with a summary. (The output says "ISBN" where it means "LCCN" because the code was copied from the ISBN analysis.)
  - Finding catalog matches for LCCN sets grouped under LOC Hub to Hub relationships
    - Each line in the file (prophublccnsets.csv) lists the name of the relationship (e.g. "hasTranslation") that links two different hubs, followed by the LCCNs that fall under those hubs.
    - A script (processrellccn.rb) reads in this file and then generates the file (lcchubrels) which starts with a list of the property and LCCN groups that resulted in at least two catalog matches (e.g. "hasTranslation : 2017328875,92911176,93910013") followed by a summary of the total number of rows and LCCNs in the original file and the number of matching rows/LCCNs. In addition, the file also lists those hub relationship and LCCN groupings from the original CSV file that resulted in exactly one match in the catalog. This latter piece of information was used for our POD analysis.
- ISBN analysis
  - Finding catalog matches for ISBN sets grouped under LOC Hubs
    - The script (processcsv.rb) analyzes the file (HubSets.csv), which lists LOC Hubs with the groups of ISBNs that fall under each hub, and generates a file (tenthousandresults). This resulting file first lists the sets of ISBNs from the original CSV where each set has at least two catalog matches. The file ends with a summary of the total number of rows processed from the original file and the rows that matched at least two catalog items (i.e. ISBN sets).
  - Finding catalog matches for ISBN sets grouped under LOC Hub to Hub relationships
    - The script (processrelcsv.rb) analyzes the file (prophubsets.csv). Each line of this CSV file lists the property connecting two LOC Hubs, followed by the ISBNs that fall under the two hubs related by this property. The analysis results in the file (updateHubRelResults), which lists the relationship and ISBN groups that result in at least two catalog matches. The file ends with a summary of the total rows processed from the original file and the number of rows that resulted in at least two catalog matches.
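
The four scripts above share one pattern: read a line, treat the first field as a label (a hub or a relationship name) and the rest as identifiers, and keep the line if at least two identifiers match the catalog. The sketch below illustrates that pattern only and is not the code in processcsv.rb or its siblings. The catalog lookup is a hypothetical callable, the sample rows are made up, and the line parsing assumes plain comma-separated fields with no quoting.

```ruby
# Count rows whose identifiers hit at least `min_matches` catalog records.
# `in_catalog` is a hypothetical callable standing in for the real catalog lookup.
def summarize_sets(lines, in_catalog, min_matches: 2)
  matched = []
  total = 0
  lines.each do |line|
    # First field is the hub or relationship label, the rest are ISBNs or LCCNs.
    label, *ids = line.strip.split(",").map(&:strip)
    next if label.nil? || label.empty?
    total += 1
    hits = ids.reject(&:empty?).uniq.select { |id| in_catalog.call(id) }
    matched << [label, hits] if hits.length >= min_matches
  end
  { total_rows: total, matched_rows: matched }
end

# Toy stand-in for the catalog; the real scripts query the library catalog.
catalog = ["9780000000001", "9780000000002"]
lookup = ->(id) { catalog.include?(id) }

# Made-up rows in the assumed "label,id,id,..." layout.
sample = [
  "hasTranslation,9780000000001,9780000000002,9780000000003",
  "hasTranslation,9780000000001"
]

result = summarize_sets(sample, lookup)
result[:matched_rows].each { |label, ids| puts "#{label} : #{ids.join(',')}" }
puts "rows: #{result[:total_rows]}, matching rows: #{result[:matched_rows].length}"
```

Calling the same function with `min_matches: 1` and filtering for exactly one hit would give the single-match rows that the LCCN relationship output also records.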

PCC data analysis

Analysis of ISBNs aggregated under the same Opus
- Our first pass at the queries took too long, so we broke the process up into separate steps.
  - First, we queried the PCC data using Dave's Fuseki server (or a copy of the data on our own Fuseki server) to retrieve a list of all Opera that had at least two works with instances with ISBNs. The query we used is captured here. Executing this query resulted in the following list of Opera URIs.
  - This script takes the list of Opera URIs and executes SPARQL queries to retrieve the ISBNs of any instances that correspond to different work URIs aggregated by that Opus. Running the script results in a file where each line has the set of ISBNs corresponding to an Opus. (Note that the Fuseki SPARQL URL is not included in the script, so running it requires replacing that part of the code with the Fuseki SPARQL URL you wish to query.)
  - Another script then takes the file with the ISBN groups and checks which of these groups result in at least two catalog matches. The script outputs the ISBN groups that result in matches, along with a summary (i.e. total number of rows, total number of matches, etc.) and a list of ISBN groups that didn't result in a match. We captured the part of the output that lists the matching ISBN groups here.
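
The second step above, turning SPARQL results into one line of ISBNs per Opus, can be sketched as follows. This is an illustration under stated assumptions rather than the script we ran: it takes a response in the standard SPARQL JSON results format and assumes the query selects variables named `opus`, `work`, and `isbn`. The function name `isbn_lines` and the sample URIs are hypothetical.

```ruby
require "json"
require "set"

# Group ISBNs by opus and keep only opera whose ISBNs come from
# at least two distinct works.
# Assumption: the SELECT clause binds variables named opus, work, and isbn.
def isbn_lines(sparql_json)
  bindings = JSON.parse(sparql_json).dig("results", "bindings") || []
  by_opus = Hash.new { |hash, key| hash[key] = { works: Set.new, isbns: Set.new } }
  bindings.each do |binding|
    entry = by_opus[binding["opus"]["value"]]
    entry[:works] << binding["work"]["value"]
    entry[:isbns] << binding["isbn"]["value"]
  end
  by_opus.select { |_, entry| entry[:works].size >= 2 }
         .map { |_, entry| entry[:isbns].to_a.join(",") }
end

# Made-up response: opus/1 has two works, opus/2 has only one.
row = lambda do |opus, work, isbn|
  {
    "opus" => { "type" => "uri", "value" => "http://example.org/opus/#{opus}" },
    "work" => { "type" => "uri", "value" => "http://example.org/work/#{work}" },
    "isbn" => { "type" => "literal", "value" => isbn }
  }
end
sample = JSON.generate(
  "head" => { "vars" => %w[opus work isbn] },
  "results" => { "bindings" => [row.call(1, "a", "111"), row.call(1, "b", "222"), row.call(2, "c", "333")] }
)

puts isbn_lines(sample)
```

The resulting lines have the same shape as the input to the catalog-matching step, so the two steps can be run and rerun separately.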

POD data analysis

Fuseki UI
