To support the subject view in the prototype, which included information from LCSH, Wikidata, and PeriodO, we built a Solr index to enable faster access to this information. During LD4P3, we used the Fuseki server provided by the University of Iowa (thanks Dave Eichmann!), which contains the LCSH dataset, to retrieve LCSH information. The server has 488,367 LCSH headings. By querying the server, we obtained a TSV file with the URI and label of each heading and then created an index from this information. Possibly due to errors in the script, our final index has a total of 421,220 entries, where each Solr record represents a different LCSH URI.
We first provide an overview of the number of records in our index that contain fields from each data source, and then give examples of each of these types of data.
Total number of records
Records with components (i.e. subheadings of any kind)
Records with LCSH geographic components
Records with LCSH temporal components (i.e. either a starting date or stopping date retrieved from LCSH)
Records with either a PeriodO start or stop date
Records with PeriodO spatial information
Records with Wikidata URIs for the subject heading
Records with Wikidata URIs for LCSH geographic components in the subject heading
Minimal entry: LCSH URI and Label
Here is the JSON for a minimal LCSH entry in this index:
Our script copied the label into two fields: “label_s”, which allows for exact string matching, and “label_t”, which allows for partial text matching against the label. The URI is stored in the “uri_s” field.
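As a rough illustration of this mapping, the Python sketch below turns rows of a URI/label TSV into minimal Solr documents. The function names and the column order (URI first, then label) are assumptions made for illustration, not our actual script. The sample row reuses a heading and URI mentioned elsewhere in this post.

```python
import csv
import io


def row_to_minimal_doc(uri, label):
    """Build a minimal Solr document from one URI/label pair (hypothetical helper)."""
    return {
        "uri_s": uri,        # LCSH URI
        "label_s": label,    # string field for exact matching
        "label_t": [label],  # text field for partial matching
    }


def docs_from_tsv(tsv_text):
    """Parse TSV text whose columns are assumed to be: URI, label."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    docs = []
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        docs.append(row_to_minimal_doc(row[0].strip(), row[1].strip()))
    return docs


sample = "http://id.loc.gov/authorities/subjects/sh85018171\tBurr Conspiracy, 1805-1807\n"
minimal_docs = docs_from_tsv(sample)
```

Skipping short rows rather than failing on them is a design choice of this sketch; a script that silently drops rows in this way would be one possible source of a gap between the number of headings on the server and the number of index entries.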
LCSH Geographic Components
To this index, we also added component information for subject headings, identifying geographic and temporal components separately. We retrieved this information by executing additional queries against the Fuseki server to identify the URIs of components within subject headings. Below is an example of a subject heading, “Cathedrals–Europe”, that has two subdivisions or components: “Cathedrals” and “Europe”. URIs and labels for both are stored in the “components_json_s” field. Because “Europe” is a geographic component (i.e. identified as being of Geographic type), we have also extracted its URI into the “geo_uri_ss” field and its label into the “geo_label_ss” field.
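The following Python sketch shows one way the component fields described above could be derived from a list of components. The shape of the component dictionaries, the "Topic" type label, and the function name are assumptions for illustration. The URI for “Cathedrals” is a placeholder, since only the “Europe” URI appears in this post.

```python
import json

GEOGRAPHIC = "Geographic"  # assumed label for the geographic component type


def component_fields(components):
    """Return Solr fields for a heading's components.

    Each component is assumed to be a dict with "uri", "label", and "type" keys.
    """
    geographic = [c for c in components if c["type"] == GEOGRAPHIC]
    return {
        # All components, serialized as a JSON string
        "components_json_s": json.dumps(components, ensure_ascii=False),
        # Geographic components only, as parallel multi-valued fields
        "geo_uri_ss": [c["uri"] for c in geographic],
        "geo_label_ss": [c["label"] for c in geographic],
    }


components = [
    # Placeholder URI: the real LCSH URI for "Cathedrals" is not given here
    {"uri": "http://example.org/placeholder/cathedrals", "label": "Cathedrals", "type": "Topic"},
    {"uri": "http://id.loc.gov/authorities/subjects/sh85045631", "label": "Europe", "type": GEOGRAPHIC},
]
fields = component_fields(components)
```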
LCSH Temporal Components
Similar to our process for retrieving geographic information, we also extracted temporal components from LCSH headings to include in the index. We identified the labels representing centuries and hardcoded a list mapping those labels to particular start and end dates. For example, we mapped the LCSH URI <http://id.loc.gov/authorities/subjects/sh2002012476>, which has the label “20th century”, to a start date of 1900 and an end date of 1999. The example below shows the temporal component included in the “components_json_s” field. The start and end years are represented in the “temp_start_i” and “temp_stop_i” fields respectively.
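A minimal sketch of this kind of century mapping is shown below. Only the “20th century” pair comes from the description above. The fallback rule that derives years from other century labels is an assumption extrapolated from that one example, and the function names are hypothetical.

```python
import re

# Hardcoded mapping; only this entry is taken from the text above.
CENTURY_YEARS = {"20th century": (1900, 1999)}

_CENTURY_PATTERN = re.compile(r"^(\d{1,2})(?:st|nd|rd|th) century$")


def century_to_years(label):
    """Return (start, stop) years for a century label, or None if unrecognized."""
    if label in CENTURY_YEARS:
        return CENTURY_YEARS[label]
    match = _CENTURY_PATTERN.match(label)
    if match is None:
        return None
    n = int(match.group(1))
    # Assumed convention: the nth century runs from (n-1)*100 to n*100 - 1
    return ((n - 1) * 100, n * 100 - 1)


def temporal_fields(label):
    """Return the temporal Solr fields for a component label (empty if none apply)."""
    years = century_to_years(label)
    if years is None:
        return {}
    return {"temp_start_i": years[0], "temp_stop_i": years[1]}
```

For instance, temporal_fields("20th century") yields a start of 1900 and a stop of 1999, while a non-century label such as "Cathedrals" yields no temporal fields.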
As we noted, we used the LCSH information mapped in PeriodO to retrieve temporal and geographic information for subject headings, along with the corresponding Wikidata URIs for geographic places. We downloaded this dataset and then ran scripts to parse the data and add information to the index.
In the example below, “Burr Conspiracy, 1805-1807” maps to a PeriodO entity with the identifier "p06c6g399wq". This identifier can be used to generate the associated PeriodO link http://n2t.net/ark:/99152/p06c6g399wq. The PeriodO data gives us a start and stop date, which we store in the “periodo_start_i” and “periodo_stop_i” fields respectively. PeriodO also gives us a Wikidata URI and label for the spatial coverage of this subject heading, which we save in the “spatial_coverage_ss” and “spatial_coverage_label_ss” fields respectively.
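As a small sketch of the steps just described, the Python below builds the PeriodO link from an identifier and maps spatial coverage entries to the two Solr fields. The function names and the (URI, label) tuple shape are assumptions for illustration. The base URL is taken from the link shown above.

```python
PERIODO_ARK_BASE = "http://n2t.net/ark:/99152/"


def periodo_link(identifier):
    """Build the PeriodO link for a period identifier."""
    return PERIODO_ARK_BASE + identifier.strip()


def spatial_coverage_fields(places):
    """Map (wikidata_uri, label) pairs to the spatial coverage Solr fields."""
    return {
        "spatial_coverage_ss": [uri for uri, _ in places],
        "spatial_coverage_label_ss": [label for _, label in places],
    }


link = periodo_link("p06c6g399wq")
coverage = spatial_coverage_fields([("http://www.wikidata.org/entity/Q30", "United States")])
```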
Our scripts for populating this index also included queries for retrieving Wikidata URIs for the main subject heading as well as for LCSH geographic components or subdivisions. In addition, we added Wikidata URIs related to spatial information extracted from PeriodO.
The PeriodO example above shows two cases in which Wikidata URIs are included. We have highlighted the relevant fields below:
"label_s":"Burr Conspiracy, 1805-1807",
"label_t":["Burr Conspiracy, 1805-1807"],
The subject heading "Burr Conspiracy, 1805-1807" has the URI <http://id.loc.gov/authorities/subjects/sh85018171>, which relates to the Wikidata URI <http://www.wikidata.org/entity/Q2994776>. The Solr record above uses the “wikidata_uri_s” field to store the Wikidata URI related to the main subject heading represented in that record. PeriodO provides the label “United States” and the Wikidata URI <http://www.wikidata.org/entity/Q30> for the related location. The Solr record stores this PeriodO Wikidata URI in the “spatial_coverage_ss” field.
The “Cathedrals–Europe” example, which we included above to show geographic components retrieved from LCSH, also includes Wikidata URIs for those geographic components. In the LCSH heading “Cathedrals–Europe”, “Europe” is identified by the LCSH URI <http://id.loc.gov/authorities/subjects/sh85045631>. Querying Wikidata for a URI related to this heading gives us <http://www.wikidata.org/entity/Q46>, which the Solr record stores in the “geo_wd_ss” field.