2020-01-10 Chicago report will be discussed at the DAG meeting on Jan 21 (expected to happen as scheduled). Follow up with David and Astrid to look at what we could apply from what they learned and what additional user studies they could help us with
DAG presentation did occur and was very interesting (link to meeting) – in-depth interviews with open-ended questions to understand research needs; we need more thought on how we might apply lessons from this
Dave made some changes to address issues with 50x responses; there are still possible issues under high load (possibly something on the Sinopia side, but waiting to add IP addresses to the logging to identify whether the same source is hitting us with the same request)
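Once IP addresses are in the logs, the check described above is a simple group-and-count. A minimal sketch, assuming a hypothetical "IP METHOD PATH" log line format (the real log format will differ):

```python
from collections import Counter

def repeated_requests(log_lines, threshold=10):
    """Count (client_ip, request) pairs in simple 'IP METHOD PATH' log
    lines and return the pairs seen at least `threshold` times -- i.e.
    candidates for a single source hammering us with the same request."""
    counts = Counter()
    for line in log_lines:
        ip, _, request = line.partition(" ")
        counts[(ip, request)] += 1
    return {pair: n for pair, n in counts.items() if n >= threshold}
```

Anything surviving the threshold would point at a single source repeating one request, as suspected above.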
Ongoing discussion about the need to build more complex indexes to deal with the slowness of complex queries
2020-01-31 - Dave, Steven, Lynette had a meeting this week. Created a list (HERE). Dave is still optimistic about performance improvements in SPARQL from changing CONSTRUCT to SELECT, but not sure when he will be able to try this. Also looking at indexing approaches with smaller sources than SVDE; will try LOC, which is expected to take 3 days. Will also work to cache context when needed and not request it when it isn't needed. Steven noted this on the #authorities channel. Three categories of approaches: 1) amount of extended context, 2) efficiency of queries, 3) scalability of requests. Longer term there are questions of lookup vs. autocomplete modes
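For reference, the CONSTRUCT-to-SELECT change has roughly this shape. The query below is a hypothetical label lookup, not one of the actual SVDE queries:

```python
# CONSTRUCT asks the endpoint to build and serialize a result *graph*:
construct_query = """
CONSTRUCT { ?s rdfs:label ?label }
WHERE { ?s rdfs:label ?label . FILTER(CONTAINS(LCASE(?label), "history")) }
"""

# SELECT returns only the bindings we need as a flat result set, which is
# typically cheaper for the server to produce and for the client to parse:
select_query = """
SELECT ?s ?label
WHERE { ?s rdfs:label ?label . FILTER(CONTAINS(LCASE(?label), "history")) }
"""
```

Whether this yields the hoped-for improvement depends on how much of the cost is in graph serialization versus query evaluation, which is what Dave's experiment would show.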
See items under travel/conferences
Status updates and planning
John has attended D&A meetings with user reps and showed a video of searching Google Books to deal with the zero-results case, which they found interesting. Adam notes that much D&A work is quite low-level and we don't have a good way to think about the big ideas.
Could we have a server for labs/beta? Might want to have a carefully chosen selection of the most promising features we develop
Maybe bring up the idea of a beta server at the March meeting? Or do it ahead of time so people can play. ACTION - Simeon to ask Adam to work out the cost of a similar server, and the devops work we would need
John had a discussion with HT about an API, which is on their to-do list but not scheduled. May be open to providing index access
There is ongoing investigation, but it is not yet clear whether it will result in something we can use
Steven: Investigated why some entities created by the catalogers aren't showing up in the Search Tab. It turns out that because embedded templates don't produce URIs for those entities, they are not indexed as separate entities for reuse. Only the "primary" entity gets a URI and appears as a search result. Eventually we can decide whether to de-embed templates and ask catalogers to copy and paste URIs between Sinopia templates so everything is searchable, or wait for Sinopia to improve on this.
John is looking at cases where queries return small numbers of bad search results. Should "a secret history" return "the secret history", especially if there are few matches from the default search? John will continue to explore...
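One way to explore the question above is to treat leading articles as noise when comparing titles. A minimal sketch, not tied to the actual Solr configuration (Solr could do the equivalent with a stopword filter on the title field):

```python
import re

ARTICLES = {"a", "an", "the"}

def normalize_title(title):
    """Lowercase, strip punctuation, and drop leading articles so that
    'A Secret History' and 'The Secret History' compare as equal."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    while tokens and tokens[0] in ARTICLES:
        tokens.pop(0)
    return " ".join(tokens)
```

Under this normalization, "a secret history" and "the secret history" collapse to the same key, so the first would match the second.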
John: Followed up with HathiTrust and they may have Solr results and/or an access experiment in a few weeks
ANNIF update: Annif requires a vocabulary and then training data to enable suggestion retrieval based on input text or an input document. Worked through the tutorial/documentation (https://github.com/NatLibFi/Annif-tutorial/ and https://github.com/NatLibFi/Annif) to set up an LCSH vocab and training documents based on our Solr index. Vocab: retrieved all LCSH pref label to URI matches from Dave's LCSH SPARQL endpoint (excluding any blank nodes). Training documents: first queried the Solr index for 10000 documents, looking for the full title display and subject display fields, then set up a script to go through the documents and query the text of each subject field against id.loc.gov to retrieve URIs. The resulting training set had 8432 rows (each row is a tab-delimited title followed by whatever subject URIs correspond). With the vocab and training documents loaded, Annif can be asked through the command line or through a REST API for subject suggestions based on input text/query. Tried that out with a few keywords and could see some results. Next plans: (a) integrate the REST API with the data as it stands into the subject/person suggestion UI, (b) increase the size of the training document set, and (c) look into what it would take to set up the ensemble option, which allows integrating multiple text analysis/classification strategies.
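The training-corpus rows described above can be sketched as follows. The row shape (title text, then tab-separated subject URIs) follows the description in these notes, not necessarily Annif's canonical corpus spec, and the example title and URI are illustrative only:

```python
def training_row(title, subject_uris):
    """Format one training-corpus line: the title text followed by each
    matched LCSH URI, all tab-delimited (per the row shape in these notes)."""
    return "\t".join([title] + list(subject_uris))

# Illustrative only -- not a real record from the 8432-row set:
row = training_row(
    "An example title",
    ["http://id.loc.gov/authorities/subjects/example"],
)
```

The script that built the real set would emit one such row per Solr document whose subject strings resolved to URIs at id.loc.gov.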
Additionally, Annif's own site includes Wikidata (English) suggestions. John may look at these.
Tim demonstrated auto-suggest based on a small specialized index that separates authors, locations, subjects, etc. Would probably require a specialized index to support at scale
John is working on getting synonyms from Wikidata in the hope of addressing the zero-results scenario. Feels that the current Wikidata data is rather limited in its understanding of synonyms
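The Wikidata lookup above would likely lean on alternate labels (aliases). A sketch of the query construction, assuming exact English-label matching, which is brittle; the real work would probably go through Wikidata's search API first:

```python
def synonym_query(term):
    """Build a Wikidata SPARQL query fetching English aliases
    (skos:altLabel) for entities whose English label matches the term."""
    return f'''
SELECT DISTINCT ?alt WHERE {{
  ?item rdfs:label "{term}"@en ;
        skos:altLabel ?alt .
  FILTER(LANG(?alt) = "en")
}}'''
```

The limitation John notes would show up here directly: for many entities the altLabel set is sparse, so the query returns few or no synonyms.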
Huda demonstrated related subjects and people. The demo searches LCSH, facets values based on the query, and also calls Annif (set up locally with 10k records from our catalog with titles and subjects, and the entire LCSH vocab). Will work to add more data into Annif, will try to get FAST info from Dave, and will also avoid showing empty boxes when there is no query
Plan to wrap up work next week, then move on to user tests, write up and video
Open meeting March 3, 2-3:30pm in Mann 102; we should offer Zoom for it too
How will we decide what to take forward from KAPOW!, BAM! and SMASH!? (or as Tim put it, "what happens in late February?")
Briefly discussed how there is probably enough content between user research, larger questions, and experiments/development and prototyping to fit into two presentations. Tim suggested using the larger questions to frame discussion of experiments/prototyping.