





References



Terminology


Term | Definition | Comments
Solr Instance
  • multiple instances can run ('multiple solr instances are running')
  • deploy webapp on multiple servers, each of which is an instance
Solr Core
  • each solr instance can have multiple cores
  • also referred to as Solr Index, or simply Core or Index
  • backed by a Lucene index (loosely comparable to a database)
  • generally, each core runs in isolation, but can configure some communication between cores via CoreContainer
Document
  • 0..m documents live in a core
  • basic unit of information
Field
  • 0..m fields live in a document
  • various types:  text, numeric, date, etc.
  • type tells solr how to interpret the field and how it can be queried
  • type: String stores a word/sentence as an exact string without performing tokenization etc. Commonly useful for storing exact matches, e.g., for faceting.

  • type: Text typically performs tokenization, and secondary processing (such as lower-casing etc.). Useful for all scenarios when we want to match part of a sentence.

Facet








Indexing Documents


  • index via...
    • Request Handlers & Update Handlers (via HTTP POST/PUT)
      • default:  XML, Binary, JSON, CSV, etc.
      • can define own handlers in config
    • Index Handlers
      • import from databases
    • Solr Cell framework (extracts text and metadata from rich documents such as PDF and Word files, via Apache Tika)
    • custom Java application to ingest data through Solr's Java Client and other apps
  • update processors
    • signature
    • logging
    • indexing
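
As a rough illustration of the HTTP route listed above, the sketch below builds an XML add message and wraps it in a POST request for the /update handler. The host, the core name development, and the field names id and title_tsi are assumptions for illustration only. The request is constructed but not sent, so the code runs without a Solr server.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_message(doc):
    """Serialize a dict of field name -> value(s) as a Solr XML <add> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        # A list or tuple becomes one <field> element per value (multivalued field).
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(item)
    return ET.tostring(add, encoding="unicode")


def build_update_request(base_url, core, xml_body):
    """Wrap an XML update message in an HTTP POST to the core's /update handler."""
    url = f"{base_url}/{core}/update?commit=true"
    return urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
        method="POST",
    )


# Hypothetical document and core name, for illustration only.
xml_body = build_add_message({"id": "SOLR1000", "title_tsi": "Archery basics"})
request = build_update_request("http://localhost:8983/solr", "development", xml_body)
print(xml_body)
# urllib.request.urlopen(request)  # uncomment to send to a running Solr instance
```

Passing commit=true makes the change visible right away; for bulk loads it is usually better to commit once at the end.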




Request Handlers


<!--  solr.SearchHandler  -->
<requestHandler name="standard" class="solr.SearchHandler">               <!-- /select -->
<requestHandler name="search" class="solr.SearchHandler" default="true">
<requestHandler name="permissions" class="solr.SearchHandler" >
<requestHandler name="document" class="solr.SearchHandler" >

<!--  solr.UpdateRequestHandler  -->
<requestHandler name="/update" class="solr.UpdateRequestHandler"  />

<!--  other handlers  -->
<requestHandler name="/replication" class="solr.ReplicationHandler" startup="lazy" />
<requestHandler name="/analysis/field" startup="lazy" class="solr.FieldAnalysisRequestHandler" />
<requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />
<requestHandler name="/admin/ping" class="solr.PingRequestHandler">


To see what a request handler returns, open the Query page in the Solr admin UI (https://cals-che-repo-dev.library.cornell.edu/solr/#/development/query) and change the value of qt from /select to the name of the handler.  NOTE: Replace the host with your own Solr admin host and, if needed, replace the core name development with the name of your core.






Querying


  • receive XML, JSON, CSV, or binary (via HTTP GET)
  • request handlers (via HTTP GET)
    • default:  /admin, /select, /spell
    • can define own handlers in config
  • search components
    • query
    • spelling
    • faceting
    • highlighting
    • statistics
    • debug
    • clustering
  • search process  (see Common Query Parameters)


    parameter | description | default | example
    qt | selects the Request Handler for a query using /select | DisMaxRequestHandler |
    defType | selects a Query Parser for the query | parser configured in the Request Handler |
    q | field_name:field_value to search for, with * as wildcard | *:* | q=title:*Archery*
    fq | filters the query by applying an additional query to the initial query's results and caches the results (same syntax as q) | *:* | fq=popularity:[10 TO *]&fq=section:0
    sort | sort field | score desc |
    start | an offset into the query results where the returned response should begin | 0 | start=0
    rows | the number of rows to be displayed at one time | 10 | rows=20
    fl | fields to return in the result | all | fl=id,name
    df | default field to search when the query does not name a field | all indexed fields | df=description
    wt | selects a Response Writer for formatting the query response | xml or json | wt=json
    qf | list of fields and the "boosts" to associate with each of them when building DisjunctionMaxQueries (see also SOLR df and qf explanation) | all indexed fields are required (???) | qf=title^20 description^10
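
The parameters above are combined into one query string on the /select handler. The following is a minimal sketch of assembling such a URL with Python's standard library; the host and the core name development are assumptions, and the parameter values are the examples from the list above.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, params):
    """Return a /select URL with percent-encoded query parameters."""
    # A list of (key, value) pairs keeps the order and allows
    # repeated keys such as fq.
    return f"{base_url}/{core}/select?{urlencode(params)}"


params = [
    ("q", "title:*Archery*"),
    ("fq", "popularity:[10 TO *]"),
    ("fq", "section:0"),
    ("start", 0),
    ("rows", 20),
    ("fl", "id,name"),
    ("wt", "json"),
]
url = build_select_url("http://localhost:8983/solr", "development", params)
print(url)
```

Encoding matters here because characters such as spaces, brackets, and colons in q and fq are not safe to paste into a URL unescaped outside a browser.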








Features


  • High Level
    • Advanced Full-Text Search
    • Optimized for High Volume Web Traffic
    • Standards Based Open Interfaces - XML, JSON, HTTP
    • Comprehensive HTML Admin Interfaces
    • Service statistics exposed over JMX for monitoring
    • Near Real-time indexing
    • Adaptable with XML configuration
    • Linearly scalable, auto index replication, auto failover and recovery
    • Extensible plugin architecture
  • Specific Features
    • faceting
    • highlighting
    • spell checking
    • query re-ranking
    • transforming
    • suggesters
    • more like this
    • pagination
    • grouping & clustering
    • spatial search
    • components
    • real time (get & update)
    • labs






Configuration


  • schema.xml
    • field types
    • etc.
  • solrconfig.xml
    • register Request Handlers for querying the index
    • register Update Handlers for indexing documents
    • register Event Handlers for searcher events (e.g. queries to execute to warm new searchers)
    • activate version-dependent features in Lucene
    • Lib directives indicate where Solr can find JAR files for extensions
    • Index management settings
    • Enable JMX instrumentation of Solr MBeans
    • Cache-management settings
  • solr.xml
  • core.properties




Fields


Defined in schema.xml


Hydra Types: 


defined by <types><fieldType>...</></>


  • t - text  (tokenized)
  • te - english text  (tokenized)
  • s - string
  • i - integer;  it - trie integer
  • f - float;  ft - trie float
  • l - long;  lt - trie long
  • db - double;  dbt - trie double
  • b - boolean
  • dt - date;  dtt - trie date
  • ll - location; _coordinate - trie double to index lat and long of a location with indexed=true/stored=false


NOTE: the letter code is the suffix that sets the type for Hydra dynamic fields.  Ex. name_tsi means that name has type="text"


Hydra Field Def Parameters:


defined by <fields><dynamicField>...</></>


  • s - stored="true|false" - if true, value is returned in solr document
  • i - indexed="true|false" - if true, value is searchable
  • m - multiValued="true|false" - if true, can have multiple values
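  • v - termVectors="true|false" - if true, term vectors are stored for the field (used by features such as MoreLikeThis and highlighting)
  • v - termPositions="true|false" - if true, term positions are stored in the term vectors
  • v - termOffsets="true|false" - if true, term character offsets are stored in the term vectors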


NOTE: each letter is the suffix character that sets that parameter to true for Hydra dynamic fields.  Ex. name_tsi means that name has stored="true" and indexed="true"
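
To make the suffix convention concrete, here is a small sketch that decodes a Hydra dynamic field name into its type and parameters. It is an illustration, not Hydra's own code: the function name decode_hydra_field is hypothetical, the flag order (s, then i, then m) is an assumption based on examples such as name_tsi, and the double, location, and term-vector codes are left out.

```python
import re

# Type codes from the list above (double and location types are omitted).
TYPE_CODES = {
    "t": "text", "te": "english text", "s": "string",
    "i": "integer", "it": "trie integer",
    "f": "float", "ft": "trie float",
    "l": "long", "lt": "trie long",
    "b": "boolean", "dt": "date", "dtt": "trie date",
}

# Longest codes first, so that "te" is tried before "t" and "dtt" before "dt".
_codes = "|".join(sorted(TYPE_CODES, key=len, reverse=True))
# Assumed flag order: s (stored), i (indexed), m (multiValued).
SUFFIX_RE = re.compile(rf"^({_codes})(s?)(i?)(m?)$")


def decode_hydra_field(field_name):
    """Split e.g. 'name_tsi' into base name, type, and boolean parameters."""
    base, _, suffix = field_name.rpartition("_")
    match = SUFFIX_RE.match(suffix)
    if not base or match is None:
        raise ValueError(f"not a recognised dynamic field name: {field_name!r}")
    code, stored, indexed, multi = match.groups()
    return {
        "name": base,
        "type": TYPE_CODES[code],
        "stored": bool(stored),
        "indexed": bool(indexed),
        "multiValued": bool(multi),
    }


print(decode_hydra_field("name_tsi"))
print(decode_hydra_field("title_tesim"))
```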


Examples for values of stored and indexed:


stored="true" indexed="false"

  • destination URL
  • file system path
  • time stamp
  • icon image
  • sort string - have a name that is tokenized text with stored=false/indexed=true and this field that is the exact string for sorting




stored="false" indexed="true"

  • bag of words - want to be able to search for all terms in the bag, but don't want them in the solr document search results
  • common misspellings - allow common misspellings to match in search, but don't include in solr document search results




indexed="false" stored="false"

  • Use this when you want to ignore fields. For example, the following will ignore unknown fields that don't match a defined field rather than throwing an error by default.

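    <!-- a fieldType needs a class; StrField is the one used for this purpose in the Solr example schema -->
    <fieldType name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />
    <dynamicField name="*" type="ignored" />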








Solr Cloud Features


  • horizontal scaling (for sharding and replication)
  • elastic scaling
  • high availability
  • distributed indexing
  • distributed searching
  • central configuration for entire cluster
  • automatic load balancing
  • automatic failover for queries
  • zookeeper integration for coordination & configurations




CRUD


Create




Read


Return all results with search term = "book"


Query for search term
http://localhost:8983/solr/development/select?q=book


Update




Delete


NOTE: Examples use stream.body to show how to do this through a URL.  Usually done via HTTP POST.


Delete by ID
http://localhost:8983/solr/development/update?stream.body=<delete><id>SOLR1000</id></delete>
http://localhost:8983/solr/development/update?stream.body=<commit/>




Delete by Query
http://localhost:8983/solr/development/update?stream.body=<delete><query>cat:software</query></delete>
http://localhost:8983/solr/development/update?stream.body=<commit/>
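
Since deletes are usually sent via HTTP POST rather than through stream.body, the sketch below shows one way the same two delete messages might be posted from Python. The host, the core name development, and the helper names are assumptions for illustration. The request is built but not sent.

```python
import urllib.request
from xml.sax.saxutils import escape


def delete_by_id(doc_id):
    """XML message that deletes a single document by its unique key."""
    return f"<delete><id>{escape(doc_id)}</id></delete>"


def delete_by_query(query):
    """XML message that deletes every document matching the query."""
    return f"<delete><query>{escape(query)}</query></delete>"


def build_post(base_url, core, xml_body):
    """Build (but do not send) a POST to the core's /update handler."""
    # commit=true replaces the separate <commit/> request.
    url = f"{base_url}/{core}/update?commit=true"
    return urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
        method="POST",
    )


request = build_post(
    "http://localhost:8983/solr", "development", delete_by_query("cat:software")
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# urllib.request.urlopen(request)  # uncomment to send to a running Solr instance
```

The escape call matters for queries containing characters such as & or <, which would otherwise produce malformed XML.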




Steps to delete all via Solr Admin UI


  • In the Solr UI, select the core to affect from the selection box on the left side menu
  • select Documents on the left side menu
  • set Document Type = XML
  • set the Document(s) text area to `<delete><query>*:*</query></delete>`
  • leave Commit Within and Overwrite at their defaults
  • Submit








More Query Examples


Search for a specific field, category, containing a search term, book


Query for search term in a specific field
http://localhost:8983/solr/development/select?q=category:book




Search for price between 0 and 400, inclusive


Search for range of values
http://localhost:8983/solr/development/select?q=price:[0 TO 400]
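
The range query contains a colon, spaces, and square brackets, which generally need percent-encoding when the URL is built outside a browser. A minimal sketch, assuming the same host and core name as the example above (range_query is a hypothetical helper):

```python
from urllib.parse import urlencode


def range_query(field, low, high):
    """Build an inclusive Solr range clause such as price:[0 TO 400]."""
    return f"{field}:[{low} TO {high}]"


query = range_query("price", 0, 400)
url = "http://localhost:8983/solr/development/select?" + urlencode({"q": query})
print(query)
print(url)
```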




Limit search results to return only fields id, name, and price.


Query for search term & limit fields returned
http://localhost:8983/solr/development/select?q=book&fl=id,name,price




Return facets for a specific field, category, with counts for each value of category based on the search results.


Query for search term & limit fields returned & include facets
http://localhost:8983/solr/development/select?q=book&fl=id,name,price&facet=on&facet.field=category


Partial response showing the returned facet information.


Response
<lst name="facet_counts">
  <lst name="facet_queries" />
  <lst name="facet_fields">
    <lst name="category">
      <int name="book">10</int>
      <int name="video">2</int>
      <int name="audio">2</int>
    </lst>
  </lst>
  <lst name="facet_dates"/>
</lst>




Return facets for a specific field, category, while filtering the results to a specific category value, book. Counts are still returned for each value of category, based on the filtered search results.


Query for search term & limit fields returned & include facets & apply filter query
http://localhost:8983/solr/development/select?q=book&fl=id,name,price&facet=on&facet.field=category&fq=category:book


Partial response showing the returned facet information.


Response
<lst name="facet_counts">
  <lst name="facet_queries" />
  <lst name="facet_fields">
    <lst name="category">
      <int name="book">10</int>
      <int name="video">0</int>
      <int name="audio">0</int>
    </lst>
  </lst>
  <lst name="facet_dates"/>
</lst>


NOTE: Can include multiple filter queries (fq).


NOTE: When filter query is applied, all categories are still listed, but now have 0 for count if they don't include the filtered value.
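
To read facet counts like the ones above from code, the XML response can be parsed with a standard XML library. The sketch below uses the filtered partial response from this section as sample input; the function name facet_counts is hypothetical.

```python
import xml.etree.ElementTree as ET

# Partial response copied from the filtered facet example above.
SAMPLE = """
<lst name="facet_counts">
  <lst name="facet_queries" />
  <lst name="facet_fields">
    <lst name="category">
      <int name="book">10</int>
      <int name="video">0</int>
      <int name="audio">0</int>
    </lst>
  </lst>
  <lst name="facet_dates"/>
</lst>
"""


def facet_counts(xml_text, field):
    """Return {facet value: count} for one facet field of an XML response."""
    root = ET.fromstring(xml_text)
    path = f".//lst[@name='facet_fields']/lst[@name='{field}']/int"
    return {el.get("name"): int(el.text) for el in root.findall(path)}


counts = facet_counts(SAMPLE, "category")
print(counts)
# Keep only the values that still have matches after the filter query.
non_empty = {value: n for value, n in counts.items() if n > 0}
print(non_empty)
```

Dropping the zero-count values, as in the last step, is a common way to hide facet values that the filter query has excluded.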






