
SOLR Search Request Handlers explained

With SOLR you can execute complex queries over your indexed documents.
As with other software, the possibilities have grown over time, and there are now many different configurations and parameters that can be used to specify a query in SOLR.
This post summarizes the main concepts and parameters and shows how they can be combined. The goal is to give a more complete picture of how SOLR and its search request handlers work, and how powerful they can be.

(Take your time reading the article, and have a SOLR installation at hand if you want to run the examples. I hope this article helps you gain some more insight into how to search with SOLR.)

Request Handlers

Let's start with the basics: in SOLR, so-called request handlers are responsible for answering your requests. All request handlers of your SOLR installation are configured in the solrconfig.xml. Each request handler has a name and an assigned class that is responsible for handling the request. If the name starts with a "/", you can reach the request handler by calling the corresponding path.

For example the update Handler is configured like this:
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />

That means you can reach this handler by calling <your_solr_url>/update 

If the name does not start with "/", you can by default call the request handler with the path select and the parameter qt, like this: "/select?qt=standard"...

Request handlers of this kind are normally handlers that search for something (the so-called "search handlers").
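
The two ways of addressing a handler can be sketched in a few lines of Python. This is only an illustration: the base URL is an assumption (adjust host, port and path to your installation), and handler_url is a hypothetical helper name, not part of SOLR.

```python
from urllib.parse import urlencode

# Assumed base URL; adjust host, port and path to your installation.
SOLR_BASE = "http://localhost:8983/solr"


def handler_url(name, params=None):
    """Build the URL for a request handler configured in solrconfig.xml.

    Handlers whose name starts with "/" are addressed by their path.
    All other handlers are addressed through /select plus the qt parameter.
    """
    params = dict(params or {})
    if name.startswith("/"):
        path = name
    else:
        path = "/select"
        params["qt"] = name
    query = urlencode(params)
    return SOLR_BASE + path + ("?" + query if query else "")


print(handler_url("/update"))
print(handler_url("standard", {"q": "forum"}))
```

The first call yields a path-based URL, the second one a /select URL that carries the handler name in qt.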

You can have a look at your solrconfig.xml for more details on the configured handlers, or call the plugin handler like this: /admin/plugins (if it is configured, of course).

You can check the solrconfig.xml examples here: the solrconfig.xml in the Solr SVN and the solrconfig.xml included in the TYPO3 solr extension (which is used in the examples below).

Search Handlers

So we can define search handlers as SOLR request handlers that return a list of results. They often have names that do not start with "/" and are therefore used by specifying the qt parameter in the URL, like ".../select?qt=<searchhandlername>".

Search handlers normally use different components that each do part of the work. For example, there are components for querying, faceting, highlighting etc.

All the search Handlers should understand the so called common query parameters [3]. The main ones are:

  • q - the query string, which is parsed using a query parser
  • sort - defines the sorting
  • start, rows - define the offset and the result count
  • fq - defines filter queries; you can use several of them, and their results are cached (the caching can also be disabled)
  • fl - specifies the list of fields that should be returned for the matched documents. For the upcoming SOLR 4.0 this parameter has nice additional features, like pseudo fields, function results as fields etc. [7]
  • defType - specifies the query parser
  • debugQuery - set this to "on" to see the parsed query and details of the score calculation


In addition, the components in use "understand" their own parameters.

Please also note that some of these parameters (and all other parameters) can be configured with default values in the solrconfig.xml.
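
To see how the common parameters end up in a request, here is a small Python sketch that assembles a query string. The field names (content, pagetype, language, title) are assumptions for illustration only. A list of pairs is used instead of a dict so that fq can be repeated.

```python
from urllib.parse import urlencode

# A list of (name, value) pairs keeps repeated parameters such as fq.
# All field names below are illustrative assumptions.
params = [
    ("q", "content:forum"),
    ("defType", "lucene"),
    ("fq", "pagetype:app"),
    ("fq", "language:en"),
    ("sort", "score desc"),
    ("start", 0),
    ("rows", 10),
    ("fl", "id,title,score"),
    ("debugQuery", "on"),
]

# urlencode takes care of escaping characters like ":" and spaces.
query_string = urlencode(params)
print("/select?" + query_string)
```

Any parameter that is left out here falls back to the default configured for the handler, if there is one.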

Query Parsers:

The solr.SearchHandler can be configured to use different query parsers for "translating" the value of the "q" parameter into the corresponding Lucene query. The query parser is therefore one of the most important parts of a search handler, and understanding what is happening behind the scenes is useful.

Unless configured otherwise the query parser used is the standard lucene query parser.

There are other parsers that are often used, like dismax and edismax. Each has its pros and cons. The parser to use is defined with the defType parameter (either explicitly in the request or implicitly by using a search handler that is configured with it).

Another way of specifying the parser that should be used is the LocalParams[1] syntax. We will have a look at some examples later.
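
As a first impression of the syntax, the following Python sketch builds a LocalParams prefix of the form {!parser key=value}. The helper name local_params is hypothetical, and the quoting rule used here (single quotes around values that contain whitespace or quotes) is a simplification of what the syntax allows.

```python
def local_params(parser, query, **params):
    """Prefix a query string with a LocalParams block such as {!dismax qf=content}."""
    parts = [parser]
    for key, value in params.items():
        value = str(value)
        # Quote values that contain whitespace or quotes,
        # escaping backslashes and single quotes inside them.
        if any(ch.isspace() for ch in value) or "'" in value or '"' in value:
            value = "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"
        parts.append(f"{key}={value}")
    return "{!" + " ".join(parts) + "}" + query


print(local_params("dismax", "forum"))
print(local_params("dismax", "forum", qf="content^2 title^10"))
```

The resulting string is used as the value of the q parameter (and has to be URL-encoded like any other parameter value).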

Here is the list of query parsers, which allow you to build really fancy queries:

  • lucene - the standard parser - see below for details
  • dismax - aims to deal with a "human query string" - see below for details (there is also the extended edismax parser in newer versions)
  • func - builds function queries. Not really useful as a standalone parser, but often used together with others (_val_ hook for the lucene parser and bf parameter for the dismax parser)
  • boost - boosts a query
  • frange - can be used to speed up range queries
  • field - simple field query, useful in filter queries
  • prefix - simple prefix query, useful e.g. in filter queries
  • raw / term - create a raw term query from the input
  • query - allows combining different query types ("nesting")

 

A closer look at the query parsers

The lucene query parser (standard)

The lucene query parser understands the Lucene query syntax, plus a few Solr-specific extensions (see [2]).

Let's have a look at a first basic example:
/select?q=forum&rows=10&qt=standard&wt=standard&debugQuery=on

Please note two details here. First, debugQuery=on is used, which adds the result of the query parsing together with ranking details to the response. Second, the lucene parser is used by setting the request handler to standard (query type: qt=standard). The "standard" search request handler normally uses the lucene parser because it is configured without a defType, as in the following snippet from the solrconfig.xml. (It might be different for your installation.)

<requestHandler name="standard" class="solr.SearchHandler">
    <!-- default values for query parameters -->
     <lst name="defaults">
       <str name="echoParams">explicit</str>
     </lst>
     <arr name="last-components">
       <str>spellcheck</str>
     </arr>
  </requestHandler>

An alternative way of specifying this query is to set the query parser with the defType parameter, like this:
/select?q=forum&rows=10&wt=standard&debugQuery=on&defType=lucene

The resulting parsed query of this example is "content:forum", meaning it will return all documents with the term "forum" in the content field.
The score of each document is calculated mainly from the termFrequency (how often the term occurs in the field) and the fieldNorm (which reflects the overall length of the field).
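
To get a feeling for how these two factors interact, here is a small Python sketch. It assumes the classic Lucene TF-IDF similarity (tf as the square root of the term count, length norm as one over the square root of the number of terms in the field) and leaves out idf, query normalization and boosts. The numbers are therefore illustrative only and will not match the debugQuery output.

```python
import math


def tf(freq):
    """Term frequency factor: grows with the number of occurrences, but sublinearly."""
    return math.sqrt(freq)


def field_norm(num_terms):
    """Length norm: shorter fields get a higher value than longer ones."""
    return 1.0 / math.sqrt(num_terms)


# Assumed example documents: one short field with a single match,
# one long field with two matches.
short_doc = tf(1) * field_norm(4)
long_doc = tf(2) * field_norm(100)

print(short_doc, long_doc)
print("short document ranks first:", short_doc > long_doc)
```

So a single match in a short field can outweigh several matches in a long one.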

Here is another example:
/select?q=forum billing&rows=10&defType=lucene&wt=standard&debugQuery=on

The resulting query is "+content:forum +content:billing" - note that both terms are marked as required (the leading "+").

Examples - search in different fields:
You can use the lucene parser to explicitly search in specific fields. For easier explanation, the next examples only show the q parameter:
q=content:forum title:billing
Resulting Query: "+content:forum +title:billing"

So this example will return documents that match the term "forum" in the content field and "billing" in the title field. Note that the terms are automatically combined using "AND" here (both are required), which reflects the default operator of the configuration used for these examples. If you don't want this, you can set the parameter q.op to "OR" or explicitly write the query like this:
q=content:forum OR title:billing
Resulting Query: "content:forum title:billing"

You can also boost certain term matches for the default score calculation like this:
q=+content:nokia content:prepaid^10 title:billing
q.op=OR
Resulting Query: "+content:nokia content:prepaid^10.0 title:billing"

Guess what this query does? It will find all documents that match the term "nokia" in the content field, and it will additionally boost documents that match the term "prepaid" in the content or "billing" in the title. The term "prepaid" is 10 times more relevant for the score calculation, but documents that don't contain the term "prepaid" are still listed, since the term is not marked as required.

Multiple terms in one field can also be written like this:
q=title:(nokia handy)
Resulting Query: "+title:nokia +title:handy"

Hooks and function queries:
The standard query parser allows much more, like negative term queries, fuzzy search and range queries. Last but not least, two "hooks" are available, _val_ and _query_, which allow combined usage with function queries (_val_) and support nested subqueries of any type (_query_).
An example query using _val_:
q=forum _val_:"{!func}id"
Resulting Query: "+content:forum +FunctionQuery(str(id))"

So this query will find documents with the term "forum" and add the value of the id field to the score (of course there are better use cases, for example a function query that adds a value to the score based on a date field). We will see examples of nested queries later.


To summarize the pros of the lucene parser:

  • lets you select on specific fields if you need fine-grained control
  • enables powerful usage of the Lucene query syntax with a lot of features [2]
  • offers two hooks that allow more complex use cases

The dismax query parser

The dismax parser is designed to interpret user input like "nokia akku tips" or "smartphone nokia -iphone". It also has a lot of parameters to control the score calculation using phrase queries, boost functions etc.
Let's see some examples:
/select?q=forum&rows=10&qf=content&qt=dismax&wt=standard&debugQuery=on

Because we set the request handler to "dismax" with the qt parameter, the request handler configured as follows is used:
 <requestHandler name="dismax" class="solr.SearchHandler" >
    <lst name="defaults">
     <str name="defType">dismax</str>
     <str name="echoParams">explicit</str>
    </lst>
  </requestHandler>

So in fact this will use the dismax query parser without any predefined parameters, and the query will expand to "+(content:forum) ()". Pay attention to the extra parameter qf (query fields), which must contain the list of fields to search in.
Another way of forcing the dismax query parser is to use the LocalParams syntax, like this: q={!dismax}forum

Let's look at a second example (again, only the q parameter is shown):
q=forum +must -dont
Resulting Query: "+( ( (content:forum) +(content:must) -(content:dont))~1) ()"

So the parser also supports explicitly requiring ("+") or excluding ("-") a term. The "~1" in the parsed query means that only one of the optional expressions needs to match (see the mm parameter below).

The dismax parser lets you search in multiple fields with different relevancy by setting the qf (query fields) parameter:
q=forum
qf=content^2 title^10

Resulting Query: "+(content:forum^2.0 | title:forum^10.0) ()"
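
The "|" in the parsed query stands for a disjunction max clause: for each term, the best matching field determines the score. A minimal Python sketch of that idea follows. The per-field scores are made-up numbers, and the tie argument mirrors the dismax tie parameter, which lets the other fields contribute a fraction of their score.

```python
def dismax_score(field_scores, tie=0.0):
    """Score of one disjunction max clause.

    The best field wins. The remaining fields only count with the
    fraction given by tie (0.0 means they are ignored completely).
    """
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


# Made-up scores of the term "forum" in the content and title fields.
scores = [0.4, 1.5]

print(dismax_score(scores))           # only the best field counts
print(dismax_score(scores, tie=0.1))  # other fields contribute 10 percent
```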


The dismax parser comes with a couple of parameters that help to translate the human query into a relevant lucene query [4]. Let's have a closer look at some of them.

The mm (minimum should match) parameter:
Here you define how many terms must match. Let's say you search for "nokia sony panasonic": an mm of "2" means that only 2 of the terms need to match (so it will also find documents that contain only nokia and sony, for example).
To make it more concrete, look at this example:
/select?q=nokia sony panasonic&rows=10&qt=dismax&debugQuery=on&mm=2&qf=content^10 title^20

This will result in the parsed query:
  (
    (content:nokia^10.0 | title:nokia^20.0)
    (content:sony^10.0 | title:sony^20.0)
    (content:panasonic^10.0 | title:panasonic^20.0)
  )~2

This is where you see how powerful the dismax parser is. The parsed queries quickly become very big, especially when using many query fields together with the other dismax parameters.
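
The matching rule behind mm can be sketched in Python like this. It is a simplification: mm is treated as a plain number here (the real parameter also accepts percentages and conditional expressions), documents are reduced to sets of terms, and the function name matches is hypothetical.

```python
def matches(doc_terms, query_terms, mm):
    """True if at least mm of the optional query terms occur in the document."""
    hits = sum(1 for term in query_terms if term in doc_terms)
    return hits >= mm


query = ["nokia", "sony", "panasonic"]

# Contains two of the three terms, so it matches with mm=2.
print(matches({"nokia", "sony", "phone"}, query, mm=2))

# Contains only one of the three terms, so it does not match with mm=2.
print(matches({"nokia", "phone"}, query, mm=2))
```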

Other Parameters:
The dismax parser supports many more parameters, mainly to control the scoring:

  • Phrases (for boosting documents where all terms match in close proximity)
    • pf - phrase fields (e.g. content^2)
    • ps - phrase slop (e.g. 15)
  • bf - boost function, shorthand for the _val_ hook (similar to the lucene parser)
  • bq - boost query - any additional query that is executed and added to the scoring

For more details see [4]

So the big advantage of the dismax parser is that you get a good score calculation based on normal user queries (much better than the result of a simple lucene parser query).

Boost Query Parser

Let's have a look at the boost query parser. You can use it with the LocalParams syntax like this:
q={!boost b=<the boost function query>}<any other query>

Note that the boost function itself is set with the local param "b" and that it always needs another "normal" query string as well.

For example q={!boost b=id}forum
will result in the parsed Query: "boost(content:forum,str(id))"

The result means that all documents that match "forum" in the content field will be returned and additionally boosted by the value of the id field (which is not very useful, of course, but serves as an example). You can also see that the lucene parser is used for the appended query string "forum", since this is the default parser.

If you want to use the boost query together with a dismax query, you can do it like this:
q={!boost b=id}{!dismax}nokia sony panasonic
Resulting Query: "boost(+(((content:nokia^40.0 | title:nokia ....))~2 (content:"nokia sony panasonic"~15^2.0),str(id))"

(If you want to use this, please note that the configured default parameters for the dismax parser are used, but you can also override them inline in the LocalParams syntax.)
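
One difference is worth spelling out: the boost parser multiplies the score of the wrapped query by the function value, while the _val_ hook and the bf and bq parameters add to the score. The following Python sketch only illustrates that arithmetic with made-up numbers. It ignores query normalization, so the values will not match real debugQuery output.

```python
def additive_boost(query_score, function_value):
    """Roughly what _val_, bf and bq do: the extra value is added to the score."""
    return query_score + function_value


def multiplicative_boost(query_score, function_value):
    """Roughly what {!boost b=...} does: the score is multiplied by the function value."""
    return query_score * function_value


# Made-up example: a query score of 2.0 and a function value of 3.0.
print(additive_boost(2.0, 3.0))
print(multiplicative_boost(2.0, 3.0))

# With a function value of 0 the multiplicative variant wipes out the score.
print(multiplicative_boost(2.0, 0.0))
```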

Function Query Parser

The last query parser covered in this article is the function query parser. This parser lets you use functions in your query. The normal use case is to influence the scoring the way you would like it (e.g. boosting newer documents...).

Let's look at a simple example:

/select?q={!func}id&rows=10&debugQuery=on

This results in this query: str(id)

It will simply find all documents (a function query by itself does not filter), and the score equals the value of the field id.

Another example is this:

/select?q={!func}product(2,price_f)&debugQuery=on&fl=id,score,price_f&fq=id:2

The resulting query is: "product(const(2.0),sfloat(price_f))"

Please note that we used the parameter "fq" to filter for the document with the id 2, and we also displayed the "score" field. So this query will give us the doubled price of the document in the score field. You can use this to calculate and return fancy stuff for documents (like term frequency etc.).

Typical usage of function query

Typically the function query is used together with other parsers. Most of the parsers have support for function queries:

  • lucene parser: using the _val_ hook
  • dismax parser: using the _val_ hook or the bf parameter

Example dismax and bf usage:
/select?q=nokia&debugQuery=on&qf=content title&qt=dismax&bf=id

Will result in: "+(content:nokia | title:nokia) () str(id)"

Combining dismax and lucene

With the above knowledge, it's easy to combine the two parsers in different ways:

Lucene in dismax:
Using the bq parameter of dismax to add a lucene query:

/select?q=car&debugQuery=on&qf=content&qt=dismax&bq=pagetype:app

Resulting query: "+(content:car) () pagetype:app"

Dismax in lucene:

Using the _query_ hook of the lucene parser to add a dismax query:

/select?q=pagetype:app _query_:"{!dismax%20qf=content}car"&debugQuery=on&qt=standard

Resulting query: "+pagetype:app +(+(content:car) ())"
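
When such nested queries are built in code, the quoting needs some care, because the sub-query sits inside double quotes. Here is a small Python sketch. The helper name nested is hypothetical, and the escaping covers only backslashes and double quotes.

```python
from urllib.parse import urlencode


def nested(sub_query):
    """Wrap a sub-query for the _query_ hook, escaping backslashes and double quotes."""
    escaped = sub_query.replace("\\", "\\\\").replace('"', '\\"')
    return f'_query_:"{escaped}"'


# Same query as in the example above: a lucene clause plus a nested dismax query.
q = "pagetype:app " + nested("{!dismax qf=content}car")
print(q)

# urlencode escapes the space, the quotes and the curly braces for the URL.
print("/select?" + urlencode({"q": q, "qt": "standard", "debugQuery": "on"}))
```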

The fq (filter query) parameter

As you saw at the beginning of the article, the fq parameter is supported by all search handlers. There are some special things about this parameter:

  • You can also use different query parsers (using the LocalParams syntax). The default (and the only one that really makes sense) is lucene
  • You can have multiple fq parameters in the URL: all of them are evaluated
  • The result of the query is cached (this can be disabled)
  • The filter query does not influence the score calculation, so we can filter without fear of changing the score of individual documents

It is suggested to use filter queries (as the name says) for facet filtering - the caching feature is more helpful for complex query parts.

The advantage is that the filter query uses the filterCache of SOLR. It works something like this:

  1. All filter queries are executed separately. If caching is not disabled, the resulting document ids are cached. So if the same filter query is used again, it is not executed but retrieved from the cache.
  2. Before the final result is returned, SOLR calculates the intersection between the main query result and all filter query results.
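
The two steps can be sketched in Python as follows. This is a toy model and not how SOLR implements it internally: the class name FilterCacheSketch is hypothetical, the "index" is just a dict that maps a filter query string to the matching document ids, and the main query result is a list of ids in score order.

```python
class FilterCacheSketch:
    """Toy model of the filter cache: one cached document set per filter query."""

    def __init__(self, index):
        self.index = index    # stand-in for executing a filter query
        self.cache = {}       # filter query string -> frozenset of doc ids
        self.executions = 0   # counts how often a filter was really executed

    def doc_set(self, fq):
        # Step 1: execute each filter query separately and cache its result.
        if fq not in self.cache:
            self.executions += 1
            self.cache[fq] = frozenset(self.index[fq])
        return self.cache[fq]

    def search(self, main_result, filter_queries):
        # Step 2: intersect the main result with all filter results.
        allowed = set(main_result)
        for fq in filter_queries:
            allowed &= self.doc_set(fq)
        # The order of the main result (and so the scoring) stays untouched.
        return [doc for doc in main_result if doc in allowed]


index = {"pagetype:app": {1, 2, 3}, "language:en": {2, 3, 4}}
searcher = FilterCacheSketch(index)

print(searcher.search([3, 1, 2, 4], ["pagetype:app", "language:en"]))
print(searcher.search([2, 5], ["pagetype:app"]))  # served from the cache
print(searcher.executions)
```

The second search reuses the cached document set for pagetype:app, so only two filter executions happen in total.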

I still want to do some tests to see under which circumstances the filterCache and the intersection calculation may be worse than a single main query (but that's a topic of its own). In general, using filter queries results in better usage of the SOLR caching. See also [8].

Search components

"Search components enable a SearchHandler to chain together reusable pieces of functionality to create custom search handlers without writing code. " [6]

Search components can be enabled for the search request handlers (most of them are already enabled by default). They are configured with certain parameters and normally modify the result by adding more information. This article won't explain them in detail, but for completeness the most important ones are listed here:

  • FacetComponent - adds information that can be used to show filters (facets) with the search result
  • Highlighting - adds preview snippets for the result documents based on the query
  • StatsComponent - similar to the FacetComponent, but only returns information like min and max values of a certain field (useful to display range filters)
  • SpellCheckComponent - advanced spell checking for the query
  • QueryElevationComponent - adds or boosts documents based on an editorially maintained file
  • TermVectorComponent - can return information like frequent terms in a field (if the field is configured to store termVectors)
  • TermsComponent - provides access to the indexed terms in a field - often used for autosuggest
