
Finding more with SOLR

On a recently launched project we had to deal with indexed documents of widely varying quality. In this post I want to share some approaches that you can use with SOLR to match terms fuzzily, also across different languages.

For example: if someone searches for "Autonavigation", the search should find documents containing "car navigation" and "auto navigationsgerät".

There are the following issues to solve here:

  1. You need to split the word "Autonavigation" into "Auto" and "navigation".
  2. You need to search for the untranslated tokens as well as their translations and synonyms.
  3. Be fuzzy enough
  4. Be relevant

Using DictionaryCompoundWordTokenFilterFactory to split nouns

A stable way to split words into meaningful subwords is the solr.DictionaryCompoundWordTokenFilterFactory filter. This filter uses a dictionary of words and splits each token at the dictionary words it detects:

 

<!-- 1: split German compound nouns into subwords -->
<filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="wordlists/german-common-nouns.txt" minWordSize="5" minSubwordSize="4" maxSubwordSize="15" onlyLongestMatch="true"/>
<!-- 2: split English compound nouns into subwords -->
<filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="wordlists/english-common-nouns.txt" minWordSize="5" minSubwordSize="4" maxSubwordSize="15" onlyLongestMatch="true"/>

Using SynonymFilterFactory to translate

 

<filter class="solr.SynonymFilterFactory" synonyms="wordlists/translationsgroups-english-german-nouns.txt" ignoreCase="true" expand="true"/>

The file contains synonym groups such as "auto,car", so that a search always uses both the English and the German version of a noun.

Using the correct Stemmer

Try out different stemmers and analyse the results on the analysis page of the SOLR admin GUI. For German, for example, the stemmers ordered from most to least aggressive are:

 

  • GermanStemFilterFactory
  • SnowballPorterFilterFactory (language German)
  • GermanLightStemFilterFactory
  • GermanMinimalStemFilterFactory
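
One way to compare these stemmers side by side is to define a throwaway field type per stemmer and run the same German text through each of them on the analysis page. The following is only a sketch: the field type names (stemtest_...) are made up for this example, and the tokenizer and lowercase filter are an assumption chosen to keep the four chains comparable.

```xml
<!-- Hypothetical test types for schema.xml; all names are placeholders. -->
<fieldType name="stemtest_german" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.GermanStemFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="stemtest_snowball" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>

<fieldType name="stemtest_light" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.GermanLightStemFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="stemtest_minimal" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.GermanMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>
```

Selecting each of these types in turn on the analysis page and entering the same sample words shows how far each stemmer reduces them.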

Be relevant

With an aggressive stemmer and the token expansions explained above you will get more hits when you search. But there is a risk of finding documents that are not very relevant (the old recall versus precision problem).

Therefore I prefer to have a "text" field that is configured less aggressively and doesn't use language expansion. In addition you can configure an "expandedtext" field that uses the configuration described above.

When searching, you need to issue a query that searches both fields but sets a higher boost on the "text" field.
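
A sketch of how such a query could be set up in solrconfig.xml, assuming the two fields are literally named "text" and "expandedtext" and that the edismax query parser is available. The handler name and the boost values are assumptions for illustration only and need tuning against your own data.

```xml
<!-- Hypothetical handler: searches both fields, "text" weighted higher. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Assumed boosts: a match in the precise field counts four times as much. -->
    <str name="qf">text^2.0 expandedtext^0.5</str>
  </lst>
</requestHandler>
```

On versions without edismax, the older dismax parser accepts the same qf parameter.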

 

The complete "expandedtext" field type for schema.xml looks like this:

<fieldType name="expandedtext" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.HyphenatedWordsFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Case insensitive stop word removal.
         add enablePositionIncrements=true in both the index and query
         analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords_de.txt"
            enablePositionIncrements="true"
    />
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords_en.txt"
            enablePositionIncrements="true"
    />
    <!-- 1: split German compound nouns into subwords -->
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="wordlists/german-common-nouns.txt"
            minWordSize="5" minSubwordSize="4" maxSubwordSize="15" onlyLongestMatch="true"/>
    <!-- 2: split English compound nouns into subwords -->
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="wordlists/english-common-nouns.txt"
            minWordSize="5" minSubwordSize="4" maxSubwordSize="15" onlyLongestMatch="true"/>
    <!-- 3: expand English words to include the German translation -->
    <filter class="solr.SynonymFilterFactory" synonyms="wordlists/translationsgroups-english-german-nouns.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.GermanStemFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.GermanStemFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
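
To complete the two-field setup, schema.xml also needs the less aggressive "text" type and the two fields themselves. The following is a minimal sketch, not the configuration from the project: the analyzer chain of the "text" type, the choice of GermanMinimalStemFilterFactory, and the use of copyField to feed the expanded field are all assumptions.

```xml
<!-- Assumed precise type: no compound splitting, no synonym expansion. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Assumption: the least aggressive German stemmer from the list above. -->
    <filter class="solr.GermanMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- The precise field is stored; the expanded field is only searched. -->
<field name="text" type="text" indexed="true" stored="true"/>
<field name="expandedtext" type="expandedtext" indexed="true" stored="false"/>

<!-- Index the same input a second time through the expanding analyzer. -->
<copyField source="text" dest="expandedtext"/>
```

With copyField the indexing client only has to send the "text" field; Solr passes the raw input to "expandedtext", where it goes through the expanding analyzer chain.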

Downloads:

Actually finding and creating the correct dictionaries and synonym lists takes most of the time. Here are the lists that I created based on the links at the end of the article. The translations were created using Google Translate.

Links for wordlists

Open Source Wordlists:
sourceforge.net/projects/germandict/
http://wordlist.sourceforge.net/

Dictionaries that can be used by open office:
wiki.services.openoffice.org/wiki/Dictionaries

List of most used words in different languages:
en.wiktionary.org/wiki/Wiktionary:Frequency_lists

German synonyms:
www.openthesaurus.de/about/download

Other lists of lists:
www.dict.org/w/databases/dict
fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html

SOLR related links

List of all Filters:
lucene.apache.org/solr/api/org/apache/solr/analysis/package-summary.html

Other Links

Nice blog around text technologies:
www.texttechnologies.com

"Institut für Deutsche Sprache":
www.ids-mannheim.de/kl/projekte/methoden/derewo.html

Solr suggestion of Stemmers:
wiki.apache.org/solr/LanguageAnalysis

Open Source text processing software GATE:
gate.ac.uk

Open Source linguistic text processing LingPipe:
alias-i.com/lingpipe/

Semantic Indexing:
knowledgesearch.org
www.cs.washington.edu/research/textrunner/reverbdemo.html
