Requirements
I need Google-like suggestions in a search box. Solr is already a given. The results should look like this:
searchterm Alex
results Alexander Behling, Alexander Someone ...
searchterm cab
results cable, high voltage cable, cable cutter
The aim is to have phrases as suggestions, not entire fields or excerpts. The query should be case-insensitive: Alex should return the same results as alex, but the suggestions must keep their original case.
The suggestions must be filterable by category: we have the results of several domains in one index, and the suggestions should be filtered by a specific field containing the domain. Note the restriction on contextField: "AnalyzingInfixLookupFactory and BlendedInfixLookupFactory currently support this feature, when backed by DocumentDictionaryFactory."
I tried three approaches
1. Approach : FreeTextLookupFactory
config (no special schema changes):
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">default</str>
<str name="lookupImpl">FreeTextLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">content</str>
<str name="ngrams">3</str>
<str name="separator"> </str>
<str name="suggestFreeTextAnalyzerFieldType">text_general</str>
</lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy" >
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.count">10</str>
<str name="suggest.dictionary">default</str>
<str name="echoParams">explicit</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
This works reasonably well, but delivers only single words.
searchterm Alex
results Alexander, Alexandra ...
The advantage is a very high indexing speed. I tried to combine this with a ShingleFilter, but that didn't work, probably because a ShingleFilter is already part of the FreeTextLookupFactory. With the FreeTextLookupFactory, categories are not supported.
2. Approach : BlendedInfixLookupFactory with custom field
schema:
<field name="suggest_field" type="text_suggest" indexed="true" stored="true" multiValued="true"/>
<field name="site" type="string" stored="true" indexed="true"/>
<copyField source="content" dest="suggest_field"/>
<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<!--filter class="solr.LowerCaseFilterFactory"/-->
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.ShingleFilterFactory"
minShingleSize="2"
maxShingleSize="4"
outputUnigrams="true"
outputUnigramsIfNoShingles="true"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
config:
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">default</str>
<str name="lookupImpl">BlendedInfixLookupFactory</str>
<str name="blenderType">position_linear</str>
<str name="dictionaryimpl">DocumentDictionaryFactory</str>
<str name="field">suggest_field</str>
<str name="weightField">weight</str>
<str name="suggestAnalyzerFieldType">text_suggest</str>
<str name="queryAnalyzerFieldType">phrase_suggest</str>
<str name="indexPath">suggest</str>
<str name="buildOnStartup">false</str>
<str name="buildOnCommit">false</str>
<bool name="exactMatchFirst">true</bool>
<str name="contextField">site</str>
</lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy" >
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.count">10</str>
<str name="suggest.dictionary">default</str>
<str name="echoParams">explicit</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
The second approach leads to behaviour that I find strange:
searchterm Alex or alex
results nothing ...
searchterm cab
results cable, cables, voltage cables, cable accessories, power cables ...
Using the same fields, certain queries return no suggestions at all. Indexing already takes more than 12 hours for fewer than 10,000 entries. With BlendedInfixLookupFactory and DocumentDictionaryFactory, categories should be supported.
But when I use a category in the query:
http://localhost:8983/solr/magnolia/suggest?wt=json&suggest=true&suggest.q=nym&suggest.cfq=com
the results are empty, although the field "site" contains the value "com" many times in the index.
3. Approach : BlendedInfixLookupFactory with HighFrequencyDictionaryFactory and custom field
schema:
<field name="suggest_field" type="text_shingle" indexed="true" stored="true" multiValued="true"/>
...
<copyField source="_text_" dest="suggest_field"/>
...
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_suggestions.txt" format="snowball" />
<!--filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="15"/-->
<filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="4" outputUnigrams="false" outputUnigramsIfNoShingles="true" fillerToken=""/>
</analyzer>
</fieldType>
<!-- marc johnen : used for autocomplete-->
<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
</analyzer>
</fieldType>
config:
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">default</str>
<str name="lookupImpl">BlendedInfixLookupFactory</str>
<str name="dictionaryImpl">HighFrequencyDictionaryFactory</str>
<str name="field">suggest_field</str>
<str name="suggestAnalyzerFieldType">text_suggest</str>
<str name="minPrefixChars">2</str>
<str name="exactMatchFirst">true</str>
<str name="buildOnStartup">false</str>
<str name="buildOnCommit">true</str>
<str name="highlight">false</str>
</lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy" >
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.count">10</str>
<str name="suggest.dictionary">default</str>
<str name="echoParams">explicit</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
The results of this approach are quite good, basically as specified, except for some duplicate phrases: some keywords appear twice because one copy has whitespace at the beginning or end, such as "power cable" and "power cable ".
searchterm Alex
results Alexander Behling, Alexander Someone ...
searchterm cab
results cable, high voltage cable, cable cutter
Indexing easily takes a day for fewer than 10,000 documents. The main problem, though, is that categories are not supported because of the HighFrequencyDictionaryFactory.
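Regarding the whitespace duplicates in this approach, one thing I have not verified yet is whether trimming after shingling removes them. The sketch below is the text_shingle type from above with the TrimFilter moved behind the ShingleFilter. It rests on an assumption: that the stray whitespace comes from the empty fillerToken standing in for removed stop words, so that a shingle ends in a separator with nothing after it.

```xml
<!-- Untested sketch: text_shingle from approach 3, TrimFilter applied last.
     Assumption: trailing/leading spaces originate from the empty fillerToken
     that the ShingleFilter inserts where a stop word was removed. -->
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_suggestions.txt" format="snowball"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="4" outputUnigrams="false" outputUnigramsIfNoShingles="true" fillerToken=""/>
    <!-- Trim every shingle so that "power cable " and "power cable" index as one term -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
```

If the assumption holds, both variants would end up as the same indexed term, and the HighFrequencyDictionaryFactory would see only one entry for them after a rebuild.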
Query
The query I use looks like this:
http://localhost:8983/solr/magnolia/suggest?wt=json&suggest=true&suggest.q=cab
When applicable, I add <str name="contextField">site</str> to the config for categories and &suggest.cfq=com to the query.
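To make the intended category setup concrete, here is a minimal sketch of how I understand the pieces should fit together. It reuses suggest_field, site and text_suggest from approach 2 and follows the restriction quoted in the requirements (an infix lookup backed by DocumentDictionaryFactory). It is not a configuration I have confirmed to work.

```xml
<!-- Sketch only: suggester with category (context) filtering.
     Field and type names are taken from approach 2; the combination
     shown here is an assumption based on the quoted restriction. -->
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">default</str>
    <!-- Context filtering requires one of the infix lookups ... -->
    <str name="lookupImpl">BlendedInfixLookupFactory</str>
    <!-- ... backed by a document dictionary -->
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">suggest_field</str>
    <!-- Field holding the domain/category of each document -->
    <str name="contextField">site</str>
    <str name="suggestAnalyzerFieldType">text_suggest</str>
    <str name="buildOnStartup">false</str>
    <str name="buildOnCommit">false</str>
  </lst>
</searchComponent>
```

With this in place, a request such as /suggest?wt=json&suggest=true&suggest.q=cab&suggest.cfq=com is what I would expect to return only suggestions from documents whose site field is "com".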