Google like autosuggest / typeahead (suggesting keywords / phrases) with Solr

Requirements

I need Google-like suggestions in a search box. Solr is already a given. The results should look like this:

searchterm Alex
results Alexander Behling, Alexander Someone ...

searchterm cab
results cable, high voltage cable, cable cutter

The aim is to suggest phrases, not entire fields or excerpts. The query should be case-insensitive: Alex should return the same results as alex, but the suggestions must keep their original case.

The suggestions must be filterable by category. We keep the results of several domains in one index, and the suggestions should be filtered by a specific field containing the domain. Note that contextField is limited: "AnalyzingInfixLookupFactory and BlendedInfixLookupFactory currently support this feature, when backed by DocumentDictionaryFactory."

I tried three approaches

1. Approach : FreeTextLookupFactory

config (no special schema changes): 
     <searchComponent name="suggest" class="solr.SuggestComponent">
        <lst name="suggester">
          <str name="name">default</str>
          <str name="lookupImpl">FreeTextLookupFactory</str> 
          <str name="dictionaryImpl">DocumentDictionaryFactory</str>
          <str name="field">content</str>
          <str name="ngrams">3</str>
          <str name="separator"> </str>
          <str name="suggestFreeTextAnalyzerFieldType">text_general</str>
        </lst>
    </searchComponent>

    <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy" >
      <lst name="defaults">
        <str name="suggest">true</str>
        <str name="suggest.count">10</str>
        <str name="suggest.dictionary">default</str>        
        <str name="echoParams">explicit</str>
      </lst>
      <arr name="components">
         <str>suggest</str>
      </arr>
    </requestHandler>

This works reasonably well, but delivers only single words:
searchterm Alex
results Alexander, Alexandra ...
The advantage is a very high indexing speed. I tried to combine this with a ShingleFilter, but that didn't work, probably because a ShingleFilter is already part of the FreeTextLookupFactory. With the FreeTextLookupFactory, categories are not supported.

2. Approach : BlendedInfixLookupFactory with custom field

schema:
<field name="suggest_field" type="text_suggest" indexed="true" stored="true" multiValued="true"/>
<field name="site" type="string" stored="true" indexed="true"/>
<copyField source="content" dest="suggest_field"/>

    <fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <!--filter class="solr.LowerCaseFilterFactory"/-->
                <filter class="solr.TrimFilterFactory"/>
                <filter class="solr.ShingleFilterFactory" 
                    minShingleSize="2"
                    maxShingleSize="4"
                    outputUnigrams="true"
                    outputUnigramsIfNoShingles="true"/>
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.KeywordTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
           </analyzer>
    </fieldType>

config:
<searchComponent name="suggest" class="solr.SuggestComponent">
   <lst name="suggester">
      <str name="name">default</str>
      <str name="lookupImpl">BlendedInfixLookupFactory</str>
      <str name="blenderType">position_linear</str>
      <str name="dictionaryImpl">DocumentDictionaryFactory</str>
      <str name="field">suggest_field</str>
      <str name="weightField">weight</str>
      <str name="suggestAnalyzerFieldType">text_suggest</str>
      <str name="queryAnalyzerFieldType">phrase_suggest</str>
      <str name="indexPath">suggest</str>
      <str name="buildOnStartup">false</str>
      <str name="buildOnCommit">false</str>
      <bool name="exactMatchFirst">true</bool>
      <str name="contextField">site</str>
   </lst> 
</searchComponent>

    <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy" >
      <lst name="defaults">
        <str name="suggest">true</str>
        <str name="suggest.count">10</str>
        <str name="suggest.dictionary">default</str>        
        <str name="echoParams">explicit</str>
      </lst>
      <arr name="components">
         <str>suggest</str>
      </arr>
    </requestHandler>

The second approach leads to behaviour that I find strange:

searchterm Alex or alex
results nothing ...
searchterm cab
results cable, cables, voltage cables, cable accessories, power cables ...

Using the same fields, certain queries return no results at all. Indexing already takes more than 12 hours for fewer than 10,000 entries. With BlendedInfixLookupFactory and DocumentDictionaryFactory, categories should be supported. But when I use a category in the query, the results are empty:

http://localhost:8983/solr/magnolia/suggest?wt=json&suggest=true&suggest.q=nym&suggest.cfq=com

The field "site" does contain the value "com" multiple times in the index.

3. Approach : BlendedInfixLookupFactory with HighFrequencyDictionaryFactory and custom field

schema:

 <field name="suggest_field" type="text_shingle" indexed="true" stored="true" multiValued="true"/>
...
<copyField source="_text_" dest="suggest_field"/>
...
    <fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
           <charFilter class="solr.HTMLStripCharFilterFactory"/>
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.TrimFilterFactory"/>
           <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_suggestions.txt" format="snowball" />
           <!--filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="15"/-->
           <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="4" outputUnigrams="false" outputUnigramsIfNoShingles="true" fillerToken=""/>
        </analyzer>
    </fieldType>
    <!-- marc johnen : used for autocomplete-->
    <fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
          <analyzer>
             <tokenizer class="solr.StandardTokenizerFactory"/>
             <filter class="solr.LowerCaseFilterFactory"/>
             <filter class="solr.TrimFilterFactory"/>
          </analyzer>
    </fieldType>

config:
    <searchComponent name="suggest" class="solr.SuggestComponent">
      <lst name="suggester">
        <str name="name">default</str>
        <str name="lookupImpl">BlendedInfixLookupFactory</str>
        <str name="dictionaryImpl">HighFrequencyDictionaryFactory</str>
        <str name="field">suggest_field</str>
        <str name="suggestAnalyzerFieldType">text_suggest</str>
        <str name="minPrefixChars">2</str>
        <str name="exactMatchFirst">true</str>
        <str name="buildOnStartup">false</str> 
        <str name="buildOnCommit">true</str>
        <str name="highlight">false</str>
      </lst>
    </searchComponent>

    <requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy" >
      <lst name="defaults">
        <str name="suggest">true</str>
        <str name="suggest.count">10</str>
        <str name="suggest.dictionary">default</str>        
        <str name="echoParams">explicit</str>
      </lst>
      <arr name="components">
         <str>suggest</str>
      </arr>
    </requestHandler>

The results of this approach are quite good, basically as specified. The one exception is some duplicate phrases: certain keywords appear twice because one copy has whitespace at the beginning or end, like "power cable" and "power cable ".
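
If the stray whitespace comes from the shingles themselves, trimming after shingling rather than before might help. This is an assumption, not something verified against this index: with fillerToken="", a removed stopword may leave only the separator behind, which would produce tokens such as "power cable ". A minimal sketch of the text_shingle index analyzer with that reordering (the PatternReplaceFilterFactory and RemoveDuplicatesTokenFilterFactory steps are additions for illustration, not part of the original config):

```xml
<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_suggestions.txt" format="snowball"/>
        <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="4" outputUnigrams="false" outputUnigramsIfNoShingles="true" fillerToken=""/>
        <!-- assumption: collapse runs of whitespace left where a stopword was dropped -->
        <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
        <!-- trim only after shingling, so leading/trailing separators are removed -->
        <filter class="solr.TrimFilterFactory"/>
        <!-- drop tokens that became identical at the same position -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
</fieldType>
```

Changing the index analyzer only takes effect after a full reindex and a rebuild of the suggester.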

searchterm Alex
results Alexander Behling, Alexander Someone ...

searchterm cab
results cable, high voltage cable, cable cutter

Indexing easily takes a day for fewer than 10,000 documents. The main problem, though, is that categories are not supported because of the HighFrequencyDictionaryFactory.

Query

The query I use looks like this:

http://localhost:8983/solr/magnolia/suggest?wt=json&suggest=true&suggest.q=cab

Where applicable, I add <str name="contextField">site</str> to the config for categories and &suggest.cfq=com to the query.
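
For reference, here is a minimal sketch of where contextField sits in a suggester that meets the restriction quoted in the requirements (an infix lookup backed by DocumentDictionaryFactory). The suggester name siteAware is hypothetical; the fields suggest_field and site and the field type text_suggest are taken from the second approach. This is a sketch under those assumptions, not a verified fix:

```xml
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <!-- "siteAware" is a hypothetical name used only in this sketch -->
    <str name="name">siteAware</str>
    <str name="lookupImpl">BlendedInfixLookupFactory</str>
    <!-- the key is spelled dictionaryImpl, with a capital I -->
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">suggest_field</str>
    <!-- field whose values are matched against suggest.cfq -->
    <str name="contextField">site</str>
    <str name="suggestAnalyzerFieldType">text_suggest</str>
    <str name="buildOnStartup">false</str>
    <str name="buildOnCommit">false</str>
  </lst>
</searchComponent>
```

A matching request would then name the dictionary explicitly, for example http://localhost:8983/solr/magnolia/suggest?wt=json&suggest=true&suggest.dictionary=siteAware&suggest.q=cab&suggest.cfq=com, after the suggester has been built once with suggest.build=true. Keep in mind that DocumentDictionaryFactory reads the stored values of the field, so the suggestions would be the stored field values rather than the indexed shingles.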

Copyist answered 2/6, 2021 at 20:5

I ended up using the FreeTextLookupFactory and created a separate field and suggester for each language.
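
A minimal sketch of what such a setup could look like, reusing the parameters from the first approach. The suggester names suggest_en and suggest_de and the fields content_en and content_de are hypothetical; the actual names used are not given here:

```xml
<searchComponent name="suggest" class="solr.SuggestComponent">
  <!-- one suggester per language; names and fields are hypothetical -->
  <lst name="suggester">
    <str name="name">suggest_en</str>
    <str name="lookupImpl">FreeTextLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">content_en</str>
    <str name="ngrams">3</str>
    <str name="separator"> </str>
    <str name="suggestFreeTextAnalyzerFieldType">text_general</str>
  </lst>
  <lst name="suggester">
    <str name="name">suggest_de</str>
    <str name="lookupImpl">FreeTextLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">content_de</str>
    <str name="ngrams">3</str>
    <str name="separator"> </str>
    <str name="suggestFreeTextAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>
```

The language is then selected per request with the suggest.dictionary parameter, e.g. &suggest.dictionary=suggest_de, instead of filtering with suggest.cfq.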

Copyist answered 11/6, 2021 at 15:18