Search with various combinations of space, hyphen, casing and punctuations
My schema:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="1" splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>

Combinations that I want to work:

"Walmart", "WalMart", "Wal Mart", "Wal-Mart", "Wal-mart"

Given any of these strings as a query, I want to match all the others.

So, there are 25 such combinations as given below:

(First column denotes input text for search, second column denotes expected match)

(Walmart,Walmart)
(Walmart,WalMart)
(Walmart,Wal Mart)
(Walmart,Wal-Mart)
(Walmart,Wal-mart)
(WalMart,Walmart)
(WalMart,WalMart)
(WalMart,Wal Mart)
(WalMart,Wal-Mart)
(WalMart,Wal-mart)
(Wal Mart,Walmart)
(Wal Mart,WalMart)
(Wal Mart,Wal Mart)
(Wal Mart,Wal-Mart)
(Wal Mart,Wal-mart)
(Wal-Mart,Walmart)
(Wal-Mart,WalMart)
(Wal-Mart,Wal Mart)
(Wal-Mart,Wal-Mart)
(Wal-Mart,Wal-mart)
(Wal-mart,Walmart)
(Wal-mart,WalMart)
(Wal-mart,Wal Mart)
(Wal-mart,Wal-Mart)
(Wal-mart,Wal-mart)

Cases that currently fail with my schema (searching with the term on the left does not match the term on the right):

1. "Wal-Mart" -> "Walmart",
2. "Wal Mart" -> "Walmart",
3. "Walmart"  -> "Wal Mart",
4. "Wal-mart" -> "Walmart",
5. "WalMart"  -> "Walmart"

Screenshot of the analyzer:

Analyzer screenshot using initial schema

I tried various combinations of filters to resolve these limitations, and stumbled upon the solution provided at: Solr - case-insensitive search do not work

While it seems to overcome one of my limitations (see #5, WalMart -> Walmart), it is overall worse than what I had earlier. Now it does not work for cases like:

(Wal Mart,WalMart), 
(Wal-Mart,WalMart), 
(Wal-mart,WalMart), 
(WalMart,Wal Mart)

besides cases 1 to 4 mentioned above.

Analyzer screenshot after the schema change

Questions:

  1. Why does "WalMart" not match "Walmart" with my initial schema? The Solr analyzer clearly shows that 3 tokens were produced at index time: wal, mart, walmart. At query time it produced just 1 token: walmart (it is not clear why it produces only 1). I fail to understand why there is no match, given that walmart is contained in both the query and the index tokens.

  2. The problem I mentioned here is just a single use-case. There are slightly more complex ones, like:

    Words with apostrophes: "Mc Donalds", "Mc Donald's", "McDonald's", "Mc donalds", "Mc donald's", "Mcdonald's"

    Words with different punctuations: "Mc-Donald Engineering Company, Inc."

In general, what's the best way to go about modeling the schema for this kind of requirement? NGrams? Indexing the same data in different fields (in different formats) via the copyField directive (https://wiki.apache.org/solr/SchemaXml#Indexing_same_data_in_multiple_fields)? What are the performance implications of each?
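To make the copyField option concrete, here is a minimal sketch of indexing the same data under several analyses; all field and type names below (name_exact, text_ngram, etc.) are hypothetical, not taken from any particular schema:

```xml
<!-- Sketch only: the same source text indexed under multiple analyses. -->
<field name="name"       type="text"       indexed="true" stored="true"/>
<field name="name_exact" type="string"     indexed="true" stored="false"/>
<field name="name_ngram" type="text_ngram" indexed="true" stored="false"/>

<!-- copyField duplicates the raw input into each destination at index time -->
<copyField source="name" dest="name_exact"/>
<copyField source="name" dest="name_ngram"/>
```

Each destination field is analyzed independently, so index size (and indexing time) grows roughly with the number of copies.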

EDIT: The default operator in my Solr schema is AND. I cannot change it to OR.

Arsenide answered 21/4, 2015 at 21:14 Comment(0)

Upgrading the Lucene version (4.4 to 4.10) in solrconfig.xml magically fixed the problem! I no longer hit any of the limitations, and my query analyzer behaves as expected too.

Arsenide answered 14/5, 2015 at 4:49 Comment(1)
Going from 4.4 to 4.10 is an Upgrade ... :) – Elkins

We treated hyphenated words as a special case and wrote a custom analyzer, used at index time, that creates three versions of such a token; in your case wal-mart would become walmart, wal mart and wal-mart. Each of these synonyms was emitted by a custom SynonymFilter initially adapted from an example in the Lucene in Action book. The SynonymFilter sat between the Whitespace tokenizer and the LowerCase filter.

At search time, any of the three versions will match one of the synonyms in the index.
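As a rough sketch of the chain described above (the factory class name is hypothetical; you would implement it yourself):

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- Hypothetical custom factory: for a token like "wal-mart" it emits
       "walmart", "wal mart" and "wal-mart" as same-position synonyms -->
  <filter class="com.example.HyphenSynonymFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```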

Vair answered 22/4, 2015 at 0:58 Comment(5)
Thanks for taking the time to answer. SynonymFilters would work if I had a good data set of synonyms, which unfortunately I don't. – Arsenide
Wouldn't it be possible to scan your index for hyphenated words and work with them? That may not be perfect, but it's a start. – Vair
The hyphen is just one of several scenarios; there are other kinds of punctuation as well. I am afraid we could never scale with such special cases :) – Arsenide
I am combining your answer with femtoRgon's, and that's going to be exactly what I want. Do you mind pointing to an example of writing a custom SynonymFilter and how it is used in a custom Analyzer? – Arsenide
The example I based my code on can be found in Lucene in Action (Section 4.6) – the Lucene version there is 3.x, I believe, which is what our original code was written against too. The code will need to be updated for 4.x, since the Analysis API changed between 3.x and 4.x. – Vair

Why does "WalMart" not match "Walmart" with my initial schema?

Because you have defined the mm parameter of your DisMax/eDisMax handler with too high a value. I have played around with it: when you set mm to 100%, you get no match. But why?

Because you are using the same analyzer for query and index time. Your search term "WalMart" is split into 3 tokens (words), namely "wal", "mart" and "walmart". Solr then treats each word individually when counting towards the <str name="mm">100%</str>*.

By the way, I have reproduced your problem, but for me it occurs when indexing Walmart and querying with WalMart. The other way around, it works fine.

You can override this by using LocalParams; you could rephrase your query like this: {!mm=1}WalMart.
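For illustration, mm can also be set as a default on the request handler in solrconfig.xml; a sketch only (handler name and the remaining defaults are elided):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- require only 1 clause to match instead of all of them -->
    <str name="mm">1</str>
  </lst>
</requestHandler>
```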

There are more slightly complex ones like [ ... ] "Mc Donald's" [ to match ] Words with different punctuations: "Mc-Donald Engineering Company, Inc."

Here, too, playing with the mm parameter helps.

In general, what's the best way to go around modeling the schema with this kind of requirement?

Here I agree with Sujit Pal: you should implement your own copy of the SynonymFilter. Why? Because it works differently from the other filters and tokenizers: it creates tokens in place, at the same position and offset as the indexed words.

Why does in place matter? It will not increase the token count of your query, and it lets you perform the reverse of hyphenation (joining two words that are separated by a blank).

But we are lacking a good synonyms.txt and cannot keep it up-to-date.

When extending or copying the SynonymFilter, ignore the static mapping; you may remove the code that maps the words. You just need the offset handling.

Update: I think you can also try the PatternCaptureGroupTokenFilter, but tackling company names with regular expressions will soon hit its limits. I will have a look into this later.


* You can find this in your solrconfig.xml; look for your <requestHandler ... />

Elkins answered 11/5, 2015 at 13:45 Comment(0)

I'll take the liberty of first making some adjustments to the analyzer. I'd consider WordDelimiterFilter to be, functionally, a second round of tokenization, so let's put it right after the tokenizer. After that there is no need to maintain case, so lowercasing comes next. That is also better for your StopFilter, since we no longer need to worry about ignoreCase. Then add the stemmer.

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"
        words="stopwords.txt"
        enablePositionIncrements="true"
        />
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>

All in all, this isn't too far off. The main problem is "Wal Mart" vs "Walmart". WordDelimiterFilter has nothing to do with this pair; it's the tokenizer doing the splitting here. "Wal Mart" gets split by the tokenizer, while "Walmart" never gets split, since nothing can reasonably know where it should be split up.

One solution would be to use KeywordTokenizer instead and let WordDelimiterFilter do all of the tokenizing, but that leads to other problems (in particular, longer, more complex text like your "Mc-Donald Engineering Company, Inc." example becomes problematic).
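For completeness, that rejected alternative would look roughly like this (a sketch, not a recommendation):

```xml
<analyzer>
  <!-- The whole input becomes a single token;
       WordDelimiterFilter then does all of the splitting -->
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```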

Instead, I'd recommend a ShingleFilter. This allows you to combine adjacent tokens into a single token to search on. This means, when indexing "Wal Mart", it will take the tokens "wal" and "mart" and also index the term "walmart". Normally, it would also insert a separator, but for this case, you'll want to override that behavior, and specify a separator of "".

We'll put the ShingleFilter at the end now (it'll tend to screw up stemming if you put it before the stemmer):

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"
        words="stopwords.txt"
        enablePositionIncrements="true"
        />
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" tokenSeparator=""/>

This will only create shingles of 2 consecutive tokens (while also keeping the original single tokens), so I'm assuming you don't need to match longer runs than that (you would, for instance, if you needed "doremi" to match "Do Re Mi"). But for the examples given, this works in my tests.
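If you ever did need three-word joins like "Do Re Mi" -> "doremi", you could in principle raise the shingle size (an untested sketch; larger shingles inflate the index):

```xml
<filter class="solr.ShingleFilterFactory"
        minShingleSize="2" maxShingleSize="3" tokenSeparator=""/>
```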

Sharonsharona answered 22/4, 2015 at 8:28 Comment(6)
While this removes the limitation for the Walmart -> Wal Mart case, it's worse overall, as these 3 cases which passed earlier now fail: Wal-Mart -> Wal Mart, Wal-mart -> Wal Mart, WalMart -> Wal Mart. For the other use-case of McDonald's, these cases fail as well: McDonald's -> Mc Donald's, McDonald's -> Mc Donalds, McDonald's -> Mc donald's, McDonald's -> Mc donalds – Arsenide
Did you reindex after making changes to the analyzer? – Sharonsharona
I started from a clean slate, restarted Solr and reran my tests (which do indexing followed by querying). – Arsenide
Don't know what to tell you. Sounds like a mismatched analyzer somewhere. I tried a number of those cases exactly, and they work for me. – Sharonsharona
May I ask what version of Solr you are using? Also the Lucene version, if that matters? – Arsenide
Can you also please share your schema file and Solr settings, if you don't mind me asking? – Arsenide

© 2022 - 2024 — McMap. All rights reserved.