My schema:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1" generateNumberParts="1"
catenateWords="1" catenateNumbers="1" catenateAll="0"
splitOnCaseChange="1" splitOnNumerics="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English"
protected="protwords.txt"/>
</analyzer>
</fieldType>
Combinations that I want to work:
"Walmart", "WalMart", "Wal Mart", "Wal-Mart", "Wal-mart"
Given any of these strings, I want to find the other one.
So, there are 25 such combinations as given below:
(First column denotes input text for search, second column denotes expected match)
(Walmart,Walmart)
(Walmart,WalMart)
(Walmart,Wal Mart)
(Walmart,Wal-Mart)
(Walmart,Wal-mart)
(WalMart,Walmart)
(WalMart,WalMart)
(WalMart,Wal Mart)
(WalMart,Wal-Mart)
(WalMart,Wal-mart)
(Wal Mart,Walmart)
(Wal Mart,WalMart)
(Wal Mart,Wal Mart)
(Wal Mart,Wal-Mart)
(Wal Mart,Wal-mart)
(Wal-Mart,Walmart)
(Wal-Mart,WalMart)
(Wal-Mart,Wal Mart)
(Wal-Mart,Wal-Mart)
(Wal-Mart,Wal-mart)
(Wal-mart,Walmart)
(Wal-mart,WalMart)
(Wal-mart,Wal Mart)
(Wal-mart,Wal-Mart)
(Wal-mart,Wal-mart)
Current limitations with my schema:
1. "Wal-Mart" -> "Walmart",
2. "Wal Mart" -> "Walmart",
3. "Walmart" -> "Wal Mart",
4. "Wal-mart" -> "Walmart",
5. "WalMart" -> "Walmart"
Screenshot of the analyzer:
I tried various combinations of filters trying to resolve these limitations, so I got stumbled by the solution provided at: Solr - case-insensitive search do not work
While it seems to overcome one of the limitations that I have (see #5 WalMart -> Walmart), it is overall worse than what I had earlier. Now it does not work for cases like:
(Wal Mart,WalMart),
(Wal-Mart,WalMart),
(Wal-mart,WalMart),
(WalMart,Wal Mart)
besides cases 1 to 4 as mentioned above
Analyzer after schema change:
Questions:
Why does "WalMart" not match "Walmart" with my initial schema ? Solr analyzer clearly shows me that it had produced 3 tokens during index time:
wal
,mart
,walmart
. During query time: It has produced 1 token:walmart
(while it's not clear why it would produce just 1 token), I fail to understand why it does not match given thatwalmart
is contained in both query and index tokens.The problem that I mentioned here is just a single use-case. There are more slightly complex ones like:
Words with apostrophes: "Mc Donalds", "Mc Donald's", "McDonald's", "Mc donalds", "Mc donald's", "Mcdonald's"
Words with different punctuations: "Mc-Donald Engineering Company, Inc."
In general, what's the best way to go around modeling the schema with this kind of requirement ? NGrams ? Index same data in different fields (in different formats) and use copyField directive (https://wiki.apache.org/solr/SchemaXml#Indexing_same_data_in_multiple_fields) ? What are the performance implications of this ?
EDIT: The default operator in my Solr schema is AND. I cannot change it to OR.