Solr search dash in part number

Asked 30/4, 2015 at 19:26 Answered 2/6, 2015 at 14:4

I'm having some difficulties with either how to construct the Solr query, or how to set up the schema to get searches in our web store to work better.

First some configuration (Solr 4.2.1)

<field name="mfgpartno" type="text_en_splitting_tight" indexed="true" stored="true" />
<field name="mfgpartno_sort" type="string" indexed="true" stored="false" />
<field name="mfgpartno_search" type="sku_partial" indexed="true" stored="true" />

<copyField source="mfgpartno" dest="mfgpartno_sort" />
<copyField source="mfgpartno" dest="mfgpartno_search" />

<fieldType name="sku_partial" class="solr.TextField" omitTermFreqAndPositions="true">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="1" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
        <filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="100" side="front" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
    </analyzer>
</fieldType>

Let me break this down into stages (I'm only going to go into enough to replicate the problem - the initial stages aren't using edismax, that is what we've chosen to use on our website):

1. q=DV\-5PBRP <- With this query I get 18 results, but not the one I'm looking for (this is most likely due to the default df searching on the productname field - fine).
2. q=mfgpartno_search:DV\-5PBRP <- This gives me the 1 result I'm looking for, but due to the query building I need to do on the website, it's better if I can use the q parameter as in stage 1.
3. q=DV\-5PBRP&defType=edismax&qf=mfgpartno_search <- This also gives me the 1 result I'm looking for, but again, for the website search qf needs to span more fields (actual qf = productname_search shortdesc_search fulldesc_search mfgpartno_search productname shortdesc fulldesc keywords). To get more accurate results across those fields I implemented stage 4.
4. q=DV\-5PBRP&defType=edismax&qf=mfgpartno_search&q.op=AND <- With this test I get 0 results, though q.op=AND works great for most searches on our site.

My big problem with search has been special characters like the dash, which sometimes must be literal and sometimes act as separators, as in product names or descriptions. Sometimes people will even replace the dash with a space in a part number search, and it should still show relevant data.

I'm kind of stuck on how to get this special character search working - especially as it pertains to this mfgpartno_search field. How might I configure either the schema or query (or both) to get this working?

Babirusa answered 30/4, 2015 at 19:26 Comment(0)

Ok, I think the problem was being over-thought.

I had assumed (based on my config) that the example part number might be indexed like so:

DV-5PBRP -> {DV 5PBRP, DV5PBRP, DV-5PBRP} + NGrams

I had also assumed doing a search on "DV-5PBRP" (literal dash) would match that third option (using a query like #4 in my question).

Yesterday I was alerted to this problem by the same user again, and I got to thinking let's try removing the separator in the search. So now the search has become:

q=DV5PBRP&defType=edismax&qf=mfgpartno_search&q.op=AND

I got the result I was looking for, which means that my solr config is at least giving me an index like the second index option.

Now I've started stripping separator characters from user input before submitting the search to Solr. This seems to work beautifully!
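
A minimal sketch of that clean-up step, written in Python for illustration (the site itself uses PHP). The function name `strip_separators` and the set of characters treated as separators are assumptions, not taken from the actual site code:

```python
import re

# Assumed separator set: whitespace, dash, underscore, slash and dot.
# Adjust this to whatever actually appears in your part numbers.
SEPARATORS = re.compile(r"[\s\-_/.]+")


def strip_separators(user_input: str) -> str:
    """Remove separator characters so 'DV-5PBRP' and 'DV 5PBRP' both become 'DV5PBRP'."""
    return SEPARATORS.sub("", user_input)


print(strip_separators("DV-5PBRP"))
print(strip_separators("DV 5PBRP"))
```

This only suits input that is meant as a part number. Applied to a product name or description search, it would also glue separate words together.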

Babirusa answered 2/6, 2015 at 14:4 Comment(0)

Maybe you could try the Regular Expression Pattern Tokenizer, and make a suitable regular expression for your article numbers. Lucene (which Solr is built upon) is very focused on tokenization for prose.

What you want here is probably an N-gram split, as well as 1-grams? And maybe that dashes are replaced with spaces, something like

DV-5PBRP -> {DV 5PBRP, DV, 5P, BR, PB, RP, D, V, 5, P, B, R}

As you can see, the index will be quite large even for very small fields. Make sure the ranking of the results is heavily weighted toward the larger n-grams.

I do think you should remove the stop word list for the article numbers field.

The N-gram size should probably start at 1 or 2.

Simply make sure the various analyzers don't:

swallow the dash
remove single or few characters (these are often in stop word lists)
remove numbers
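
To see how the minimum gram size bears on the second point, here is a rough Python sketch of n-gram generation. It is not Solr's implementation: the function name `ngrams` is made up for illustration, and it assumes that a token shorter than the minimum size produces no grams at all, which is worth confirming on the Analysis page of the Solr admin UI.

```python
def ngrams(token, min_size=4, max_size=100):
    """Return every substring of token with a length between min_size and max_size."""
    grams = []
    longest = min(max_size, len(token))
    for size in range(min_size, longest + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


# With a minimum size of 4, a two-character fragment yields nothing.
print(ngrams("dv"))
# The catenated form yields grams of length 4 to 7.
print(ngrams("dv5pbrp"))
# Starting at size 1 keeps short fragments, at the cost of many more terms.
print(ngrams("dv", min_size=1))
```

Under that assumption, a short fragment such as `dv` would leave nothing in the index when the minimum size is 4, so a query term `dv` could not match it.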

Beograd answered 11/5, 2015 at 22:29 Comment(2)

If you looked at my configuration as far as I can see the indexer is doing all you said. WordDelimiterFilterFactory has preserveOriginal so as not to swallow the dash, the StopFilterFactory is only using the default stopwords.txt, and the NGramFilterFactory has a minGramSize of 4. Less than that tends to have much larger indexes, plus more false positives. – Babirusa 12/5, 2015 at 13:16

Also besides the NGrams in the index, I would also want to see these: DV-5PBRP -> {DV 5PBRP, DV5PBRP, DV-5PBRP} – Babirusa 12/5, 2015 at 13:21

If you are using the HTTP GET method, please encode the search term before sending it, using

URLEncoder.encode(searchWord,"UTF-8")

This is for Java. If you are not using Java, use the corresponding encoding function in your language. This helps avoid problems with characters such as spaces and "/".
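
For readers not on Java, a rough Python equivalent could look like the sketch below. It uses the standard library's `urllib.parse.urlencode`; the parameter values come from the question, and the variable names are made up for illustration.

```python
from urllib.parse import urlencode

# Parameters from the question's stage 4 query.
# The backslash escapes the dash for Solr's query parser.
params = {
    "q": "DV\\-5PBRP",
    "defType": "edismax",
    "qf": "mfgpartno_search",
    "q.op": "AND",
}

# urlencode percent-encodes reserved characters, so the backslash becomes %5C.
query_string = urlencode(params)
print(query_string)
```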

Harp answered 12/5, 2015 at 7:28 Comment(3)

I'm actually using PHP. I've got my query string for the get method being encoded by http_build_query to handle encoding it properly, then followed by a regex to replace array identifiers in the string "[]" as I found SOLR doesn't like them, but handles multiple terms as an array. It also fixes some characters that shouldn't have been encoded, ^()*:" – Babirusa 12/5, 2015 at 13:13
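
As a point of comparison in Python (this is not the PHP code described above), the standard library's `urlencode` with `doseq=True` emits a repeated parameter once per value, without `[]` suffixes. The `fq` values below are hypothetical placeholders.

```python
from urllib.parse import urlencode

params = {
    "q": "DV5PBRP",
    # Hypothetical filter values, only here to show a multi-valued parameter.
    "fq": ["instock:true", "category:cables"],
}

# doseq=True expands the list into one fq=... pair per value.
query_string = urlencode(params, doseq=True)
print(query_string)
```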

After using the regex, are you sure it's encoded and sent? – Harp 14/5, 2015 at 6:5

Absolutely - and the tests above were all performed using the Solr admin pages and the web form Solr provides (I of course didn't escape the dash in the form, as it would handle that itself). – Babirusa 14/5, 2015 at 14:20