Search in solr with special characters

I have a problem searching for special characters in Solr. My documents have a field "title", and its value can be something like "Titanic - 1999" (it contains the character "-"). When I search in Solr for "-", I receive a 400 error. I tried escaping the character, with something like "-" and "\-". With those changes Solr no longer responds with an error, but it returns 0 results.

How can I search in the Solr admin for a special character such as "-" or "'"?

Regards

UPDATE: Here you can see my current Solr schema: https://gist.github.com/cpalomaresbazuca/6269375

I am searching on the field "Title".

Excerpt from the schema.xml:

 ...
 <!-- A general text field that has reasonable, generic
     cross-language defaults: it tokenizes with StandardTokenizer,
     removes stop words from case-insensitive "stopwords.txt"
     (empty by default), and down cases.  At query time only, it
     also applies synonyms. -->
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            <!-- in this example, we will only use synonyms at query time
             <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
             -->
            <filter class="solr.LowerCaseFilterFactory"/>

        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>

        </analyzer>
    </fieldType>
...
<field name="Title" type="text_general" indexed="true" stored="true"/>
Perish answered 16/8, 2013 at 16:4 Comment(4)
Do you put inverted commas round it when you search? Like select?q=title:"Titanic - 1999". Putting it in inverted commas should do an exact search. - Precisian
What does your schema look like for this field? I am interested to know what field definition you have for this field. - Ricardo
<field name="title" type="text_general" stored="true" indexed="true"/> - Precisian
@AllanMacmillan I've tried that and it works, but when someone enters just "-" it doesn't. That's my problem. I've updated my question with the Solr schema. - Perish

You are using the standard text_general field type for the title field. This might not be a good choice: text_general is meant for large chunks of text (or at least sentences), not for exact matching of names or titles.

The problem here is that text_general uses the StandardTokenizerFactory.

 <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            <!-- in this example, we will only use synonyms at query time
             <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
             -->
            <filter class="solr.LowerCaseFilterFactory"/>
        
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            
        </analyzer>
    </fieldType>

StandardTokenizerFactory does the following:

A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are type-aware of the same token types.

This means the '-' character is treated as a token boundary and then discarded, so it never reaches the index.

"kong-fu" will be represented as "kong" and "fu". The '-' disappears.

This also explains why select?q=title:\- won't work here: the index contains no "-" token for the query to match.

Choose a better fitting field type:

Instead of the StandardTokenizerFactory you could use the solr.WhitespaceTokenizerFactory, which splits only on whitespace, for exact matching of words. So defining your own field type for the title field would be a solution.

Solr also has a field type called text_ws. Depending on your requirements this might be enough.
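
As a rough illustration of such a custom field type, a minimal sketch could look like the following. The name text_title is made up for this example, and the lowercase filter is an assumption added so that matching stays case-insensitive; neither comes from the original schema.

```xml
<!-- Hypothetical field type "text_title" (name chosen for illustration):
     it splits on whitespace only, so the standalone "-" in
     "Titanic - 1999" is kept as a token of its own. -->
<fieldType name="text_title" class="solr.TextField" positionIncrementGap="100">
    <!-- A single analyzer without a type is used for both index and query time -->
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- Assumption: lowercasing is kept for case-insensitive matching -->
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

<!-- Point the existing Title field at the new type -->
<field name="Title" type="text_title" indexed="true" stored="true"/>
```

After a schema change like this the documents have to be re-indexed. With this analysis "Titanic - 1999" should produce the tokens "titanic", "-" and "1999", so an escaped query such as select?q=Title:\- has a token to match. Note that a hyphen attached to a word, as in "kong-fu", stays part of the single token "kong-fu".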

Submerse answered 2/3, 2015 at 18:20 Comment(0)

To search for your exact phrase, put inverted commas around it:

select?q=title:"Titanic - 1999" 

If you just want to search for that special character then you will need to escape it:

select?q=title:\-

Also check: Special characters (-&+, etc) not working in SOLR Query

If you know exactly which special characters you don't want to use, you can add this to regex-normalize.xml:

<regex> 
  <pattern>&#x2D;</pattern> 
  <substitution>%2D</substitution> 
</regex>

This will replace every "-" with %2D, so as long as you search for %2D instead of "-", it will work fine.

Precisian answered 19/8, 2013 at 14:23 Comment(3)
I've tried select?q=title:\- but it still returns 0 results :( How can I know if the character "-" is not being indexed? - Perish
Try what I suggested in the second half, changing the regex-normalize.xml. I tried it myself and it works perfectly. - Precisian
It should be in the conf folder. - Precisian

I spent a lot of time getting this to work. Here are clear step-by-step instructions for querying special characters in Solr. Hope it helps someone.

  1. Edit the schema.xml file and find the solr.TextField that you are using.
  2. Under both the "index" and "query" analyzers, modify the WordDelimiterFilterFactory and add types="characters.txt". Something like:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
        <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter catenateAll="0" catenateNumbers="0" catenateWords="0" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1" types="characters.txt"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter catenateAll="0" catenateNumbers="0" catenateWords="0" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1" types="characters.txt"/>
        </analyzer>
    </fieldType>
    
  3. Ensure that you use WhitespaceTokenizerFactory as the tokenizer as shown above.

  4. Your characters.txt file can have entries like the following, each one mapping a character to ALPHA only:

    \# => ALPHA
    @ => ALPHA
    \u0023 => ALPHA
    
  5. Clear the data, re-index and query for the entered characters. It will work.

Mucilaginous answered 27/7, 2016 at 7:51 Comment(0)
