Symbols in query-string for Elasticsearch

I have "documents" (ActiveRecord models) with an attribute called deviation. The attribute has values like "Bin X", "Bin $", "Bin q", "Bin %", etc.

I am trying to use tire/elasticsearch to search the attribute. I am using the whitespace analyzer to index the deviation attribute. Here is my code for creating the indexes:

settings :analysis => {
    :filter  => {
      :ngram_filter => {
        :type => "nGram",
        :min_gram => 2,
        :max_gram => 255
      },
      :deviation_filter => {
        :type => "word_delimiter",
        :type_table => ['$ => ALPHA']
      }
    },
    :analyzer => {
      :ngram_analyzer => {
        :type  => "custom",
        :tokenizer  => "standard",
        :filter  => ["lowercase", "ngram_filter"]
      },
      :deviation_analyzer => {
        :type => "custom",
        :tokenizer => "whitespace",
        :filter => ["lowercase"]
      }
    }
  } do
    mapping do
      indexes :id, :type => 'integer'
      [:equipment, :step, :recipe, :details, :description].each do |attribute|
        indexes attribute, :type => 'string', :analyzer => 'ngram_analyzer'
      end
      indexes :deviation, :analyzer => 'whitespace'
    end
  end

The search seems to work fine when the query string contains no special characters. For example, Bin X returns only those records that contain both the words Bin AND X. However, searching for something like Bin $ or Bin % returns every result that contains the word Bin, almost ignoring the symbol (results with the symbol do show up higher in the search than results without).

Here is the search method I have created

def self.search(params)
    tire.search(load: true) do
        query { string "#{params[:term].downcase}:#{params[:query]}", default_operator: "AND" }
        size 1000
    end
end
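
One detail of the query string built above may be relevant. In Lucene's documented query syntax, a field prefix applies only to the term directly after it, so `deviation:Bin $` looks for `Bin` in `deviation` and sends `$` to the default field. Wrapping the terms in parentheses scopes all of them to the field. The sketch below is only an illustration: `field_scoped_query` is a hypothetical helper, not part of Tire, and whether this changes the symbol matching on this index is an untested assumption.

```ruby
# Hypothetical helper (not a Tire API): builds a query string in which
# the field prefix covers every term, using Lucene's field grouping syntax.
def field_scoped_query(field, terms)
  "#{field.downcase}:(#{terms})"
end

# Same inputs the search form would send for "Deviation" and "Bin $".
puts field_scoped_query("Deviation", "Bin $")
```

The resulting string could be passed to `string` in place of the interpolated one, keeping `default_operator: "AND"`.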

and here is how I am building the search form:

<div>
    <%= form_tag issues_path, :class=> "formtastic issue", method: :get do %>
        <fieldset class="inputs">
        <ol>
            <li class="string input medium search query optional stringish inline">
                <% opts = ["Description", "Detail","Deviation","Equipment","Recipe", "Step"] %>
                <%= select_tag :term, options_for_select(opts, params[:term]) %>
                <%= text_field_tag :query, params[:query] %>
                <%= submit_tag "Search", name: nil, class: "btn" %>
            </li>
        </ol>
        </fieldset>
    <% end %>
</div>
Augustineaugustinian answered 25/4, 2013 at 2:19 Comment(4)
Can't you just escape the characters that have a meaning to Lucene with a backslash? Of course, in a Ruby string you'd need a double backslash \\ to escape the Ruby character before it hits the Elasticsearch API. I've not tried Tire, so I don't know if it works in your world. FYI, here is a quick reference to the characters affected: docs.lucidworks.com/display/lweug/…Derekderelict
I don't think this is the issue because queries Bin $ or Bin % are affected, but they are not listed in the link above as a special character.Augustineaugustinian
I know from my own experience of full text search in databases (Oracle I think it was, and MySQL for LIKE tests in varchar or text fields) that % is a 'match everything' character. Maybe that link above is incomplete, or maybe it's not relevant to your issue. Have you tried escaping to see if that solves the problem?Derekderelict
Escaping the special characters with \ (for example Bin \%) or \\ (for example Bin \\%) has no effect on the behavior.Augustineaugustinian

You can sanitize your query string. Here is a sanitizer that works for everything that I've tried throwing at it:

def sanitize_string_for_elasticsearch_string_query(str)
  # Escape special characters
  # http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/queryparsersyntax.html#Escaping Special Characters
  escaped_characters = Regexp.escape('\\/+-&|!(){}[]^~*?:')
  str = str.gsub(/([#{escaped_characters}])/, '\\\\\1')

  # AND, OR and NOT are used by lucene as logical operators. We need
  # to escape them
  ['AND', 'OR', 'NOT'].each do |word|
    escaped_word = word.split('').map {|char| "\\#{char}" }.join('')
    str = str.gsub(/\s*\b(#{word.upcase})\b\s*/, " #{escaped_word} ")
  end

  # Escape odd quotes
  quote_count = str.count '"'
  # The pattern has only two capture groups, so the tail is \2 (not \3)
  str = str.gsub(/(.*)"(.*)/, '\1\"\2') if quote_count % 2 == 1

  str
end

params[:query] = sanitize_string_for_elasticsearch_string_query(params[:query])
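
To see what the sanitizer does to typical input, here is a small standalone check. The function is repeated (with `\2` in the quote step) only so the snippet runs by itself; the sample strings are made up for illustration.

```ruby
# Sanitizer repeated from the answer so this snippet is self-contained.
def sanitize_string_for_elasticsearch_string_query(str)
  # Backslash-escape Lucene's special characters
  escaped_characters = Regexp.escape('\\/+-&|!(){}[]^~*?:')
  str = str.gsub(/([#{escaped_characters}])/, '\\\\\1')

  # Backslash-escape each letter of the boolean operators
  ['AND', 'OR', 'NOT'].each do |word|
    escaped_word = word.split('').map { |char| "\\#{char}" }.join('')
    str = str.gsub(/\s*\b(#{word.upcase})\b\s*/, " #{escaped_word} ")
  end

  # Escape the last quote when the quotes are unbalanced
  quote_count = str.count '"'
  str = str.gsub(/(.*)"(.*)/, '\1\"\2') if quote_count % 2 == 1
  str
end

# Illustrative inputs only; print each one next to its sanitized form.
['Bin $', 'Bin %', 'Bin (X)', '123/345', 'Bin AND X', 'say "hi'].each do |raw|
  puts "#{raw.inspect} -> #{sanitize_string_for_elasticsearch_string_query(raw).inspect}"
end
```

Note that `$` and `%` are not in the escaped set, so `Bin $` and `Bin %` pass through unchanged; the sanitizer guards against query parse errors rather than changing how those symbols are analyzed.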
Saum answered 8/5, 2013 at 13:48 Comment(7)
I needed to add forward slash also to the escaped_characters array. escaped_characters = Regexp.escape('\\+-&|!(){}[]^~*?:\/') as it was breaking for strings with forward slash.Kinship
That's strange since / is not a special character in Lucene: lucene.apache.org/core/old_versioned_docs/versions/2_9_1/…Saum
Hi, please see 50.16.250.253:9200/locations/location/_search?q=123%2F345 ..I think this is giving an error, because / is inside the string...when I escape with a `\`, the error is resolved, 50.16.250.253:9200/locations/location/_search?q=123%5C%2F345Kinship
Hi, the quote-escaping regexp should be str = str.gsub(/(.*)"(.*)/, '\1\"\2') if quote_count % 2 == 1, because there are only two groupsOperose
Just want to note: forward slash is now a special character and should be escaped. lucene.apache.org/core/4_6_0/queryparser/org/apache/lucene/…Caryophyllaceous
I've translated this solution to Scala here: #32108101Seymourseys
I've translated the solution to Python here gist.github.com/eranhirs/5c9ef5de8b8731948e6ed14486058842Pentamerous
