Analyzers in Elasticsearch

I'm having trouble understanding the concept of analyzers in Elasticsearch with the Tire gem. I'm actually a newbie to these search concepts. Can someone help me with a reference article, or explain what analyzers actually do and why they are used?

I see different analyzers mentioned in Elasticsearch, like keyword, standard, simple, and snowball. Without knowing what analyzers do, I can't work out which one fits my needs.

Thacher answered 11/10, 2012 at 9:41 Comment(3)
I actually just found this awesome blog post about how analyzers work in elasticsearch, with concrete examples: found.no/foundation/text-analysis-part-1 – Fart
That was really worth reading for a beginner to start off with... Thanks @Fart – Thacher
technocratsid.com/elasticsearch-analyzers – Giorgio

Let me give you a short answer.

An analyzer is used at index time and at search time. It's used to create an index of terms.

To index a phrase, it is useful to break it into words. This is where the analyzer comes in.

It applies tokenizers and token filters. A tokenizer could be a whitespace tokenizer: it splits a phrase into tokens at each space. A lowercase tokenizer splits a phrase at each non-letter and lowercases all letters.

A token filter is used to filter or convert some tokens. For example, an ASCII folding filter will convert characters like ê, é, è to e.

An analyzer is a mix of all of that.
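
For example, you can combine a tokenizer and token filters ad hoc with the Analyze API and inspect the resulting tokens. A minimal sketch, assuming a local node on the default port (the request-body syntax shown here is for recent Elasticsearch versions; older versions took query-string parameters instead):

    curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
    {
      "tokenizer": "whitespace",
      "filter": ["lowercase", "asciifolding"],
      "text": "Déjà Vu"
    }'

This should come back as the two tokens deja and vu: the whitespace tokenizer splits on the space, lowercase lowercases each token, and asciifolding folds é and à to plain ASCII.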

You should read the Analysis guide and look on the right at all the different options you have.

By default, Elasticsearch applies the standard analyzer. At the time of this answer it also removed all common English words (among other filtering); recent versions keep stopword filtering disabled by default.

You can also use the Analyze API to understand how it works. It's very useful.
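
For instance, a quick sketch (same local-node assumption as above) that shows what the standard analyzer does to a sentence:

    curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
    {
      "analyzer": "standard",
      "text": "The QUICK Brown Foxes."
    }'

On a recent version this returns the lowercased tokens the, quick, brown, foxes; on the older versions discussed above, the would also be dropped as a stopword.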

Affirmatory answered 11/10, 2012 at 18:58 Comment(2)
You can play with this plugin to understand analyzers, tokenizers, and filters a little bit better. – Achlorhydria
It seems it does not work for the edgeNGram tokenizer and filter. – Alienate

In Lucene, an analyzer is a combination of a tokenizer (splitter), a stemmer, and a stopword filter.

In Elasticsearch, an analyzer is a combination of:

  1. Character filter: "tidies up" a string before it is tokenized, e.g. by removing HTML tags (see the sketch after this list).
  2. Tokenizer: breaks the string up into individual terms or tokens. There must be exactly one.
  3. Token filter: changes, adds, or removes tokens. A stemmer is an example of a token filter; it reduces a word to its base form, e.g. happy and happiness both share the base happi.
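
As a quick illustration of a character filter, here is a minimal sketch using the Analyze API against a local node (body syntax for recent Elasticsearch versions), with html_strip removing the markup before tokenization:

    curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
    {
      "char_filter": ["html_strip"],
      "tokenizer": "standard",
      "text": "<p>Some <b>HTML</b> text</p>"
    }'

The tags are stripped by the character filter, so the tokenizer only ever sees Some HTML text.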

See the Snowball demo here.

This is a sample setting:

    {
      "settings": {
        "index": {
          "analysis": {
            "analyzer": {
              "analyzerWithSnowball": {
                "tokenizer": "standard",
                "filter": ["standard", "lowercase", "englishSnowball"]
              }
            },
            "filter": {
              "englishSnowball": {
                "type": "snowball",
                "language": "English"
              }
            }
          }
        }
      }
    }
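
Once an index has been created with these settings (my_index below is a placeholder name), you can verify the stemming with the index-scoped Analyze API. Note that the standard token filter in the chain above existed in the Elasticsearch versions of that era but was removed in recent ones, so drop it there:

    curl -X GET "localhost:9200/my_index/_analyze?pretty" -H 'Content-Type: application/json' -d'
    {
      "analyzer": "analyzerWithSnowball",
      "text": "happy happiness"
    }'

Both words should come back as the single stem happi, which is exactly what lets them match each other at search time.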

Ref:

  1. Comparison of Lucene Analyzers
  2. http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/custom-analyzers.html
Burdett answered 16/1, 2015 at 4:12 Comment(0)

Here's an awesome plugin on GitHub. It's essentially an extension of the Analyze API. I found it on the official Elastic plugin list.

What's great is that it shows the tokens with all their attributes after every single step. This makes it easy to debug an analyzer configuration and to see why we got the tokens we did and where we lost the ones we wanted.

Wish I had found it sooner. Thanks to it, I just found out why my keyword_repeat token filter seemed not to work correctly. The problem was caused by the next token filter in the chain: icu_transform (used for transliteration), which unfortunately didn't respect the keyword attribute and transformed all of the tokens. I don't know how else I would have found the cause if not for this plugin.
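
For reference, a minimal sketch of the kind of filter chain described above (the analyzer and filter names here are made up, and the icu_transform token filter requires the analysis-icu plugin to be installed):

    {
      "settings": {
        "analysis": {
          "analyzer": {
            "transliterating_analyzer": {
              "tokenizer": "standard",
              "filter": ["keyword_repeat", "greek_latin"]
            }
          },
          "filter": {
            "greek_latin": {
              "type": "icu_transform",
              "id": "Greek-Latin"
            }
          }
        }
      }
    }

The point of keyword_repeat is to emit each token twice, one copy flagged with the keyword attribute so that later filters leave it alone; the bug described above was that icu_transform transformed both copies anyway.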

Proprietor answered 19/6, 2015 at 17:22 Comment(0)
