Elasticsearch "pattern_replace", replacing whitespaces while analyzing

Basically, I want to remove all whitespace and tokenize the whole string as a single token. (I will use nGram on top of that later on.)

These are my index settings:

"settings": {
 "index": {
  "analysis": {
    "filter": {
      "whitespace_remove": {
        "type": "pattern_replace",
        "pattern": " ",
        "replacement": ""
      }
    },
    "analyzer": {
      "meliuz_analyzer": {
        "filter": [
          "lowercase",
          "whitespace_remove"
        ],
        "type": "custom",
        "tokenizer": "standard"
      }
    }
  }
}

I also tried "pattern": "\\u0020" and "pattern": "\\s" instead of "pattern": " ".

But when I analyze the text "beleza na web", it still produces three separate tokens, "beleza", "na", and "web", instead of the single token "belezanaweb".
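
For reference, this is roughly how I am testing it with the _analyze API (my_index is just a placeholder for the real index name):

# Run the custom analyzer against the sample text (Elasticsearch 1.x style request).
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=meliuz_analyzer&pretty' -d 'beleza na web'
# The response still lists three tokens: "beleza", "na" and "web".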

Emphasize answered 26/4, 2015 at 3:23 Comment(0)

An analyzer processes a string by running it through a tokenizer first and then applying a series of token filters. You specified the standard tokenizer, which means the input has already been split into separate tokens by the time the pattern_replace filter runs, so the filter is applied to each token individually (and none of those tokens contains a space to replace).

Use the keyword tokenizer instead of the standard tokenizer. The rest of the mapping is fine. You can change your mapping as below:

"settings": {
 "index": {
  "analysis": {
    "filter": {
      "whitespace_remove": {
        "type": "pattern_replace",
        "pattern": " ",
        "replacement": ""
      }
    },
    "analyzer": {
      "meliuz_analyzer": {
        "filter": [
          "lowercase",
          "whitespace_remove",
          "nGram"
        ],
        "type": "custom",
        "tokenizer": "keyword"
      }
    }
  }
}
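
To verify, you can run the same kind of _analyze request against an index that uses these settings (my_index is a placeholder; note that the built-in nGram filter defaults to min_gram 1 and max_gram 2, so define your own nGram filter if you need longer grams):

# With the keyword tokenizer the whole input is one token; lowercase and
# whitespace_remove turn it into "belezanaweb" before nGram splits it into grams.
curl -XGET 'localhost:9200/my_index/_analyze?analyzer=meliuz_analyzer&pretty' -d 'Beleza na Web'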
Shawntashawwal answered 26/4, 2015 at 4:32 Comment(1)
To expand further on why you would use the Keyword Tokenizer over the Standard Tokenizer: the Keyword Tokenizer emits the entire input as a single token, whereas the Standard Tokenizer splits the input on its standard list of separators (which can be customized). So the sentence "stack overflow" is tokenized as "stack" and "overflow" by the Standard Tokenizer, but the Keyword Tokenizer produces the single token "stack overflow", which the regex can then work on as one input. Bronchi
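
A quick way to see that difference on a local node is the bare _analyze endpoint with a tokenizer parameter, as in the 1.x docs:

curl -XGET 'localhost:9200/_analyze?tokenizer=standard&pretty' -d 'stack overflow'
# two tokens: "stack", "overflow"
curl -XGET 'localhost:9200/_analyze?tokenizer=keyword&pretty' -d 'stack overflow'
# one token: "stack overflow"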
