Split a string in Painless/ELK

I have a string field "myfield.keyword", where entries have the following format:

AAA_BBBB_CC

DDD_EEE_F

I am trying to create three scripted fields: one that outputs the substring before the first _, one that outputs the substring between the first and second _, and one that outputs the substring after the second _.

I was trying to use .split('_') to do this, but found that this method is not available in Painless:

def newfield = "";
def path = doc['myfield.keyword'].value;
if (...)
{newfield = path.split('_')[1];} else {newfield="null";}
return newfield;

I then tried the workaround suggested here, but found that I must enable regexes in Elastic (which would not be possible in my case):

def newfield = "";
def path = doc['myfield.keyword'].value;
if (...)
{newfield = /_/.split(path)[1];} else {newfield="null";}
return newfield;

Is there a way to do this that does not presuppose enabling regexes?

EDIT (after answer):

My question was not well formed. In particular, the string that needs to be split has four occurrences of '_'. Something like:

AAA_BB_CCC_DD_E 

FFF_GGG_HH_JJJJ_KK

So, if I understand correctly, indexOf() and lastIndexOf() alone cannot give me BB, CCC or DD. I thought that I could adapt your solution and find the index of the second and third occurrences of _ by using string.indexOf("_", 1) and string.indexOf("_", 2). However, I always get the same result as string.indexOf("_") without any extra parameters (i.e. the result is always the index of the first occurrence of _).

Whited answered 5/11, 2020 at 9:15 Comment(0)

Enabling regular expressions is not terribly complicated, but it requires restarting your cluster, which might not be easy depending on your environment.

Another way to achieve this is to do it the "old way". First, create one reusable stored script that all of the script fields share. The script finds the first, second, third and last occurrence of the _ symbol and returns one of the split elements. It takes as input the name of the field to split and the index of the substring to return:

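POST _scripts/my-split
{
  "script": {
    "lang": "painless",
    "source": """
      // Assumes the value contains four underscores, e.g. AAA_BB_CCC_DD_E.
      // With fewer, one of the substring() calls below gets begin > end and throws.
      def str = doc[params.field].value;
      // Index of each underscore, searching onward from the previous one
      def first = str.indexOf("_");
      def second = first + 1 + str.substring(first + 1).indexOf("_");
      def third = second + 1 + str.substring(second + 1).indexOf("_");
      def last = str.lastIndexOf("_");
      def parts = [
           str.substring(0, first),
           str.substring(first + 1, second),
           str.substring(second + 1, third),
           str.substring(third + 1, last),
           str.substring(last + 1)
      ];
      // params.index selects which part to return (0-based)
      return parts[params.index];
    """
  }
}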

Then you can simply define one script field for each of the parts like this:

POST test/_search
{
  "script_fields": {
    "first": {
      "script": {
        "id": "my-split",
        "params": {
          "field": "myfield.keyword",
          "index": 0
        }
      }
    },
    "second": {
      "script": {
        "id": "my-split",
        "params": {
          "field": "myfield.keyword",
          "index": 1
        }
      }
    },
    "third": {
      "script": {
        "id": "my-split",
        "params": {
          "field": "myfield.keyword",
          "index": 2
        }
      }
    }
  }
}

The response you get will look like this (shown for the original value AAA_BBBB_CC, before the script was extended to four underscores):

  {
    "_index" : "test",
    "_type" : "_doc",
    "_id" : "ykS-l3UBeO1HTBdDvTZd",
    "_score" : 1.0,
    "fields" : {
      "first" : [
        "AAA"
      ],
      "second" : [
        "BBBB"
      ],
      "third" : [
        "CC"
      ]
    }
  }
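
On the indexOf("_", 1) attempt from the question's edit: the second argument of indexOf is the position to start searching from, not the number of the occurrence, which is why indexOf("_", 1) and indexOf("_", 2) both return 3 for AAA_BB_CCC_DD_E. A minimal sketch of how that overload can walk through every occurrence is below. It is written in plain Java so that it can be run on its own; as far as I know Painless exposes the same String.indexOf(String, int) method, but I have not run this as a Painless script. The class name UnderscoreSplit and the helper splitLiteral are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class UnderscoreSplit {
    // Hypothetical helper: splits str on a literal delimiter, no regex involved.
    static List<String> splitLiteral(String str, String delim) {
        if (delim.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> parts = new ArrayList<>();
        int start = 0;
        // The second argument is the index to start searching from.
        int pos = str.indexOf(delim, start);
        while (pos >= 0) {
            parts.add(str.substring(start, pos));
            start = pos + delim.length();   // continue just past this delimiter
            pos = str.indexOf(delim, start);
        }
        parts.add(str.substring(start));    // remainder after the last delimiter
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitLiteral("AAA_BB_CCC_DD_E", "_")); // [AAA, BB, CCC, DD, E]
        System.out.println(splitLiteral("AAA_BBBB_CC", "_"));     // [AAA, BBBB, CC]
    }
}
```

Because the loop stops when no further delimiter is found, the same code handles any number of underscores; the caller only has to check that the requested index is smaller than the size of the returned list.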
Fetter answered 5/11, 2020 at 9:38 Comment(2)
Thank you for such an elegant solution, Val. This answers the question I asked. My question, however, was not well formed. In particular, the string that needs to be split has four occurrences of '_'. Something like: and so indexOf() and last... - Whited
I've updated my script, which now handles strings with four occurrences of _. Let me know how it works for you. - Fetter

You could use str.splitOnToken("_"), which returns the parts as an array that you can index or loop over as needed. The token does not have to be a single character; you can split on longer tokens such as:

def message = "[LOG] Something something WARNING: Your warning";
def reason = message.splitOnToken("WARNING: ")[1];

So reason will hold the remaining string: Your warning.

Small answered 7/12, 2021 at 15:7 Comment(0)

To complement George Ts.'s great answer:

This worked for me in Painless with the delimiter '-', and it also works in OpenSearch Dashboards:

// "message" is assumed to be the input string, defined earlier
String[] parts = message.splitOnToken("-");

// Return the first part if the delimiter was found
if (parts.length > 1) {
    return parts[0];  // Adjust index as needed
} else {
    return 'empty';
}
Latvia answered 6/11 at 12:41 Comment(0)
