Parse logs containing python tracebacks using logstash
Asked Answered
E

1

8

I have been trying to parse my python traceback logs using logstash. My logs look like this:

[pid: 26422|app: 0|req: 73/73] 192.168.1.1 () {34 vars in 592 bytes} [Wed Feb 18 13:35:55 2015] GET /data => generated 2538923 bytes in 4078 msecs (HTTP/1.1 200) 2 headers in 85 bytes (1 switches on core 0)
Traceback (most recent call last):
  File "/var/www/analytics/parser.py", line 257, in parselogfile
    parselogline(basedir, lne)
  File "/var/www/analytics/parser.py", line 157, in parselogline
    pval = understandpost(parts[3])
  File "/var/www/analytics/parser.py", line 98, in understandpost
    val = json.loads(dct["events"])
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Unterminated string starting at: line 1 column 355 (char 354)

So far I've been able to parse the logs except the last line i.e.

ValueError: Unterminated string starting at: line 1 column 355 (char 354)

I'm using the multiline filter to do so. My logstash configuration looks something like this:

filter {

    multiline {
        pattern => "^Traceback"
        what => "previous"
    }

    multiline {
        pattern => "^ "
        what => "previous"
    }


    grok {
        match => [
            "message", "\[pid\: %{NUMBER:process_id:int}\|app: 0\|req: %{NUMBER}/%{NUMBER}\] %{IPORHOST:clientip} \(\) \{%{NUMBER:vars:int} vars in %{NUMBER:bytes:int} bytes\} \[%{GREEDYDATA:timestamp}\] %{WORD:method} /%{GREEDYDATA:referrer} \=\> generated %{NUMBER:generated_bytes:int} bytes in %{NUMBER} msecs \(HTTP/%{NUMBER} %{NUMBER:status_code:int}\) %{NUMBER:headers:int} headers in %{NUMBER:header_bytes:int} bytes \(%{NUMBER:switches:int} switches on core %{NUMBER:core:int}\)%{GREEDYDATA:traceback}"
            ]
    }

    if "_grokparsefailure" in [tags] {
        multiline {
            pattern => "^.*$"
            what => "previous"
            negate => "true"
        }
    }

    if "_grokparsefailure" in [tags] {
        grok {
            match => [
                  "message", "\[pid\: %{NUMBER:process_id:int}\|app: 0\|req: %{NUMBER}/%{NUMBER}\] %{IPORHOST:clientip} \(\) \{%{NUMBER:vars:int} vars in %{NUMBER:bytes:int} bytes\} \[%{GREEDYDATA:timestamp}\] %{WORD:method} /%{GREEDYDATA:referrer} \=\> generated %{NUMBER:generated_bytes:int} bytes in %{NUMBER} msecs \(HTTP/%{NUMBER} %{NUMBER:status_code:int}\) %{NUMBER:headers:int} headers in %{NUMBER:header_bytes:int} bytes \(%{NUMBER:switches:int} switches on core %{NUMBER:core:int}\)%{GREEDYDATA:traceback}"
        ]
            remove_tag => ["_grokparsefailure"]
        }
    }
}

But my last line isn't parsing. Instead it is still giving me an error and has also increased the processing time exponentially. Any advice on how to parse the last line of the traceback?

Ellanellard answered 22/6, 2015 at 10:27 Comment(3)
I've not seen a config with three multiline configs before. Usually, you find the pattern that identifies the beginning of one section (which for you could be "[pid" or "Traceback") and then combine everything to the next occurrence into one message.Bronny
Did some homework and solved it using one multiline filter, I just had to identify the first line which starts with [ for my log identification and other lines are to be appended using the multiline filter. I will post the solution here in case someone else needs to parse python logs.Ellanellard
Also three multiline filters used too much processing power and slowed the logstash down whereas with one such filter, it works like a charm!Ellanellard
E
8

Well, I found a solution. So the approach I followed is that I will ignore the starting of a log message which starts with '['and all the other lines will be appended at the end of the previous message. Then grok filter can be applied and the traceback can be parsed. Note that I have to apply two grok filters:

  1. For when there is a traceback with GREEDYDATA to get the traceback.

  2. For when there is no traceback, GREEDYDATA parsing fails and I'll have to remove the _grokparsefailure tag and then again apply grok without GREEDYDATA. This is done with the help of if block.

The final logstash filter looks something like this:

filter {

    multiline {
        pattern => "^[^\[]"
        what => "previous"
    }



    grok {
        match => [
            "message", "\[pid\: %{NUMBER:process_id:int}\|app: 0\|req: %{NUMBER}/%{NUMBER}\] %{IPORHOST:clientip} \(\) \{%{NUMBER:vars:int} vars in %{NUMBER:bytes:int} bytes\} \[%{GREEDYDATA:timestamp}\] %{WORD:method} /%{GREEDYDATA:referrer} \=\> generated %{NUMBER:generated_bytes:int} bytes in %{NUMBER} msecs \(HTTP/%{NUMBER} %{NUMBER:status_code:int}\) %{NUMBER:headers:int} headers in %{NUMBER:header_bytes:int} bytes \(%{NUMBER:switches:int} switches on core %{NUMBER:core:int}\)%{GREEDYDATA:traceback}"
        ]
    }

    if "_grokparsefailure" in [tags] {
        grok {
            match => [
            "message", "\[pid\: %{NUMBER:process_id:int}\|app: 0\|req: %{NUMBER}/%{NUMBER}\] %{IPORHOST:clientip} \(\) \{%{NUMBER:vars:int} vars in %{NUMBER:bytes:int} bytes\} \[%{GREEDYDATA:timestamp}\] %{WORD:method} /%{GREEDYDATA:referrer} \=\> generated %{NUMBER:generated_bytes:int} bytes in %{NUMBER} msecs \(HTTP/%{NUMBER} %{NUMBER:status_code:int}\) %{NUMBER:headers:int} headers in %{NUMBER:header_bytes:int} bytes \(%{NUMBER:switches:int} switches on core %{NUMBER:core:int}\)"
                ]
            remove_tag => ["_grokparsefailure"]
        }
    }

    else {
        mutate {
            convert => {"traceback" => "string"}
        }
    }

    date {
        match => ["timestamp", "dd/MM/YYYY:HH:MM:ss Z"]
        locale => en
    }
    geoip {
        source => "clientip"
    }
    useragent {
        source => "agent"
        target => "Useragent"
    }
}

Alternatively, if you don't want to use the if block to check another grok pattern and remove the _grokparsefailure, you can use the first grok filter to check for both the message types by including multiple message-pattern checks in the match array of grok filter. It can be done like this:

        grok {
            match => [
            "message", "\[pid\: %{NUMBER:process_id:int}\|app: 0\|req: %{NUMBER}/%{NUMBER}\] %{IPORHOST:clientip} \(\) \{%{NUMBER:vars:int} vars in %{NUMBER:bytes:int} bytes\} \[%{GREEDYDATA:timestamp}\] %{WORD:method} /%{GREEDYDATA:referrer} \=\> generated %{NUMBER:generated_bytes:int} bytes in %{NUMBER} msecs \(HTTP/%{NUMBER} %{NUMBER:status_code:int}\) %{NUMBER:headers:int} headers in %{NUMBER:header_bytes:int} bytes \(%{NUMBER:switches:int} switches on core %{NUMBER:core:int}\)",
            "message", "\[pid\: %{NUMBER:process_id:int}\|app: 0\|req: %{NUMBER}/%{NUMBER}\] %{IPORHOST:clientip} \(\) \{%{NUMBER:vars:int} vars in %{NUMBER:bytes:int} bytes\} \[%{GREEDYDATA:timestamp}\] %{WORD:method} /%{GREEDYDATA:referrer} \=\> generated %{NUMBER:generated_bytes:int} bytes in %{NUMBER} msecs \(HTTP/%{NUMBER} %{NUMBER:status_code:int}\) %{NUMBER:headers:int} headers in %{NUMBER:header_bytes:int} bytes \(%{NUMBER:switches:int} switches on core %{NUMBER:core:int}\)%{GREEDYDATA:traceback}"
                ]
        }

And there is a third approach as well (possibly the most elegant one). It looks something like this:

grok {
    match => [
        "message", "\[pid\: %{NUMBER:process_id:int}\|app: 0\|req: %{NUMBER}/%{NUMBER}\] %{IPORHOST:clientip} \(\) \{%{NUMBER:vars:int} vars in %{NUMBER:bytes:int} bytes\} \[%{GREEDYDATA:timestamp}\] %{WORD:method} /%{GREEDYDATA:referrer} \=\> generated %{NUMBER:generated_bytes:int} bytes in %{NUMBER} msecs \(HTTP/%{NUMBER} %{NUMBER:status_code:int}\) %{NUMBER:headers:int} headers in %{NUMBER:header_bytes:int} bytes \(%{NUMBER:switches:int} switches on core %{NUMBER:core:int}\)(%{GREEDYDATA:traceback})?"
    ]
}

Note that in this method, the field whose existence is optional has to be enclosed in "()?". Here, (%{GREEDYDATA:traceback})?

Thus, the grok filter sees that if the field is available, it will be parsed. Otherwise, it will be skipped.

Ellanellard answered 23/6, 2015 at 8:30 Comment(6)
The multiline looks better! Note that you can make the GREEDYDATA portion optional in your pattern. (foo)? is the general syntax, I believe.Bronny
Let me search for it, a simpler and smaller approach is always plausible! ;)Ellanellard
Found one. It doesn't use foo, instead we can define multiple message match patterns in the first grok filter, therefore eliminating the need of if block. Editing my answer to reference that approach as well.Ellanellard
Making it optional in your pattern will be easier to maintain than listing two slightly different patterns in the grok. It's not the "foo" that's important, but the " ( )? ".Bronny
I tried it and yes, it worked. Indeed an elegant approach! Editing the answer again to reference this method as well.Ellanellard
This pattern worked for me too. But since I'll be using Filebeat with Logstash, the documentation suggests managing multiline messages on Filebeat instead.Hadst

© 2022 - 2024 — McMap. All rights reserved.