From time to time we find that some logs are missing in Elasticsearch, even though we can see them in Kubernetes.
The only problems I was able to find point to an issue with the kubernetes filter's JSON merging, with entries like this in the fluent-bit logs:
[2020/11/22 09:53:18] [debug] [filter:kubernetes:kubernetes.1] could not merge JSON log as requested
The problem seems to go away (at least there are no more warn/error messages in the fluent-bit logs) once we set the kubernetes filter's "Merge_Log" option to "Off". But then, of course, we lose important functionality: the records end up with no fields/values other than "message" itself.
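For context, my understanding of Merge_Log is that it tries to parse the contents of the record's "log" field as JSON and lift the parsed keys into the record; when an application writes plain text instead of JSON, the filter emits the "could not merge JSON log as requested" message at debug level. A made-up illustration of the two cases, as the lines would look in the container log files (standard docker json-file format):

A line that merges cleanly (the "log" payload is itself valid JSON):
{"log":"{\"level\":\"info\",\"msg\":\"user created\"}\n","stream":"stdout","time":"2020-11-22T09:53:18.000000000Z"}
A line that cannot be merged (plain-text payload), which triggers the debug message above:
{"log":"user created id=42\n","stream":"stdout","time":"2020-11-22T09:53:18.000000000Z"}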
There is no other error/warn message in either fluent-bit or Elasticsearch, which is why this is my main suspect. The log (with log_level set to info) is filled with:
k --context contexto09 -n logging-system logs -f -l app=fluent-bit --max-log-requests 31 | grep -iv "\[ info\]"
[2020/11/22 19:45:02] [ warn] [engine] failed to flush chunk '1-1606074289.692844263.flb', retry in 25 seconds: task_id=31, input=appstream > output=es.0
[2020/11/22 19:45:02] [ warn] [engine] failed to flush chunk '1-1606074208.938295842.flb', retry in 25 seconds: task_id=67, input=appstream > output=es.0
[2020/11/22 19:45:08] [ warn] [engine] failed to flush chunk '1-1606074298.662911160.flb', retry in 10 seconds: task_id=76, input=appstream > output=es.0
[2020/11/22 19:45:13] [ warn] [engine] failed to flush chunk '1-1606074310.619565119.flb', retry in 9 seconds: task_id=77, input=appstream > output=es.0
[2020/11/22 19:45:13] [ warn] [engine] failed to flush chunk '1-1606073869.655178524.flb', retry in 1164 seconds: task_id=33, input=appstream > output=es.0
[2020/11/22 19:45:18] [ warn] [engine] failed to flush chunk '1-1606074298.662911160.flb', retry in 282 seconds: task_id=76, input=appstream > output=es.0
[2020/11/22 19:45:21] [ warn] [engine] failed to flush chunk '1-1606073620.626120246.flb', retry in 1974 seconds: task_id=8, input=appstream > output=es.0
[2020/11/22 19:45:21] [ warn] [engine] failed to flush chunk '1-1606074050.441691966.flb', retry in 1191 seconds: task_id=51, input=appstream > output=es.0
[2020/11/22 19:45:22] [ warn] [engine] failed to flush chunk '1-1606074310.619565119.flb', retry in 79 seconds: task_id=77, input=appstream > output=es.0
[2020/11/22 19:45:22] [ warn] [engine] failed to flush chunk '1-1606074319.600878876.flb', retry in 6 seconds: task_id=78, input=appstream > output=es.0
[2020/11/22 19:45:09] [ warn] [engine] failed to flush chunk '1-1606073576.849876665.flb', retry in 1091 seconds: task_id=4, input=appstream > output=es.0
[2020/11/22 19:45:12] [ warn] [engine] failed to flush chunk '1-1606074292.958592278.flb', retry in 898 seconds: task_id=141, input=appstream > output=es.0
[2020/11/22 19:45:14] [ warn] [engine] failed to flush chunk '1-1606074302.347198351.flb', retry in 32 seconds: task_id=143, input=appstream > output=es.0
[2020/11/22 19:45:14] [ warn] [engine] failed to flush chunk '1-1606074253.953778140.flb', retry in 933 seconds: task_id=133, input=appstream > output=es.0
[2020/11/22 19:45:16] [ warn] [engine] failed to flush chunk '1-1606074313.923004098.flb', retry in 6 seconds: task_id=144, input=appstream > output=es.0
[2020/11/22 19:45:18] [ warn] [engine] failed to flush chunk '1-1606074022.933436366.flb', retry in 73 seconds: task_id=89, input=appstream > output=es.0
[2020/11/22 19:45:18] [ warn] [engine] failed to flush chunk '1-1606074304.968844730.flb', retry in 82 seconds: task_id=145, input=appstream > output=es.0
[2020/11/22 19:45:19] [ warn] [engine] failed to flush chunk '1-1606074316.958207701.flb', retry in 10 seconds: task_id=146, input=appstream > output=es.0
[2020/11/22 19:45:19] [ warn] [engine] failed to flush chunk '1-1606074283.907428020.flb', retry in 207 seconds: task_id=139, input=appstream > output=es.0
[2020/11/22 19:45:22] [ warn] [engine] failed to flush chunk '1-1606074313.923004098.flb', retry in 49 seconds: task_id=144, input=appstream > output=es.0
[2020/11/22 19:45:24] [ warn] [engine] failed to flush chunk '1-1606074232.931522416.flb', retry in 109 seconds: task_id=129, input=appstream > output=es.0
...
...
[2020/11/22 19:46:31] [ warn] [engine] chunk '1-1606074022.933436366.flb' cannot be retried: task_id=89, input=appstream > output=es.0
If I set log_level to "debug", then I do see these:
[2020/11/22 09:53:18] [debug] [filter:kubernetes:kubernetes.1] could not merge JSON log as requested
which I assume are the reason the chunks are failing to flush, since I don't get the "failed to flush chunk" errors when all "Merge_Log" options are off. My understanding is that once a chunk exhausts its Retry_Limit it is dropped (the "cannot be retried" line above), which would explain the missing logs.
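To see what Elasticsearch actually answers for the rejected chunks, one thing I plan to try is turning on the es output's Trace_Error option (assuming the fluent-bit version shipped with this chart already supports it), e.g. on the "logs" output:

[OUTPUT]
    Name es
    Match logs
    # ... rest of the settings as in the config below ...
    Trace_Error On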
My current fluent-bit config is like this:
kind: ConfigMap
metadata:
  labels:
    app: fluent-bit
    app.kubernetes.io/instance: cluster-logging
    chart: fluent-bit-2.8.6
    heritage: Tiller
    release: cluster-logging
  name: config
  namespace: logging-system
apiVersion: v1
data:
  fluent-bit-input.conf: |
    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        Exclude_Path /var/log/containers/cluster-logging-*.log,/var/log/containers/elasticsearch-data-*.log,/var/log/containers/kube-apiserver-*.log
        Parser docker
        Tag kube.*
        Refresh_Interval 5
        Mem_Buf_Limit 15MB
        Skip_Long_Lines On
        Ignore_Older 7d
        DB /tail-db/tail-containers-state.db
        DB.Sync Normal
    [INPUT]
        Name systemd
        Path /var/log/journal/
        Tag host.*
        Max_Entries 1000
        Read_From_Tail true
        Strip_Underscores true
    [INPUT]
        Name tail
        Path /var/log/containers/kube-apiserver-*.log
        Parser docker
        Tag kube-apiserver.*
        Refresh_Interval 5
        Mem_Buf_Limit 5MB
        Skip_Long_Lines On
        Ignore_Older 7d
        DB /tail-db/tail-kube-apiserver-containers-state.db
        DB.Sync Normal
  fluent-bit-filter.conf: |
    [FILTER]
        Name kubernetes
        Match kube.*
        Kube_Tag_Prefix kube.var.log.containers.
        Kube_URL https://kubernetes.default.svc:443
        Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
        K8S-Logging.Parser On
        K8S-Logging.Exclude On
        Merge_Log On
        Keep_Log Off
        Annotations Off
    [FILTER]
        Name kubernetes
        Match kube-apiserver.*
        Kube_Tag_Prefix kube-apiserver.var.log.containers.
        Kube_URL https://kubernetes.default.svc:443
        Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
        K8S-Logging.Parser Off
        K8S-Logging.Exclude Off
        Merge_Log Off
        Keep_Log On
        Annotations Off
  fluent-bit-output.conf: |
    [OUTPUT]
        Name es
        Match logs
        Host elasticsearch-data
        Port 9200
        Logstash_Format On
        Retry_Limit 5
        Type flb_type
        Time_Key @timestamp
        Replace_Dots On
        Logstash_Prefix logs
        Logstash_Prefix_Key index
        Generate_ID On
        Buffer_Size 2MB
        Trace_Output Off
    [OUTPUT]
        Name es
        Match sys
        Host elasticsearch-data
        Port 9200
        Logstash_Format On
        Retry_Limit 5
        Type flb_type
        Time_Key @timestamp
        Replace_Dots On
        Logstash_Prefix sys-logs
        Generate_ID On
        Buffer_Size 2MB
        Trace_Output Off
    [OUTPUT]
        Name es
        Match host.*
        Host elasticsearch-data
        Port 9200
        Logstash_Format On
        Retry_Limit 10
        Type flb_type
        Time_Key @timestamp
        Replace_Dots On
        Logstash_Prefix host-logs
        Generate_ID On
        Buffer_Size 2MB
        Trace_Output Off
    [OUTPUT]
        Name es
        Match kube-apiserver.*
        Host elasticsearch-data
        Port 9200
        Logstash_Format On
        Retry_Limit 10
        Type _doc
        Time_Key @timestamp
        Replace_Dots On
        Logstash_Prefix kube-apiserver
        Generate_ID On
        Buffer_Size 2MB
        Trace_Output Off
  fluent-bit-stream-processor.conf: |
    [STREAM_TASK]
        Name appstream
        Exec CREATE STREAM appstream WITH (tag='logs') AS SELECT * from TAG:'kube.*' WHERE NOT (kubernetes['namespace_name']='ambassador-system' OR kubernetes['namespace_name']='argocd' OR kubernetes['namespace_name']='istio-system' OR kubernetes['namespace_name']='kube-system' OR kubernetes['namespace_name']='logging-system' OR kubernetes['namespace_name']='monitoring-system' OR kubernetes['namespace_name']='storage-system') ;
    [STREAM_TASK]
        Name sysstream
        Exec CREATE STREAM sysstream WITH (tag='sys') AS SELECT * from TAG:'kube.*' WHERE (kubernetes['namespace_name']='ambassador-system' OR kubernetes['namespace_name']='argocd' OR kubernetes['namespace_name']='istio-system' OR kubernetes['namespace_name']='kube-system' OR kubernetes['namespace_name']='logging-system' OR kubernetes['namespace_name']='monitoring-system' OR kubernetes['namespace_name']='storage-system') ;
  fluent-bit-service.conf: |
    [SERVICE]
        Flush 3
        Daemon Off
        Log_Level info
        Parsers_File parsers.conf
        Streams_File /fluent-bit/etc/fluent-bit-stream-processor.conf
  fluent-bit.conf: |
    @INCLUDE fluent-bit-service.conf
    @INCLUDE fluent-bit-input.conf
    @INCLUDE fluent-bit-filter.conf
    @INCLUDE fluent-bit-output.conf
  parsers.conf: |
    [PARSER]
        Name docker
        Format json
        Time_Key time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep On
Merge_Log is off for "kube-apiserver.*" and so far it is working fine, although the end behaviour is not what we want (no field mapping being done). Merge_Log for "kube.*" is on and is generating the fields in ES as expected... but we are losing logs.
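One variation I am considering (untested) is keeping Merge_Log on for "kube.*" but nesting the parsed fields under a dedicated key with Merge_Log_Key, in case the problem is related to the merged fields clashing at the record root. The key name "app" below is just a placeholder:

[FILTER]
    Name kubernetes
    Match kube.*
    Merge_Log On
    # placeholder key: merged JSON fields would land under "app" instead of the record root
    Merge_Log_Key app
    Keep_Log Off
    Annotations Off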
I found the relevant code in the kubernetes filter that leads to this error, but I lack the knowledge to understand how to "fix" whatever causes this message: https://github.com/fluent/fluent-bit/blob/master/plugins/filter_kubernetes/kubernetes.c#L162
This is getting really frustrating; I can't figure out why this happens or, better, how to fix it. Any help, please?
Buffer_Size 0? The default is 32K. And, according to the documentation: "Note that if pod specifications exceed the buffer limit, the API response will be discarded when retrieving metadata, and some kubernetes metadata will fail to be injected to the logs." – Helluva
flb_parser_do, and from line #640, goes to flb_parser_json_do. And here the message is manipulated in MsgPack format internally. – Helluva
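For reference, my reading of the first comment is that it refers to the kubernetes filter's Buffer_Size option (the buffer used when reading pod metadata from the API server, default 32K), so the suggested change would look like this on the "kube.*" filter (other settings unchanged):

[FILTER]
    Name kubernetes
    Match kube.*
    # 0 = no limit on the buffer for the Kubernetes API response
    Buffer_Size 0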