From time to time we find that some logs are missing in Elasticsearch, even though we can see them in Kubernetes.
The only problems I was able to find point to an issue with the kubernetes filter's JSON merging, with entries like this in the fluent-bit logs:
[2020/11/22 09:53:18] [debug] [filter:kubernetes:kubernetes.1] could not merge JSON log as requested
The problem seems to go away (at least there are no more warn/error messages in the fluent-bit logs) once we set the kubernetes filter's "Merge_Log" option to "Off". But then, of course, we lose important functionality: the records end up with no fields/values other than "message" itself.
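For context, my understanding of Merge_Log is that it tries to parse the contents of the record's "log" field as JSON and lift the parsed keys into the record; when an application writes plain text instead of JSON, the filter emits the "could not merge JSON log as requested" message at debug level. A made-up illustration of the two cases, as the lines would look in the container log files (standard docker json-file format):

A line that merges cleanly (the "log" payload is itself valid JSON):
{"log":"{\"level\":\"info\",\"msg\":\"user created\"}\n","stream":"stdout","time":"2020-11-22T09:53:18.000000000Z"}
A line that cannot be merged (plain-text payload), which triggers the debug message above:
{"log":"user created id=42\n","stream":"stdout","time":"2020-11-22T09:53:18.000000000Z"}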
There is no other error/warn message in either fluent-bit or Elasticsearch, which is why this is my main suspect. The log (with log_level set to info) is filled with:
k --context contexto09 -n logging-system logs -f -l app=fluent-bit --max-log-requests 31 | grep -iv "\[ info\]"
[2020/11/22 19:45:02] [ warn] [engine] failed to flush chunk '1-1606074289.692844263.flb', retry in 25 seconds: task_id=31, input=appstream > output=es.0
[2020/11/22 19:45:02] [ warn] [engine] failed to flush chunk '1-1606074208.938295842.flb', retry in 25 seconds: task_id=67, input=appstream > output=es.0
[2020/11/22 19:45:08] [ warn] [engine] failed to flush chunk '1-1606074298.662911160.flb', retry in 10 seconds: task_id=76, input=appstream > output=es.0
[2020/11/22 19:45:13] [ warn] [engine] failed to flush chunk '1-1606074310.619565119.flb', retry in 9 seconds: task_id=77, input=appstream > output=es.0
[2020/11/22 19:45:13] [ warn] [engine] failed to flush chunk '1-1606073869.655178524.flb', retry in 1164 seconds: task_id=33, input=appstream > output=es.0
[2020/11/22 19:45:18] [ warn] [engine] failed to flush chunk '1-1606074298.662911160.flb', retry in 282 seconds: task_id=76, input=appstream > output=es.0
[2020/11/22 19:45:21] [ warn] [engine] failed to flush chunk '1-1606073620.626120246.flb', retry in 1974 seconds: task_id=8, input=appstream > output=es.0
[2020/11/22 19:45:21] [ warn] [engine] failed to flush chunk '1-1606074050.441691966.flb', retry in 1191 seconds: task_id=51, input=appstream > output=es.0
[2020/11/22 19:45:22] [ warn] [engine] failed to flush chunk '1-1606074310.619565119.flb', retry in 79 seconds: task_id=77, input=appstream > output=es.0
[2020/11/22 19:45:22] [ warn] [engine] failed to flush chunk '1-1606074319.600878876.flb', retry in 6 seconds: task_id=78, input=appstream > output=es.0
[2020/11/22 19:45:09] [ warn] [engine] failed to flush chunk '1-1606073576.849876665.flb', retry in 1091 seconds: task_id=4, input=appstream > output=es.0
[2020/11/22 19:45:12] [ warn] [engine] failed to flush chunk '1-1606074292.958592278.flb', retry in 898 seconds: task_id=141, input=appstream > output=es.0
[2020/11/22 19:45:14] [ warn] [engine] failed to flush chunk '1-1606074302.347198351.flb', retry in 32 seconds: task_id=143, input=appstream > output=es.0
[2020/11/22 19:45:14] [ warn] [engine] failed to flush chunk '1-1606074253.953778140.flb', retry in 933 seconds: task_id=133, input=appstream > output=es.0
[2020/11/22 19:45:16] [ warn] [engine] failed to flush chunk '1-1606074313.923004098.flb', retry in 6 seconds: task_id=144, input=appstream > output=es.0
[2020/11/22 19:45:18] [ warn] [engine] failed to flush chunk '1-1606074022.933436366.flb', retry in 73 seconds: task_id=89, input=appstream > output=es.0
[2020/11/22 19:45:18] [ warn] [engine] failed to flush chunk '1-1606074304.968844730.flb', retry in 82 seconds: task_id=145, input=appstream > output=es.0
[2020/11/22 19:45:19] [ warn] [engine] failed to flush chunk '1-1606074316.958207701.flb', retry in 10 seconds: task_id=146, input=appstream > output=es.0
[2020/11/22 19:45:19] [ warn] [engine] failed to flush chunk '1-1606074283.907428020.flb', retry in 207 seconds: task_id=139, input=appstream > output=es.0
[2020/11/22 19:45:22] [ warn] [engine] failed to flush chunk '1-1606074313.923004098.flb', retry in 49 seconds: task_id=144, input=appstream > output=es.0
[2020/11/22 19:45:24] [ warn] [engine] failed to flush chunk '1-1606074232.931522416.flb', retry in 109 seconds: task_id=129, input=appstream > output=es.0
...
...
[2020/11/22 19:46:31] [ warn] [engine] chunk '1-1606074022.933436366.flb' cannot be retried: task_id=89, input=appstream > output=es.0
If I set log_level to "debug", then I do see these:
[2020/11/22 09:53:18] [debug] [filter:kubernetes:kubernetes.1] could not merge JSON log as requested
which I assume are the reason the chunks are failing to flush, since I don't get the "failed to flush chunk" errors when all "Merge_Log" options are off. My understanding is that once a chunk exhausts its Retry_Limit it is dropped (the "cannot be retried" line above), which would explain the missing logs.
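To see what Elasticsearch actually answers for the rejected chunks, one thing I plan to try is turning on the es output's Trace_Error option (assuming the fluent-bit version shipped with this chart already supports it), e.g. on the "logs" output:

[OUTPUT]
    Name es
    Match logs
    # ... rest of the settings as in the config below ...
    Trace_Error On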
My current fluent-bit config is like this:
kind: ConfigMap
metadata:
  labels:
    app: fluent-bit
    app.kubernetes.io/instance: cluster-logging
    chart: fluent-bit-2.8.6
    heritage: Tiller
    release: cluster-logging
  name: config
  namespace: logging-system
apiVersion: v1
data:
  fluent-bit-input.conf: |
    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        Exclude_Path /var/log/containers/cluster-logging-*.log,/var/log/containers/elasticsearch-data-*.log,/var/log/containers/kube-apiserver-*.log
        Parser docker
        Tag kube.*
        Refresh_Interval 5
        Mem_Buf_Limit 15MB
        Skip_Long_Lines On
        Ignore_Older 7d
        DB /tail-db/tail-containers-state.db
        DB.Sync Normal
    [INPUT]
        Name systemd
        Path /var/log/journal/
        Tag host.*
        Max_Entries 1000
        Read_From_Tail true
        Strip_Underscores true
    [INPUT]
        Name tail
        Path /var/log/containers/kube-apiserver-*.log
        Parser docker
        Tag kube-apiserver.*
        Refresh_Interval 5
        Mem_Buf_Limit 5MB
        Skip_Long_Lines On
        Ignore_Older 7d
        DB /tail-db/tail-kube-apiserver-containers-state.db
        DB.Sync Normal
  fluent-bit-filter.conf: |
    [FILTER]
        Name kubernetes
        Match kube.*
        Kube_Tag_Prefix kube.var.log.containers.
        Kube_URL https://kubernetes.default.svc:443
        Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
        K8S-Logging.Parser On
        K8S-Logging.Exclude On
        Merge_Log On
        Keep_Log Off
        Annotations Off
    [FILTER]
        Name kubernetes
        Match kube-apiserver.*
        Kube_Tag_Prefix kube-apiserver.var.log.containers.
        Kube_URL https://kubernetes.default.svc:443
        Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
        K8S-Logging.Parser Off
        K8S-Logging.Exclude Off
        Merge_Log Off
        Keep_Log On
        Annotations Off
  fluent-bit-output.conf: |
    [OUTPUT]
        Name es
        Match logs
        Host elasticsearch-data
        Port 9200
        Logstash_Format On
        Retry_Limit 5
        Type flb_type
        Time_Key @timestamp
        Replace_Dots On
        Logstash_Prefix logs
        Logstash_Prefix_Key index
        Generate_ID On
        Buffer_Size 2MB
        Trace_Output Off
    [OUTPUT]
        Name es
        Match sys
        Host elasticsearch-data
        Port 9200
        Logstash_Format On
        Retry_Limit 5
        Type flb_type
        Time_Key @timestamp
        Replace_Dots On
        Logstash_Prefix sys-logs
        Generate_ID On
        Buffer_Size 2MB
        Trace_Output Off
    [OUTPUT]
        Name es
        Match host.*
        Host elasticsearch-data
        Port 9200
        Logstash_Format On
        Retry_Limit 10
        Type flb_type
        Time_Key @timestamp
        Replace_Dots On
        Logstash_Prefix host-logs
        Generate_ID On
        Buffer_Size 2MB
        Trace_Output Off
    [OUTPUT]
        Name es
        Match kube-apiserver.*
        Host elasticsearch-data
        Port 9200
        Logstash_Format On
        Retry_Limit 10
        Type _doc
        Time_Key @timestamp
        Replace_Dots On
        Logstash_Prefix kube-apiserver
        Generate_ID On
        Buffer_Size 2MB
        Trace_Output Off
  fluent-bit-stream-processor.conf: |
    [STREAM_TASK]
        Name appstream
        Exec CREATE STREAM appstream WITH (tag='logs') AS SELECT * from TAG:'kube.*' WHERE NOT (kubernetes['namespace_name']='ambassador-system' OR kubernetes['namespace_name']='argocd' OR kubernetes['namespace_name']='istio-system' OR kubernetes['namespace_name']='kube-system' OR kubernetes['namespace_name']='logging-system' OR kubernetes['namespace_name']='monitoring-system' OR kubernetes['namespace_name']='storage-system') ;
    [STREAM_TASK]
        Name sysstream
        Exec CREATE STREAM sysstream WITH (tag='sys') AS SELECT * from TAG:'kube.*' WHERE (kubernetes['namespace_name']='ambassador-system' OR kubernetes['namespace_name']='argocd' OR kubernetes['namespace_name']='istio-system' OR kubernetes['namespace_name']='kube-system' OR kubernetes['namespace_name']='logging-system' OR kubernetes['namespace_name']='monitoring-system' OR kubernetes['namespace_name']='storage-system') ;
  fluent-bit-service.conf: |
    [SERVICE]
        Flush 3
        Daemon Off
        Log_Level info
        Parsers_File parsers.conf
        Streams_File /fluent-bit/etc/fluent-bit-stream-processor.conf
  fluent-bit.conf: |
    @INCLUDE fluent-bit-service.conf
    @INCLUDE fluent-bit-input.conf
    @INCLUDE fluent-bit-filter.conf
    @INCLUDE fluent-bit-output.conf
  parsers.conf: |
    [PARSER]
        Name docker
        Format json
        Time_Key time
        Time_Format %Y-%m-%dT%H:%M:%S.%L
        Time_Keep On
Merge_Log is off for "kube-apiserver.*" and so far it is working fine, although the end behaviour is not what we want (no field mapping being done). Merge_Log for "kube.*" is on and is generating the fields in ES as expected... but we are losing logs.
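One variation I am considering (untested) is keeping Merge_Log on for "kube.*" but nesting the parsed fields under a dedicated key with Merge_Log_Key, in case the problem is related to the merged fields clashing at the record root. The key name "app" below is just a placeholder:

[FILTER]
    Name kubernetes
    Match kube.*
    Merge_Log On
    # placeholder key: merged JSON fields would land under "app" instead of the record root
    Merge_Log_Key app
    Keep_Log Off
    Annotations Off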
I found the relevant code in the kubernetes filter that leads to this error, but I lack the knowledge to understand how to "fix" whatever causes this message: https://github.com/fluent/fluent-bit/blob/master/plugins/filter_kubernetes/kubernetes.c#L162
This is getting really frustrating; I can't figure out why this happens or, better, how to fix it. Any help, please?
Buffer_Size 0? The default is 32K. And, according to the documentation: "Note that if pod specifications exceed the buffer limit, the API response will be discarded when retrieving metadata, and some kubernetes metadata will fail to be injected to the logs." – Helluva
flb_parser_do, and from line #640, goes to flb_parser_json_do. And here the message is manipulated in MsgPack format internally. – Helluva
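For reference, my reading of the first comment is that it refers to the kubernetes filter's Buffer_Size option (the buffer used when reading pod metadata from the API server, default 32K), so the suggested change would look like this on the "kube.*" filter (other settings unchanged):

[FILTER]
    Name kubernetes
    Match kube.*
    # 0 = no limit on the buffer for the Kubernetes API response
    Buffer_Size 0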