Kafka broker node goes down with "Too many open files" error

We have a 3-node Kafka cluster deployment with 35 topics in total, each with 50 partitions, and a replication factor of 2. We are seeing a very strange problem: intermittently a Kafka node stops responding with the error:

ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: Too many open files
  at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
  at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
  at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
  at kafka.network.Acceptor.accept(SocketServer.scala:460)
  at kafka.network.Acceptor.run(SocketServer.scala:403)
  at java.lang.Thread.run(Thread.java:745)

We have deployed the latest Kafka version and are using spring-kafka as the client:

kafka_2.12-2.1.0 (CentOS Linux release 7.6.1810 (Core))

  • There are three observations (a quick cross-check is sketched after this list):
    1. If we run lsof -p <kafka_pid>|wc -l, the total number of open descriptors is only around 7,000.
    2. If we just run lsof|grep kafka|wc -l, we get around 1.5 million open FDs. We have checked that they all belong to the Kafka process.
    3. If we downgrade the system to CentOS 6, the output of lsof|grep kafka|wc -l comes back to around 7,000.
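
One possible explanation for the second observation, which we would still need to verify, is that the lsof shipped with CentOS 7 lists every task (thread) of a process by default and repeats the same descriptor table for each of them, so a heavily threaded JVM multiplies the apparent count; the per-process limit only counts unique descriptors. A quick cross-check, using the same <kafka_pid> placeholder as above:

    # Unique descriptors held by the broker process (this is what
    # "Max open files" is charged against)
    ls /proc/<kafka_pid>/fd | wc -l

    # Number of tasks (threads); if plain lsof emits one row per task per FD,
    # the inflated count should be roughly fd_count * task_count
    ls /proc/<kafka_pid>/task | wc -l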

We have tried setting the file limits very high, but we still hit this issue. The following limits are set for the Kafka process (a systemd override that applies such a limit is sketched after this output):

cat /proc/<kafka_pid>/limits
    Limit                     Soft Limit           Hard Limit           Units
    Max cpu time              unlimited            unlimited            seconds
    Max file size             unlimited            unlimited            bytes
    Max data size             unlimited            unlimited            bytes
    Max stack size            8388608              unlimited            bytes
    Max core file size        0                    unlimited            bytes
    Max resident set          unlimited            unlimited            bytes
    Max processes             513395               513395               processes
    Max open files            500000               500000               files
    Max locked memory         65536                65536                bytes
    Max address space         unlimited            unlimited            bytes
    Max file locks            unlimited            unlimited            locks
    Max pending signals       513395               513395               signals
    Max msgqueue size         819200               819200               bytes
    Max nice priority         0                    0
    Max realtime priority     0                    0
    Max realtime timeout      unlimited            unlimited            us
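
For reference, on CentOS 7 the broker typically runs under systemd, where limits from /etc/security/limits.conf are not inherited; the descriptor limit is normally raised with a unit drop-in. A minimal sketch, assuming the broker runs as a unit named kafka.service (the unit name and drop-in path are assumptions, adjust to the actual deployment):

    # /etc/systemd/system/kafka.service.d/limits.conf
    [Service]
    LimitNOFILE=500000

    # pick up the drop-in and restart the broker
    sudo systemctl daemon-reload
    sudo systemctl restart kafka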

We have a few questions here:

  • Why does the broker go down intermittently when we have already configured such large process limits? Does Kafka require even more available file descriptors?
  • Why is there a difference between the output of lsof and lsof -p on CentOS 6 and CentOS 7?
  • Are 3 broker nodes too few? With a replication factor of 2, we have around 100 partition replicas per topic distributed among 3 nodes, i.e. around 33 per topic per node (a rough per-broker descriptor estimate is sketched after this list).
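
As a back-of-the-envelope check on the last question, the ~7,000 descriptors reported by lsof -p are roughly what this topology would be expected to hold open, assuming about 2 live log segments per partition under the 1-hour roll/retention settings and about 3 files per segment (.log, .index, .timeindex); the segment and per-segment file counts here are assumptions, not measured values:

    # 35 topics * 50 partitions * 2 replicas spread over 3 brokers
    #   ~= 1166 partition replicas per broker
    # 1166 replicas * 2 segments * 3 files per segment ~= 7000 open files
    echo $(( 35 * 50 * 2 / 3 * 2 * 3 ))

If that estimate is in the right ballpark, the 500,000 limit leaves enormous headroom, which would point at leaked sockets or descriptors rather than at the broker count.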

Edit 1: It seems we are hitting this Kafka issue: https://issues.apache.org/jira/browse/KAFKA-7697

We plan to downgrade Kafka to version 2.0.1.

Mercer answered 12/3, 2019 at 17:20. Comments (10):
What is the traffic going through Kafka? Have you ensured that the producers and consumers are behaving correctly and doing exactly what they should? The number of file descriptors is indeed very large. We generally use Ubuntu with a very large deployment of 100K msg/sec and have never seen such a problem.Waggle
Currently the overall traffic across all topics is just 2K msg/sec. There seems to be nothing wrong with the producers and consumers. One thing to mention is that we have changed the Kafka log retention policy to 1 hour, as our messages need not be long-lived.Mercer
Have you adjusted KAFKA_LOG_ROLL_MS? What is its value currently?Waggle
Yes, we have configured the following values: log.retention.check.interval.ms=300000, log.retention.ms=3600000, log.roll.ms=3600000.Mercer
These values look OK. As you have mentioned that you do not see this issue on CentOS 6, don't you think it is perhaps related to the OS?Waggle
Rather, I feel the difference could just be in how lsof displays things on CentOS 6 versus CentOS 7. I believe it should not be an OS issue.Mercer
Does the Kafka broker stop responding intermittently on CentOS 6 as well?Waggle
On CentOS 6 we have just the performance environment. Although the real scenario of using all topics is hard to reproduce, we have seen this issue on CentOS 6 when the configured FD limit was 50K.Mercer
For sure not a Kafka issue; we have it running smoothly in prod on Ubuntu, though not the latest version.Waggle
Can you check the "maxClientCnxns" property in the zookeeper.properties file? If it is greater than 0, change it to 0 and check.Excoriate

Based on the asker's earlier update, he found that he was hitting https://issues.apache.org/jira/browse/KAFKA-7697

A quick check now shows that it is resolved, and based on the JIRA it seems the solution for this problem is to use Kafka 2.1.1 or above.
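
If it helps to confirm which version a broker is actually running, the broker logs it at startup via AppInfoParser; a quick check, assuming the default server.log location (the path is an assumption):

    # the broker prints its version when it starts, e.g. "INFO Kafka version : 2.1.0"
    grep "Kafka version" /opt/kafka/logs/server.log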

Percyperdido answered 22/7, 2021 at 12:06
