Open connections via Spring WebSocket STOMP cause our server to die
So we use Spring WebSocket STOMP + RabbitMQ on the backend, and we are having trouble with open file descriptors. After a certain time we hit the limit on the server, and the server stops accepting any connections, both websockets and API endpoints.

2018-09-14 18:04:13.605  INFO 1288 --- [MessageBroker-1] 
o.s.w.s.c.WebSocketMessageBrokerStats    : WebSocketSession[2 current WS(2)- 
HttpStream(0)-HttpPoll(0), 1159 total, 0 closed abnormally (0 connect 
failure, 0 send limit, 63 transport error)], stompSubProtocol[processed 
CONNECT(1014)-CONNECTED(1004)-DISCONNECT(0)], stompBrokerRelay[9 sessions, 
127.0.0.1:61613 (available), processed CONNECT(1015)-CONNECTED(1005)- 
DISCONNECT(1011)], inboundChannel[pool size = 2, active threads = 2, queued 
tasks = 2, completed tasks = 12287], outboundChannel[pool size = 0, active
threads = 0, queued tasks = 0, completed tasks = 4225], sockJsScheduler[pool 
size = 1, active threads = 1, queued tasks = 3, completed tasks = 683]

And we are getting the exception below:

2018-09-14 18:04:13.761 ERROR 1288 --- [http-nio-127.0.0.1-8443-Acceptor-0] 
org.apache.tomcat.util.net.NioEndpoint   : Socket accept failed

java.io.IOException: Too many open files
    at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
    at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
    at org.apache.tomcat.util.net.NioEndpoint$Acceptor.run(NioEndpoint.java:455)
    at java.lang.Thread.run(Thread.java:748)

The default file descriptor limit on Linux is 1024, and even if we increase it to something like 65000, the server will hit the limit eventually no matter what.
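To watch how close the process is getting to that limit, the JVM can report its own open descriptor count. A minimal sketch, assuming a HotSpot/OpenJDK JVM on a Unix-like OS (the `com.sun.management.UnixOperatingSystemMXBean` interface is HotSpot-specific):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class FdMonitor {

    /** Returns the current open file descriptor count, or -1 if unavailable. */
    public static long openFileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unixOs =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            // Logging both values periodically (and alerting when the ratio
            // climbs) gives warning before "Too many open files" kills accepts.
            System.out.println("open fds: " + unixOs.getOpenFileDescriptorCount()
                    + " / max: " + unixOs.getMaxFileDescriptorCount());
            return unixOs.getOpenFileDescriptorCount();
        }
        return -1; // not a Unix HotSpot JVM
    }

    public static void main(String[] args) {
        openFileDescriptors();
    }
}
```

Wiring this into a scheduled task makes the leak visible as a steadily climbing counter rather than a sudden outage.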

We want to solve this problem from the backend side, preferably within Spring and without workarounds. Any ideas?

UPDATE

RabbitMQ and the application reside on different servers; RabbitMQ actually runs on Compose. We can reproduce this issue by not sending DISCONNECT messages from the client.

UPDATE 2

Today I realized that all the file descriptors and Java threads stay around, no matter what happens. I implemented a workaround that sends DISCONNECT messages from Spring and closes the WebSocketSession objects, and nothing changed. I implemented this after checking the links below:
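For reference, the "close the WebSocketSession from the server" workaround is usually built on a handler decorator that tracks live sessions. This is only a sketch of that pattern, not the exact code from the question; it is written against Spring 5's `WebSocketMessageBrokerConfigurer` interface (on Spring 4 / Boot 1.5 the equivalent is extending `AbstractWebSocketMessageBrokerConfigurer`), and the class name is invented:

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.springframework.context.annotation.Configuration;
import org.springframework.web.socket.CloseStatus;
import org.springframework.web.socket.WebSocketSession;
import org.springframework.web.socket.config.annotation.EnableWebSocketMessageBroker;
import org.springframework.web.socket.config.annotation.WebSocketMessageBrokerConfigurer;
import org.springframework.web.socket.config.annotation.WebSocketTransportRegistration;
import org.springframework.web.socket.handler.WebSocketHandlerDecorator;

@Configuration
@EnableWebSocketMessageBroker
public class SessionTrackingConfig implements WebSocketMessageBrokerConfigurer {

    // Live sessions keyed by STOMP session id.
    private final Map<String, WebSocketSession> sessions = new ConcurrentHashMap<>();

    @Override
    public void configureWebSocketTransport(WebSocketTransportRegistration registration) {
        registration.addDecoratorFactory(handler -> new WebSocketHandlerDecorator(handler) {
            @Override
            public void afterConnectionEstablished(WebSocketSession session) throws Exception {
                sessions.put(session.getId(), session);
                super.afterConnectionEstablished(session);
            }

            @Override
            public void afterConnectionClosed(WebSocketSession session, CloseStatus status) throws Exception {
                sessions.remove(session.getId());
                super.afterConnectionClosed(session, status);
            }
        });
    }

    /** Force-close a session whose client never sent DISCONNECT. */
    public void closeSession(String sessionId) throws IOException {
        WebSocketSession session = sessions.get(sessionId);
        if (session != null && session.isOpen()) {
            session.close(CloseStatus.GOING_AWAY);
        }
    }
}
```

A scheduled sweep over `sessions` (e.g. closing entries idle past a timeout) is the usual companion to this registry.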

And as a side note, the server side sends messages like this: simpMessagingTemplate.convertAndSend("/queue/" + sessionId, payload). This way, we ensure that each client gets only the messages for its own sessionId.
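The per-session send described above could look roughly like the controller below. This is a sketch; the mapping path, method, and payload names are invented for illustration, and only `convertAndSend("/queue/" + sessionId, payload)` is taken from the question:

```java
import org.springframework.messaging.handler.annotation.MessageMapping;
import org.springframework.messaging.simp.SimpMessageHeaderAccessor;
import org.springframework.messaging.simp.SimpMessagingTemplate;
import org.springframework.stereotype.Controller;

@Controller
public class QueueController {

    private final SimpMessagingTemplate simpMessagingTemplate;

    public QueueController(SimpMessagingTemplate simpMessagingTemplate) {
        this.simpMessagingTemplate = simpMessagingTemplate;
    }

    @MessageMapping("/request")
    public void handle(String payload, SimpMessageHeaderAccessor headerAccessor) {
        // The STOMP session id identifies the connected client, so each
        // client subscribes to /queue/{its own session id} and receives
        // only its own replies.
        String sessionId = headerAccessor.getSessionId();
        simpMessagingTemplate.convertAndSend("/queue/" + sessionId, payload);
    }
}
```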

Is this some sort of bug? Why aren't the file descriptors being closed? Has nobody encountered this issue before?

UPDATE 3

Every time a socket is closed, I see the exception below. It doesn't matter how it is closed, whether by a DISCONNECT message from the client or by webSocketSession.close() on the server.

[reactor-tcp-io-66] o.s.m.s.s.StompBrokerRelayMessageHandler : TCP connection failure in session 45r7i9u3: Transport failure: epoll_ctl(..) failed: No such file or directory
io.netty.channel.unix.Errors$NativeIoException: epoll_ctl(..) failed: No such file or directory
    at io.netty.channel.unix.Errors.newIOException(Errors.java:122)
    at io.netty.channel.epoll.Native.epollCtlMod(Native.java:134)
    at io.netty.channel.epoll.EpollEventLoop.modify(EpollEventLoop.java:186)
    at io.netty.channel.epoll.AbstractEpollChannel.modifyEvents(AbstractEpollChannel.java:272)
    at io.netty.channel.epoll.AbstractEpollChannel.clearFlag(AbstractEpollChannel.java:125)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.clearEpollRdHup(AbstractEpollChannel.java:450)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollRdHupReady(AbstractEpollChannel.java:442)
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:417)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:310)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
    at java.lang.Thread.run(Thread.java:748)

So I changed the log level to TRACE, and I see that the websockets really are being closed, but these exceptions are thrown immediately afterwards. So at this point, I am really suspicious of this exception. The number of hung Java threads always goes hand-in-hand with the number of websockets, i.e. creating 400 websockets always ends up with ~400 hung threads in the main process. And the memory is never released.

Googling this exception turns up only the four results below (the rest are about other exceptions):

Updating the netty library to the latest version (4.1.29.Final) didn't help either, so I changed the tags of the question accordingly. I am also considering opening an issue against netty. I have tried a lot of things and experimented several times at the application level, but nothing seems to work. I am open to any kind of ideas at this point.

Jaehne answered 19/9, 2018 at 12:50 Comment(7)
What Spring / RabbitMQ versions are you using? Maybe it's some hidden bug in the libraries. Is Rabbit running on the same server as the affected application? You could try analyzing with JVisualVM or a similar tool, especially searching the heap dump for objects holding the open sockets etc. after the IOException occurs – Extrude
Hi @KamilPiwowarski. The RabbitMQ version is 3.7.7 and the Spring Boot version is 1.5.9. This happens when the client side does not send DISCONNECT messages, so I see it as a security hole. When I write JS code that doesn't send disconnects, then after around 1000 messages the server fails. I'm not sure how other people solve this problem, since it looks like a common issue. Rabbit is not on the same server; it's on Compose actually. I will try to look at it with JVisualVM as well. – Jaehne
@Jaehne what is your RabbitMQ client library version, and are you using any wrapper libraries for managing connections, e.g. spring-amqp? – Vermicular
@KarolDowbecki thanks for the answer. My client version is 5.2.0 and the only dependency included is spring-boot-starter-websocket, and I guess that doesn't include spring-amqp. In my case the issue is not Java threads being spawned immediately, but that they are never closed. Maybe I should switch to the latest version and retry. – Jaehne
Did you find a solution to this issue? I'm facing the same issue. Thanks for the help in advance – Rosyrot
@SulimanAlzamel Yes I did, but in a completely separate question. Here is the solution: https://mcmap.net/q/1012917/-spring-boot-ssl-tcpclient-stompbrokerrelaymessagehandler-activemq-undertow – Jaehne
@Jaehne thanks – Rosyrot

If you always use try-with-resources or close your opened files in a finally block, then you may have genuinely exceeded your file descriptor limit and you need another host to accept your requests. For this, you need to scale your application and load-balance it. I also suggest deploying RabbitMQ in a cluster to zero in on this issue.
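As a minimal illustration of the try-with-resources point: any `Closeable` opened in the resource clause releases its file descriptor when the block exits, whether normally or via an exception, so descriptors cannot leak on that path. (The file name below is arbitrary.)

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TryWithResourcesDemo {

    /** Writes a line to a temp file and reads it back, leaking no descriptors. */
    public static String writeAndRead() throws IOException {
        Path tmp = Files.createTempFile("fd-demo", ".txt");
        try {
            // The writer's underlying descriptor is closed automatically
            // when this try block exits, normally or exceptionally.
            try (BufferedWriter w = Files.newBufferedWriter(tmp)) {
                w.write("hello");
            }
            try (BufferedReader r = Files.newBufferedReader(tmp)) {
                return r.readLine();
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(writeAndRead());
    }
}
```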

There are also cases where RabbitMQ disregards your file descriptor limits.

[Screenshot: RabbitMQ documentation, system limits section]

Indention answered 22/9, 2018 at 4:17 Comment(4)
Hi. Thanks for the answer. I added more details about the context. We already have a load balancer, and RabbitMQ runs on Compose, on a different machine from the application. We reproduce the issue by disregarding the DISCONNECT message on websockets; if you don't send it, the websockets stay open forever, which sucks. We are looking for a way to solve it from the backend. – Jaehne
@Jaehne My apologies for that. Just a question: are you sending heartbeats from both server and client? And why is your stompBrokerRelay connected to 127.0.0.1 in the log you've shown? Is that for demo purposes only? – Indention
The servers are behind an nginx reverse proxy; that's why they are seen as the localhost IP. I didn't implement anything explicitly for the heartbeat, however I see the HEARTBEAT messages while the client JS is open in the browser. Heartbeats die after the client is closed. No heartbeats on the server side. Would that solve it? – Jaehne
Additional information: queues are created as auto-delete on RabbitMQ. Even after the queues are closed and the browser connection is closed, I can still see Java threads hanging there. My apologies if my words sound dumb for this topic, since I am a newbie at this stuff. – Jaehne
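On the heartbeat question in the comments above: with the STOMP broker relay, the heartbeats on the shared system connection to the broker can be configured explicitly. A sketch, assuming Spring 5's `WebSocketMessageBrokerConfigurer` interface; the relay host, endpoint path, and 10-second intervals are placeholder values, not from the question:

```java
import org.springframework.context.annotation.Configuration;
import org.springframework.messaging.simp.config.MessageBrokerRegistry;
import org.springframework.web.socket.config.annotation.EnableWebSocketMessageBroker;
import org.springframework.web.socket.config.annotation.StompEndpointRegistry;
import org.springframework.web.socket.config.annotation.WebSocketMessageBrokerConfigurer;

@Configuration
@EnableWebSocketMessageBroker
public class BrokerConfig implements WebSocketMessageBrokerConfigurer {

    @Override
    public void configureMessageBroker(MessageBrokerRegistry registry) {
        registry.enableStompBrokerRelay("/queue", "/topic")
                .setRelayHost("rabbitmq.example.com")   // hypothetical host
                .setRelayPort(61613)
                // Heartbeats on the "system" TCP connection to the broker;
                // per-client heartbeats are negotiated in the STOMP CONNECT frame.
                .setSystemHeartbeatSendInterval(10_000)
                .setSystemHeartbeatReceiveInterval(10_000);
    }

    @Override
    public void registerStompEndpoints(StompEndpointRegistry registry) {
        registry.addEndpoint("/ws").withSockJS();
    }
}
```

Heartbeats let both sides detect a dead peer and tear the connection down, which is one way half-open sockets from clients that never DISCONNECT can get reclaimed.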

The RabbitMQ Java client library from time to time has issues with managing open file descriptors. It's rarely bad, but there are gotchas, e.g. ChannelManager line 218.

You want to try a few different Java client library versions, as this is a client-side issue. In one version I had thousands of Java threads being spawned due to an error in connection creation (not sure which version was affected; I spotted this by using Flight Recorder and going to the locks section, where all threads were waiting to acquire a RabbitMQ connection(?) class lock).

Vermicular answered 24/9, 2018 at 15:50 Comment(0)
