Play framework/Netty does not release sockets

I already asked a similar question but deleted it, since I thought I had fixed the issue. I was wrong.

I'm using the Play framework in production for one of my web projects. From time to time Play fails to render the main page or to return some of the static content files.

The first screenshot shows the Firebug console: loading of the site is stuck at the very beginning, while serving the home page. The second screenshot shows the Fiddler console, where two static resources fail to load.

[screenshot: Firebug console, page load stuck at the start]

[screenshot: Fiddler console, two static resources not loading]

Initially the application runs fine, but after 5-7 days of uptime the issue appears. It is hard to reproduce: it happens about 1 time in 15, and I have to delete cached data and reload the page (pressing Ctrl-F5 in Firefox). The issue can be reproduced in most browsers, from different machines and operating systems. Initially I thought there was a problem with the hosting provider, but I changed it and the issue has not gone away.

The Play version is 1.2.5 (I tried 1.2.2 as well). Play runs as a standalone server on 32-bit CentOS 5.

I suspect a problem in Netty, which the Play framework uses internally. Play ships with the Netty 3.5.7.Final jar.

cd /proc/28761/fd      # fd directory of the Play process (PID 28761)
ls -l | wc -l          # count its open file descriptors
337

Over a few days the number of open file descriptors grows from 140 to 350. Note that the average load on the website is the same at the beginning and later.
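To quantify the growth, here is a minimal monitoring sketch (the PID 28761 is taken from above; the log path and interval are arbitrary) that records the total fd count and the socket-only count once per hour:

# log total and socket-only fd counts hourly (PID 28761 assumed from above)
while true; do
    total=$(ls /proc/28761/fd | wc -l)                   # all open fds
    sockets=$(ls -l /proc/28761/fd | grep -c 'socket:')  # sockets only
    echo "$(date '+%F %T') total=$total sockets=$sockets" >> /tmp/fd-growth.log
    sleep 3600
done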

I can see a lot of sockets opened by the process that are never released later.

lrwx------ 1 root root 64 Nov 11 10:34 300 -> socket:[1079566]
lrwx------ 1 root root 64 Nov 11 10:34 301 -> socket:[1079568]
lrwx------ 1 root root 64 Nov 11 10:34 302 -> socket:[1149958]
lrwx------ 1 root root 64 Nov 11 10:34 303 -> socket:[1160807]
lrwx------ 1 root root 64 Nov 11 10:34 304 -> socket:[1160605]
lrwx------ 1 root root 64 Nov 11 10:34 305 -> socket:[1157435]
lrwx------ 1 root root 64 Nov 11 10:34 306 -> socket:[1160607]
lrwx------ 1 root root 64 Nov 11 10:34 307 -> socket:[1160609]
lrwx------ 1 root root 64 Nov 11 10:34 308 -> socket:[1155542]
lrwx------ 1 root root 64 Nov 11 10:34 309 -> socket:[1120231]
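To find out which TCP endpoints those socket inodes belong to, the inode can be looked up in /proc/net/tcp (addresses there are hex-encoded), or lsof can do the decoding. A sketch, again assuming PID 28761 and one of the inodes listed above:

grep 1079566 /proc/net/tcp /proc/net/tcp6   # raw kernel TCP tables
lsof -p 28761 -a -i TCP                     # all TCP sockets held by the process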

Update

At application start (after a few hours of running), the number of open TCP connections reported by ss -s is 63:

Total: 150 (kernel 181)
TCP:   63 (estab 38, closed 5, orphaned 0, synrecv 0, timewait 3/0), ports 44

Transport Total     IP        IPv6
*         181       -         -
RAW       0         0         0
UDP       7         4         3
TCP       58        9         49
INET      65        13        52
FRAG      0         0         0

After 2 days of running, the kernel socket count has grown to 490 (378 of them TCP connections):

[root@82711-2 fd]# ss -s
Total: 459 (kernel 490)
TCP:   378 (estab 271, closed 23, orphaned 0, synrecv 0, timewait 9/0), ports 37

Transport Total     IP        IPv6
*         490       -         -
RAW       0         0         0
UDP       7         4         3
TCP       355       12        343
INET      362       16        346
FRAG      0         0         0

All of these open TCP connections are HTTP connections (not database or anything else). The average load on the website stays the same the whole time, but the number of open file descriptors and open sockets keeps growing until the process hits a "too many open files" exception.
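One way to verify that the leaked sockets are HTTP connections is to group the process's TCP sockets by state on the listening port. A sketch, assuming Play listens on its default port 9000 (substitute the actual HTTP port):

# count TCP sockets on the Play port, grouped by state (ESTAB, CLOSE-WAIT, ...)
ss -tan '( sport = :9000 )' | awk 'NR > 1 {print $1}' | sort | uniq -c

A large and growing CLOSE-WAIT count here would point at the server never closing connections that the clients have already closed.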

Initially the application starts with 9-15 New I/O threads (Netty workers). All the Netty threads are in the Running state most of the time, plus ~16 Play threads, which are in the Waiting state.

After a few days of running, the number of Netty workers grew to 27. I'm not a Netty expert, so I'm not sure whether that is normal behaviour.
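The worker count can be tracked from live thread dumps; a sketch, with the PID assumed as above:

jstack 28761 | grep -c 'New I/O.*worker'   # number of Netty worker threads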

A few threads deadlocked: one Play thread and one Netty thread. There is also another Play thread blocked by the first Play thread, so three blocked threads in total. I'm sure these deadlocks are not the root cause of the issue, but they may share the same root cause.
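For reference, jstack can capture a full dump and flags Java-level monitor deadlocks automatically (PID assumed as above):

jstack -l 28761 > /tmp/threads.txt                           # -l includes lock details
grep -A 10 'Found one Java-level deadlock' /tmp/threads.txt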

Name: New I/O  worker #21
State: BLOCKED on org.jboss.netty.handler.stream.ChunkedWriteHandler@15e057 owned by: play-thread-2
Total blocked: 44  Total waited: 9

Stack trace: 
org.jboss.netty.handler.stream.ChunkedWriteHandler.flush(ChunkedWriteHandler.java:188)
org.jboss.netty.handler.stream.ChunkedWriteHandler.handleUpstream(ChunkedWriteHandler.java:140)
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:792)
org.jboss.netty.channel.SimpleChannelUpstreamHandler.channelClosed(SimpleChannelUpstreamHandler.java:212)
org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:93)
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:792)
org.jboss.netty.handler.codec.replay.ReplayingDecoder.cleanup(ReplayingDecoder.java:636)
org.jboss.netty.handler.codec.replay.ReplayingDecoder.channelClosed(ReplayingDecoder.java:533)
org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:93)
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
org.jboss.netty.channel.Channels.fireChannelClosed(Channels.java:476)
org.jboss.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:631)
org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.handleAcceptedSocket(NioServerSocketPipelineSink.java:109)
org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.eventSunk(NioServerSocketPipelineSink.java:66)
org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:780)
org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:55)
org.jboss.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:591)
org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:785)
org.jboss.netty.handler.stream.ChunkedWriteHandler.handleDownstream(ChunkedWriteHandler.java:111)
org.jboss.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:591)
org.jboss.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:582)
org.jboss.netty.channel.Channels.close(Channels.java:821)
org.jboss.netty.channel.AbstractChannel.close(AbstractChannel.java:194)
org.jboss.netty.channel.ChannelFutureListener$1.operationComplete(ChannelFutureListener.java:41)
org.jboss.netty.channel.DefaultChannelFuture.notifyListener(DefaultChannelFuture.java:399)
org.jboss.netty.channel.DefaultChannelFuture.notifyListeners(DefaultChannelFuture.java:385)
org.jboss.netty.channel.DefaultChannelFuture.setSuccess(DefaultChannelFuture.java:334)
org.jboss.netty.channel.socket.nio.AbstractNioWorker.write0(AbstractNioWorker.java:493)
   - locked java.lang.Object@3b7e28
org.jboss.netty.channel.socket.nio.AbstractNioWorker.writeFromTaskLoop(AbstractNioWorker.java:431)
org.jboss.netty.channel.socket.nio.AbstractNioChannel$WriteTask.run(AbstractNioChannel.java:364)
org.jboss.netty.channel.socket.nio.AbstractNioWorker.processWriteTaskQueue(AbstractNioWorker.java:349)
org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:245)
org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:38)
org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:102)
org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)

Second thread:

Name: play-thread-2
State: BLOCKED on java.lang.Object@3b7e28 owned by: New I/O  worker #21
Total blocked: 23  Total waited: 34 778

Stack trace: 
org.jboss.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:654)
org.jboss.netty.channel.socket.nio.AbstractNioWorker.writeFromUserCode(AbstractNioWorker.java:408)
org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.handleAcceptedSocket(NioServerSocketPipelineSink.java:127)
org.jboss.netty.channel.socket.nio.NioServerSocketPipelineSink.eventSunk(NioServerSocketPipelineSink.java:66)
org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:780)
org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:63)
org.jboss.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:591)
org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:785)
org.jboss.netty.channel.Channels.write(Channels.java:733)
org.jboss.netty.handler.stream.ChunkedWriteHandler.flush(ChunkedWriteHandler.java:262)
   - locked org.jboss.netty.handler.stream.ChunkedWriteHandler@15e057
org.jboss.netty.handler.stream.ChunkedWriteHandler.handleDownstream(ChunkedWriteHandler.java:121)
org.jboss.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:591)
org.jboss.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:582)
org.jboss.netty.channel.Channels.write(Channels.java:712)
org.jboss.netty.channel.Channels.write(Channels.java:679)
org.jboss.netty.channel.AbstractChannel.write(AbstractChannel.java:245)
play.server.PlayHandler.serveStatic(PlayHandler.java:886)
play.server.PlayHandler$NettyInvocation.init(PlayHandler.java:182)
play.Invoker$Invocation.run(Invoker.java:276)
play.server.PlayHandler$NettyInvocation.run(PlayHandler.java:229)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
java.util.concurrent.FutureTask.run(FutureTask.java:138)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
java.lang.Thread.run(Thread.java:662)

Update

I deployed the same Play application to the same environment, this time on Tomcat 7. 24 hours have passed and the problem is gone: the number of open TCP connections remains constant, and the number of open file descriptors does not exceed ~70. These are the same production hosts, the same database and the same application users.

Blumenthal answered 11/11, 2012 at 12:03. Comments (11):
Do the requests hit the Play app directly or do they go through a proxy (Apache, etc.)? – Inlier
Not using a proxy; requests go directly to Play. – Blumenthal
Did you take a thread dump on the server side? Maybe something is hung on the server, preventing Play from reusing the threads. – Communicate
Have you been able to reproduce this with a load test using ApacheBench or something similar? – Inlier
I created a simple Play 1.2.5 app and ran ApacheBench against it with 1M requests. The number of open file descriptors was the same after the test, so I'm guessing this is something outside of Play, like something in the database code. You will probably have to find some way to reproduce the issue (like ApacheBench) to be able to start narrowing down where the problem is. – Inlier
@James, if it is an NIO bug, it could be platform dependent and hard to reproduce, hence you might not be able to repeat it. I'm absolutely sure there is no bug in my application code and no issues with the database. I have updated my question. – Blumenthal
Can you reproduce this on your production, staging or dev systems with ApacheBench? I'd run ab against a few different URLs (controllers that make DB calls, controllers that don't, static assets, etc.) and start to narrow down where the issue is. – Inlier
I have not tried a stress test yet, for lack of time, and I only have the prod environment =) But I can definitely say that DB calls can't be the cause: no threads are stuck on DB calls, and DB calls can't affect the serving of static content. All the open TCP connections are HTTP connections; I can see it by filtering them by port number. – Blumenthal
Can you share the full thread dump where you saw the blocked threads? – Honorarium
@Norman Sadly, I only copied the stack traces of the two threads that deadlocked. – Blumenthal
Looks like deadlocks in ChunkedWriteHandler were a recurring issue some time ago: one, two, three. Might be a good idea to report this as a bug. – Lecythus

I actually encountered a similar bug, not in Play but in the JVM (which Play runs on), whereby closed channels pointing to file handles are not released until the JVM itself shuts down. Alas, I cannot remember how I found the bug report, or I'd link to it, but it is a known bug IN THE JVM. I ended up having to work around it. The best thing I can suggest is to rewrite your code to reuse the same channels/file handles as much as possible.

Gironde answered 4/12, 2012 at 20:15. Comments (1):
Seems unlikely, as this would affect any kind of Java web app, but this is not a problem we're seeing on a regular basis. – Fischer

There were several issues with deadlocks in the ChunkedWriteHandler. All of them seem to be resolved in your Netty version. In any case, that piece of code seems to attract this kind of problem, so I suggest you file an issue with the Netty developers:

https://issues.jboss.org/browse/NETTY-384

Also see the "similar issues" linked there to get an idea of how many issues there have been regarding that class.

Fischer answered 8/12, 2012 at 18:58.
