I have a nasty issue with load-balanced Tomcat servers that are hanging up. Any help would be greatly appreciated.
The system
I'm running Tomcat 6.0.26 on HotSpot Server 14.3-b01 (Java 1.6.0_17-b04) on three servers sitting behind another server that acts as load balancer. The load balancer runs Apache (2.2.8-1) + MOD_JK (1.2.25). All of the servers are running Ubuntu 8.04.
The Tomcat's have 2 connectors configured: an AJP one, and a HTTP one. The AJP is to be used with the load balancer, while the HTTP is used by the dev team to directly connect to a chosen server (if we have a reason to do so).
I have Lambda Probe 1.7b installed on the Tomcat servers to help me diagnose and fix the problem soon to be described.
The problem
Here's the problem: after about 1 day the application servers are up, JK Status Manager starts reporting status ERR
for, say, Tomcat2. It will simply get stuck on this state, and the only fix I've found so far is to ssh the box and restart Tomcat.
I must also mention that JK Status Manager takes a lot longer to refresh when there's a Tomcat server in this state.
Finally, the "Busy" count of the stuck Tomcat on JK Status Manager is always high, and won't go down per se -- I must restart the Tomcat server, wait, then reset the worker on JK.
Analysis
Since I have 2 connectors on each Tomcat (AJP and HTTP), I still can connect to the application through the HTTP one. The application works just fine like this, very, very fast. That is perfectly normal, since I'm the only one using this server (as JK stopped delegating requests to this Tomcat).
To try to better understand the problem, I've taken a thread dump from a Tomcat which is not responding anymore, and from another one that has been restarted recently (say, 1 hour before).
The instance that is responding normally to JK shows most of the TP-ProcessorXXX threads in "Runnable" state, with the following stack trace:
java.net.SocketInputStream.socketRead0 ( native code )
java.net.SocketInputStream.read ( SocketInputStream.java:129 )
java.io.BufferedInputStream.fill ( BufferedInputStream.java:218 )
java.io.BufferedInputStream.read1 ( BufferedInputStream.java:258 )
java.io.BufferedInputStream.read ( BufferedInputStream.java:317 )
org.apache.jk.common.ChannelSocket.read ( ChannelSocket.java:621 )
org.apache.jk.common.ChannelSocket.receive ( ChannelSocket.java:559 )
org.apache.jk.common.ChannelSocket.processConnection ( ChannelSocket.java:686 )
org.apache.jk.common.ChannelSocket$SocketConnection.runIt ( ChannelSocket.java:891 )
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run ( ThreadPool.java:690 )
java.lang.Thread.run ( Thread.java:619 )
The instance that is stuck shows most (all?) of the TP-ProcessorXXX threads in "Waiting" state. These have the following stack trace:
java.lang.Object.wait ( native code )
java.lang.Object.wait ( Object.java:485 )
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run ( ThreadPool.java:662 )
java.lang.Thread.run ( Thread.java:619 )
I don't know of the internals of Tomcat, but I would infer that the "Waiting" threads are simply threads sitting on a thread pool. So, if they are threads waiting inside of a thread pool, why wouldn't Tomcat put them to work on processing requests from JK?
EDIT: I don't know if this is normal, but Lambda Probe shows me, in the Status section, that there are lots of threads in KeepAlive
state. Is this somehow related to the problem I'm experiencing?
Solution?
So, as I've stated before, the only fix I've found is to stop the Tomcat instance, stop the JK worker, wait the latter's busy count slowly go down, start Tomcat again, and enable the JK worker once again.
What is causing this problem? How should I further investigate it? What can I do to solve it?
Thanks in advance.