How can I debug a non-responsive server, when the profiler can't collect samples?

I have been having occasional problems with a server I wrote. It's in Clojure, but I don't think that matters, and we can pretend it's in Java. Anyway, it works fine for hours at a time, but goes into fits where it behaves very badly: all activity stops, for around fifteen seconds, and then it works normally for a few seconds, then stops for fifteen seconds...and so on for (usually) about ten minutes or so, after which it goes back to behaving normally.

I've done a lot of profiling of it with YourKit, and I've ruled out a number of plausible suspects:

  • It's not a garbage collection issue: I'm running it with -XX:+UseConcMarkSweepGC, and I've verified that the server continues to run just fine during both minor and major collections, due to the concurrent nature of this garbage collector. And we're not thrashing as we run out of total memory or something: the current heap size is well below its max.

  • I don't think it's a locking/synchronization issue, but I'm not 100% sure on that. The YourKit profiler shows threads waiting sometimes, e.g. competing over the lock for System.out to produce log messages, but the only long waits are for worker threads in threadpools when there's nothing to do. And of course YourKit says it's never detected any deadlocks.

  • It's not something caused by having the profiler attached, because it still happens even if I boot the server up and then leave it alone without ever attaching the profiler.

  • It's not some other process on the system taking up all the CPU time: top shows CPU usage at 100% for my java process, and basically 0% for everything else.

My biggest problem is that I can't see what the server is doing during these strange funks, because the profiler stops receiving samples. Here's the CPU usage chart:

[Image: YourKit CPU-graph screenshot]

The left side of the graph is normal operation, during which we get profiler samples every second or so. The right side is "broken", and is very spiky because the profiler is only getting samples every ten seconds or so. In the samples it does get, the server seems to be doing its usual business: responding to requests and so on; and the logs confirm that it is doing normal stuff, but only at the times the profiler has samples for: during the upward-sloping "straight lines" on the graph, for which the profiler has no samples, the server is doing nothing at all.

So, does this graph look familiar to anyone? Have you had this problem before and fixed it? Or can you point me in the direction of a tool that can figure out what my server is doing during the time when YourKit can't? In case it matters, the server machine is running Ubuntu 10.04, and

$ java -version
java version "1.6.0_22"
OpenJDK Runtime Environment (IcedTea6 1.10.10) (rhel-1.28.1.10.10.el5_8-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b11, mixed mode)
Cowling answered 2/5, 2013 at 20:30 Comment(5)
This might be a gigantic pain in the ass, but you could put debug logging all throughout your code, then see what gets written to the log. Another possibility is that the problem is not your java program, but in fact some other job on the server that eats all the resources for 10 minutes.Follansbee
What you need to get is one (1) stack sample when it's hung, and then examine it and understand it. This isn't about measuring - it's about "why is it hung?" Of course, as durron597 said, it might not be your code at fault, so you might need a sample from all threads.Digitize
That's a good point, @durron, but nothing else interesting is running on this machine, and top shows the java process using 100% CPU during its "sad times". I'll edit that into the question. And I already have quite a bit of logging, as I mentioned: none of it happens when the server is stuck.Cowling
Even using CMS, stop-the-world full GC can kick in sometimes. Have you enabled/checked plain GC logs?Fusionism
@Fusionism Yes, and also watched, via the profiler, how the app behaves during GC. I am pretty confident it is not a GC-related problem.Cowling

Okay, from the comments it seems clear to me we are not going to be able to figure this out with the information you've given so far. The best we can do is to give suggestions on how to debug it...

I would try to use jstack during one of the spikes and see if you can use that to figure out where it hangs.
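
If attaching an external tool at the right moment is awkward, roughly the same information can be collected from inside the process. The following is only a minimal sketch: `ThreadDumper` is a made-up helper name, and the per-thread CPU figure is included on the assumption that it helps spot whichever thread is burning the 100% CPU that top reports. It sticks to the standard `java.lang.management` API and avoids newer syntax, so it should also compile on Java 6.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Hypothetical helper (not part of any library): formats a dump of all live threads. */
final class ThreadDumper {
    private ThreadDumper() {}

    /** Returns one block per live thread: name, id, state, CPU time if available, full stack. */
    static String dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // CPU timing is optional on some JVMs, so check before asking for it.
        boolean cpuSupported = mx.isThreadCpuTimeSupported() && mx.isThreadCpuTimeEnabled();
        StringBuilder out = new StringBuilder();
        // Integer.MAX_VALUE requests the full stack depth for every thread.
        ThreadInfo[] infos = mx.getThreadInfo(mx.getAllThreadIds(), Integer.MAX_VALUE);
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // the thread died between the two calls
            }
            out.append('"').append(info.getThreadName()).append("\" id=")
               .append(info.getThreadId())
               .append(" state=").append(info.getThreadState());
            if (cpuSupported) {
                long cpuNanos = mx.getThreadCpuTime(info.getThreadId());
                if (cpuNanos >= 0) {
                    out.append(" cpuMs=").append(cpuNanos / 1000000L);
                }
            }
            out.append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.err.print(dumpAllThreads());
    }
}
```

Taking two dumps a few seconds apart and comparing the `cpuMs` values should show which threads were actually running in between. One caveat: if the whole JVM is paused, code running inside it is paused too, so an in-process dump can only be taken before or after such a pause.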

Follansbee answered 2/5, 2013 at 21:11 Comment(3)
I haven't used jstack before - does it tell you more than a simple thread dump will?Pirogue
@Pirogue yes, it does. Read the documentationFollansbee
Well, he was asking specifically for suggestions on debugging it.Vasta

If you can't measure or debug from inside the code, try looking at the problem from the outside.

I would first try to reproduce the problem. In other words, is there an external event that produces the behavior? Try changing the load on the server. Vary everything you can until you can reproduce the problem.

It may also be a good idea to sniff the network traffic (tcpdump) and look for anything interesting around the time your server hangs.

You can also run it on another operating system to check whether the problem depends on your installation environment.

If you can't reproduce a situation where the problem occurs, try to find situations where it doesn't. For instance, disconnect the server from the network, or shut down all other services.

If none of that changes the behavior of your program, try reducing the complexity of the running code and see whether you can find an internal module that seems to be related to the problem.

Sissy answered 3/5, 2013 at 14:5 Comment(0)

Have you had this problem before and fixed it? Or can you point me in the direction of a tool that can figure out what my server is doing during the time when YourKit can't?

If you have shell access on the server and can see stdout, try taking a thread dump when the server becomes unresponsive. I'm not sure whether this will give you anything different from what jstack (mentioned in the other answer) would.

On Ubuntu: kill -QUIT <java-pid> (will not actually kill the Java process).

http://www.crazysquirrel.com/computing/java/basics/java-thread-dump.jspx
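
Since the bad periods come and go, it may be easier to have the process dump its own threads whenever it notices it has stopped making progress, rather than trying to send the signal by hand at the right moment. Below is a rough sketch of that idea, not a tested recipe: `HeartbeatWatchdog`, `beat()` and both constructor parameters are hypothetical names, and it assumes there is some place in the request-handling code that can call `beat()` each time a piece of work completes.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical watchdog: dumps all thread stacks when no heartbeat has arrived recently. */
final class HeartbeatWatchdog implements Runnable {
    private final AtomicLong lastBeatNanos = new AtomicLong(System.nanoTime());
    private final long thresholdMs; // silence longer than this counts as a stall
    private final long pollMs;      // how often the watchdog checks

    HeartbeatWatchdog(long thresholdMs, long pollMs) {
        this.thresholdMs = thresholdMs;
        this.pollMs = pollMs;
    }

    /** Call this from the request-handling path whenever a unit of work completes. */
    void beat() {
        lastBeatNanos.set(System.nanoTime());
    }

    /** True if at least thresholdMs elapsed between the last heartbeat and now. */
    static boolean isStalled(long lastBeatNanos, long nowNanos, long thresholdMs) {
        return (nowNanos - lastBeatNanos) / 1000000L >= thresholdMs;
    }

    /** Formats the name, state and stack of every live thread. */
    static String formatAllStacks() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            out.append('"').append(t.getName()).append("\" state=")
               .append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            if (isStalled(lastBeatNanos.get(), System.nanoTime(), thresholdMs)) {
                System.err.println("No heartbeat for at least " + thresholdMs
                        + " ms; thread dump follows");
                System.err.print(formatAllStacks());
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                return; // stop quietly when interrupted
            }
        }
    }
}
```

It would be started once at boot as a daemon thread, for example `Thread t = new Thread(new HeartbeatWatchdog(2000, 500), "watchdog"); t.setDaemon(true); t.start();`. Two things to keep in mind: an idle server with no requests will also look "stalled" to this check, and if the watchdog itself only wakes up after the fifteen seconds are over, that suggests the whole JVM was paused rather than just the worker threads.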

Pirogue answered 3/5, 2013 at 14:26 Comment(0)
