Likely and unlikely causes of Heisenbugs in Java?


2

5

I've got a classic example of a Heisenbug, triggered by a condition I hadn't seen before. My legacy application (around 100K SLOC of old code) fails to work properly in one specific instance, and merely enabling JPDA for remote debugging changes the behavior enough to make the application work properly: doing nothing but adding "-Xdebug -Xnoagent -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=6666" to the VM's command line hides the bug (with or without an actual connection). Given that I have a fully repeatable test case, I hate to perturb it much with code changes in case it goes back into hiding. And of course, this is happening in production only.

Usually, I'd immediately assume a threading problem, but a) the behavior is 100% fail vs 100% working and b) there is no explicit use of threads in the code path in question. Our team was then trying to come up with a list of other reasons for this behavior, so I thought perhaps Stack Overflow's group mind could add some more.

Heisenbugs in Java:

Threads: bad synchronization, race conditions, implicit ordering assumptions.
Explicit debugging/logging code: changes in code path cause or prevent the problem. Less frequently, changes in log level can result in timing changes (threading again) and differences in I/O resource use.
Native code libraries, which can drag in non-Java Heisenbug issues.
Expecting finalizers to run predictably.
Improper assumptions about weak references.
Assuming that a fixed-size cache never fills.
Expecting uniqueness of hash codes.
Assuming that == works on Strings (or that it doesn't work on Strings that might be interned in some cases).
VM bug (nah, that never happens ;).
Test methodology error(s), especially when there are hidden variables that depend on test success. (This looks to be our actual problem. Success of one test led to the customer running the next test, which failed because of policy issues. The failure led to running in debug mode according to policy, which resulted in success. Sigh.)
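Three of the items above (String identity, hash code uniqueness, and a fixed-size cache that does fill) are easy to reproduce in isolation. The following is a minimal sketch, not code from the application in question: the class name `IdentityPitfalls`, its helper methods, and the sample values are all hypothetical, and `new String(...)` merely stands in for a value built at runtime, such as one read from a database.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: every name in this class is hypothetical.
class IdentityPitfalls {

    // Reference comparison: true only if both variables point at the same object.
    static boolean sameReference(String a, String b) {
        return a == b;
    }

    // Content comparison: what application logic almost always wants.
    static boolean sameContent(String a, String b) {
        return a != null && a.equals(b);
    }

    // A small LRU cache that silently evicts its eldest entry once it is full.
    static <K, V> Map<K, V> boundedCache(final int maxEntries) {
        return new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public static void main(String[] args) {
        String literal = "pending";
        String atRuntime = new String("pending"); // stands in for a runtime-built value

        System.out.println(sameReference(literal, atRuntime));          // false
        System.out.println(sameReference(literal, atRuntime.intern())); // true
        System.out.println(sameContent(literal, atRuntime));            // true

        // Distinct strings can share a hash code.
        System.out.println("Aa".hashCode() == "BB".hashCode());         // true

        // The third insert pushes the first key out of a two-entry cache.
        Map<String, Integer> cache = boundedCache(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.put("c", 3);
        System.out.println(cache.containsKey("a"));                     // false
    }
}
```

Code that accidentally relies on `==` can appear to work as long as every value happens to be an interned literal, and then fail consistently once the same value arrives from another source.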

Any other cases worth exploring?

Edits:

Yes, the JPDA enable code uses old syntax. I have not tested whether using modern syntax also changes the behavior.
This specific machine is using 1.8.0_45-b14 for the JRE, and HotSpot 64-bit Server VM (build 25.45-b02).
While the question is intended to be general, the instigating issue is real and current. Since the problem is manifesting in a deployed system, I'm torn between wanting to leave it running with -Xdebug as a workaround so that it stays operational and wanting to track down the underlying bug and kill it.
The malfunctioning program is part of a multi-step data processing pipeline. The details shouldn't matter, but it can best be understood as a standalone application that gets some information from a database, then uses it to modify some files. The part that is breaking appears to be that information from the database is not being interpreted properly, possibly anything from a broken ORM to a cache problem. When it is "broken", the application logic that determines whether it has work to do (based on the contents of the DB) makes the wrong choice for all iterations (thousands of iterations, including multiple invocations of the program). When it is "working" (the only difference being whether the VM is running with -Xdebug), the application makes the correct choice for all iterations. It is completely consistent in this configuration. The same code running against different databases does not fail. There is some evidence (predating my involvement with this code) that similar behavior was seen in the past and mysteriously went away after seemingly minor code changes... see "Heisenbug".

Luisluisa answered 8/1, 2016 at 14:51 Comment(5)

I'd bisect the flags if possible. The debug in particular makes me suspect the JIT. – International 8/1, 2016 at 15:5

This question could get some interesting information for many of us. Why would someone want to close it? – Jill 8/1, 2016 at 15:12

@m.thome, would you mind explaining a bit more precisely what you mean by "the behavior is 100% fail vs 100% working"? By this I mean what exactly is the behavior that is failing 100% of the time or passing 100% of the time? Also, what is your application (e.g. desktop, web service, standalone single-threaded command line app, etc.)? I'm not looking for confidential business information but a bit more context would help me narrow down some possible smoking guns in an answer. – Franklynfrankness 8/1, 2016 at 15:42

Also, which Java version are you using? – Franklynfrankness 8/1, 2016 at 15:56

Back to the old-school debugging I guess... logging the state in release mode? I mean, what else can you do? That's the best answer anyone can give I suppose. GL. – Fugleman 8/1, 2016 at 18:4


4

-Xdebug seems like a behavior-changing switch. The question "What are Java command line options to set to allow JVM to be remotely debugged?" claims that adding it switches the VM from JIT compilation to fully interpreted execution. Other Oracle Java docs (for JRockit, admittedly) seem to indicate that it makes the VM slower for some unspecified reason and that it's not appropriate for deployed systems.

I can also imagine a different GC scheme changing the behavior.
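One low-risk way to check whether a debug switch really changes the execution mode is to have the application log how its own JVM was started. This is a sketch under some assumptions: the class `JvmModeReport` and its methods are hypothetical names, and the `java.vm.info` system property is reported by HotSpot (for example "mixed mode") but may be missing on other VMs. The flag check also recognizes `-agentlib:jdwp`, the newer spelling of the JDWP option.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

// Illustrative only: class and method names are hypothetical.
class JvmModeReport {

    // True if any argument looks like one of the JDWP/debug switches.
    static boolean hasDebugFlags(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.equals("-Xdebug")
                    || arg.startsWith("-Xrunjdwp")
                    || arg.startsWith("-agentlib:jdwp")) {
                return true;
            }
        }
        return false;
    }

    // One-line summary of the VM, its reported mode, its JIT and its arguments.
    static String describe() {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        // getCompilationMXBean() returns null when the VM has no compilation system.
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        String jitName = (jit == null) ? "none" : jit.getName();
        return "vm=" + System.getProperty("java.vm.name")
                + " version=" + System.getProperty("java.vm.version")
                + " mode=" + System.getProperty("java.vm.info") // assumption: HotSpot-style property
                + " jit=" + jitName
                + " debugFlags=" + hasDebugFlags(jvmArgs)
                + " args=" + jvmArgs;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Logging this line once at startup in both the failing and the working configuration would at least show whether the two runs differ only in the debug flags or also in the reported mode and compiler.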

Tillis answered 8/1, 2016 at 15:2 Comment(0)


3

I had a case where the failure was provoked by an energy saving feature on the hardware that was never activated when the bug was under study.

Jill answered 8/1, 2016 at 14:54 Comment(2)

That's interesting. What was the hardware and what was the energy saving feature? – Franklynfrankness 8/1, 2016 at 15:53

This was about 15 years ago. It was the thick client of a point of sale, running on a Compaq PC. Every time the operator left the PC, the energy saving feature activated (disks, monitor and processor) and the system hung. We didn't fix it, just deactivated the energy saving. – Jill 8/1, 2016 at 16:0
