I've got a classic example of a Heisenbug, triggered by a condition that I hadn't seen before. My legacy application (around 100K SLOC of old code) fails to work properly in one specific instance, and merely enabling JPDA for remote debugging changes the behavior enough that the application works properly. Doing nothing but adding "-Xdebug -Xnoagent -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=6666" to the VM's command line hides the bug (with or without an actual debugger connection). Given that I have a fully repeatable test case, I hate to perturb it much with code changes in case the bug goes back into hiding. And of course, this is happening in production only.
Usually, I'd immediately assume a threading problem, but a) the behavior is 100% fail vs 100% working and b) there is no explicit use of threads in the code path in question. Our team was then trying to come up with a list of other reasons for this behavior, so I thought perhaps Stack Overflow's group mind could add some more.
Heisenbugs in Java:
- Threads: bad synchronization, race conditions, implicit ordering assumptions.
- Explicit debugging/logging code: changes in code path cause/prevent the problem. Less frequently, changes in log level can result in timing changes (threading again) and differences in I/O resource use.
- Native code libraries can drag in non-Java Heisenbug issues.
- Expecting finalizers to run predictably.
- Improper assumptions about weak references.
- Assuming that a fixed-size cache never fills.
- Expecting hash codes to be unique.
- Assuming that == works on Strings (or that it doesn't work on Strings that might be interned in some cases).
- VM bug (nah, that never happens ;).
- Test methodology error(s), especially when there are hidden variables that depend on test success. (This looks to be our actual problem: the success of one test led to the customer running the next test, which failed because of policy issues. By policy, that failure led to running in debug mode, which resulted in success. Sigh.)
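To make the String and hash code items concrete, here is a minimal sketch. The class and method names (StringPitfalls, runtimeCopy, sameHash) are made up for illustration and are not taken from the application in question.

```java
class StringPitfalls {
    // Builds a String at run time, so the result is not the interned literal.
    static String runtimeCopy(String s) {
        return new String(s.toCharArray());
    }

    // Distinct strings can share a hash code, so hashCode() is not an identity.
    static boolean sameHash(String a, String b) {
        return a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        String literal = "work";
        String copy = runtimeCopy(literal);

        System.out.println(literal == copy);           // false: different objects
        System.out.println(literal.equals(copy));      // true: same contents
        System.out.println(literal == copy.intern());  // true: intern() returns the pooled literal
        System.out.println(sameHash("Aa", "BB"));      // true: both hash to 2112
    }
}
```

Code that compares with == can appear to work for as long as every value happens to come from a literal or an interned source, then fail once a value arrives from a database or a file.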
Any other cases worth exploring?
Edits:
- Yes, the JPDA options above use the old syntax (the modern equivalent is -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=6666). I have not tested whether using the modern syntax also changes the behavior.
- This specific machine is using 1.8.0_45-b14 for the JRE, and HotSpot 64-bit Server VM (build 25.45-b02).
- While the question is intended to be general, the instigating issue is real and current. Since the problem is manifesting in a deployed system, I'm torn between leaving it running with -Xdebug as a workaround so that it stays operational, and tracking down the underlying bug and killing it.
- The malfunctioning program in question is part of a multi-step data processing pipeline. The details shouldn't matter, but it can best be understood as a standalone application that gets some information from a database, then uses it to modify some files. The part of the system that is breaking appears to be the interpretation of the information from the database: the cause could be anything from a broken ORM mapping to a bad cache. When it is "broken", the application logic that determines whether it has work to do (based on the contents of the db) makes the wrong choice for all iterations (thousands of iterations, including multiple invocations of the program). When it is "working" (the only difference being whether the VM is running with -Xdebug or not), the application makes the correct choice for all iterations. It is completely consistent in each configuration. The same code running against different databases does not fail. There is some evidence (predating my involvement with this code) that similar behavior has been seen in the past and that it mysteriously began working after seemingly minor code changes... see "Heisenbug".
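Since the only known difference between "broken" and "working" is the VM command line, one low-impact check is to have each run record the flags it was actually started with, so that a hidden variable such as debug mode shows up in the logs. This is a rough sketch only: JvmFlagAudit and hasDebugAgent are hypothetical names, and it assumes the standard java.lang.management API is available (it is on Java 8).

```java
import java.lang.management.ManagementFactory;
import java.util.List;

class JvmFlagAudit {
    // True if any argument requests the JDWP agent, in the old or the new syntax.
    static boolean hasDebugAgent(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.startsWith("-Xrunjdwp") || arg.startsWith("-agentlib:jdwp")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Arguments passed to the VM itself, not the arguments passed to main().
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JVM args: " + jvmArgs);
        System.out.println("JDWP agent requested: " + hasDebugAgent(jvmArgs));
    }
}
```

Run standalone, or called once at startup, this would at least confirm which runs in a test sequence had the agent enabled. It only reads the flags, so it says nothing about why the agent changes the behavior.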