Killing self fork JVM. PING timeout elapsed
Asked Answered
T

3

5

EDIT: I just wanted to clarify that, in the interest of asking a question that has a known, objective answer, this question is, What does "Killing self fork JVM. PING timeout elapsed" actually mean, e.g. what is pinging what, and why failsafe decide it should exit the test process? Since this is StackOverflow, please don't respond with advice that fixes some VM exits, especially those that result in different behavior from what we see below. E.g. there is no OutOfMemoryError in the console, so I don't think the VM is running out of heap space. If you do answer that way, an SO admin may misunderstand my question as a result and lock it or close it.

We sometimes have a VM crash in a CI build, e.g.:

[INFO] Results:
[INFO] 
[WARNING] Tests run: 8152, Failures: 0, Errors: 0, Skipped: 31
...
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-failsafe-plugin:2.22.1:verify (integration-test) on project app_server: There are test failures.
[ERROR] 
[ERROR] Please refer to /builds/App/Development/App/app_server/target/surefire-reports for the individual test results.
[ERROR] Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
[ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
[ERROR] Command was /bin/sh -c cd /builds/App/Development/App/app_server && /usr/lib/jvm/java-1.8-openjdk/jre/bin/java -Xmx3g -jar /builds/App/Development/App/app_server/target/surefire/surefirebooter7662621916357034130.jar /builds/App/Development/App/app_server/target/surefire 2019-01-09T21-23-07_397-jvmRun1 surefire1770987927673067492tmp surefire_37459604808221437221tmp
[ERROR] Error occurred in starting fork, check output in log
[ERROR] Process Exit Code: 1
[ERROR] Crashed tests:
[ERROR] com.company.blah.blah.ITSomeIntegrationTests
[ERROR]     at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:669)
[ERROR]     at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:282)
[ERROR]     at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:245)
[ERROR]     at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1183)
[ERROR]     at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1011)
[ERROR]     at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:857)
[ERROR]     at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
[ERROR]     at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
[ERROR]     at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:154)
[ERROR]     at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:146)
[ERROR]     at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
[ERROR]     at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
[ERROR]     at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:56)
[ERROR]     at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
[ERROR]     at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:305)
[ERROR]     at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:192)
[ERROR]     at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:105)
[ERROR]     at org.apache.maven.cli.MavenCli.execute(MavenCli.java:954)
[ERROR]     at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
[ERROR]     at org.apache.maven.cli.MavenCli.main(MavenCli.java:192)
[ERROR]     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[ERROR]     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[ERROR]     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[ERROR]     at java.lang.reflect.Method.invoke(Method.java:498)
[ERROR]     at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
[ERROR]     at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
[ERROR]     at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
[ERROR]     at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)

(Of course, "There are test failures" is wrong, as there were no failed tests.)

What I want to know first is, what is failsafe trying to tell us? Here's some information we gather:

First, there is no stack dump or heap dump, but there are .dump files left behind by surefire and failsafe. For the failed project, there's always a .dump file like:

# Created at 2019-02-12T14:31:16.410
System.exit() or native command error interrupted process checker.
java.lang.IllegalStateException: Cannot use PPID 158 process information. Going to use NOOP events.
    at org.apache.maven.surefire.booter.PpidChecker.checkProcessInfo(PpidChecker.java:155)
    at org.apache.maven.surefire.booter.PpidChecker.isProcessAlive(PpidChecker.java:124)
    at org.apache.maven.surefire.booter.ForkedBooter$2.run(ForkedBooter.java:214)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

# Created at 2019-02-12T14:47:59.174
Killing self fork JVM. PING timeout elapsed.

(Is this a breakdown in communication between the failsafe and Maven processes?)

Also, we print the stack trace when System.exit() is called, and in every such failure it looks like:

java.lang.Exception: System.exit() or similar method called:
    at com.app.IntegrationTestSetup$1.checkPermission(IntegrationTestSetup.java:78)
    at java.lang.SecurityManager.checkExit(SecurityManager.java:761)
    at java.lang.Runtime.halt(Runtime.java:273)
    at org.apache.maven.surefire.booter.ForkedBooter.kill(ForkedBooter.java:311)
    at org.apache.maven.surefire.booter.ForkedBooter.kill(ForkedBooter.java:305)
    at org.apache.maven.surefire.booter.ForkedBooter.access$300(ForkedBooter.java:68)
    at org.apache.maven.surefire.booter.ForkedBooter$5.run(ForkedBooter.java:285)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

i.e. some surefire code (used by the failsafe plugin, in this case) killed the failsafe process (the JVM that surefire / failsafe starts for tests in a process that is a child of the Maven process).

We're using version 2.22.1 of both failsafe and surefire.

# mvn -v
Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z)
Maven home: /usr/share/java/maven-3
Java version: 1.8.0_191, vendor: Oracle Corporation, runtime: /usr/lib/jvm/java-1.8-openjdk/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.15.0-45-generic", arch: "amd64", family: "unix"
Throughput answered 18/2, 2019 at 21:46 Comment(3)
See also: stackoverflow.com/questions/23260057Bromberg
For discussion, I also asked on the Maven users mailing list: lists.apache.org/thread.html/…Throughput
I have this intermittent problem too - I suspect its a bug caused by some race condition as all our tests complete OK. We had forkCount set to run 6 parallel forks. When we cut to 4 the problem seems to go away. I annoying though, hopefully bug will get fixed so we can speed up again.Nymphalid
T
4

All of this is discussed in mailing list, but I'll summarize below.

Background

By default (and in our case), Maven, running in one JVM, forks a child process running another JVM to run tests. The parent gives the child commands via the child's stdin. Specifically, the parent sends a NOOP command to the child to let it know it's still alive.

Answer

ForkedBooter.java sets up two threads indirectly. commandReader reads commands from the parent process via stdin. listenToShutdownCommands adds a listener to commandReader to set an AtomicBoolean pingDone to true when the NOOP command is received. listenToShutdownCommands also schedules a job to run every 30s that does something like this (modified for easy reading):

boolean hasPing = pingDone.getAndSet( false );
if ( !hasPing ) {
    exit( 1 );
    log( "Killing self fork JVM. PING timeout elapsed." );

So the error message is claiming that the child did not read a NOOP from the parent.

Cause of the error

You may be able to predict what the problem is from the above description alone. I added logging to see what was going on in my case, and found that sometimes there would a pause minutes long during which commandReader would not read any NOOPs (and usually pingJob would not run either). When there was time for the two to run, pingJob may run twice in a row thereafter before commandReader gets its turn.

Simply, there is nothing in this code that ensures the OS will run the thread that reads from stdin often enough. It's possible for there to be a 3m pause in one thread because we've asked the OS to run a dozen other threads with equal priority, all with things to do--they're not sleeping, yielding, or blocking for IO. We have a heavyweight test that did indeed have a couple of 3m pauses, even on a 4-core processor.

Throughput answered 22/3, 2019 at 23:9 Comment(1)
Beyond the scope of the answer, I should add that a 3m pause on a 4-core processor is a little extreme for a unit test--sounds more like a perf or load test. If you're running a unit test, you can probably optimize the error away.Throughput
B
2

I've found two ways to avoid this problem, neither is really satisfactory.

  1. Downgrade surefire to 2.18.1
  2. Limit logging output in tests. For example, set the log level to WARN and not DEBUG.

I'd really like to know what would be the true solution to this problem though.

Bromberg answered 1/3, 2019 at 16:35 Comment(4)
Thanks. We do have a heck of a lot of logging output.Throughput
if you have any sources to cite or evidence to present as to why those solutions worked, that would of course be helpful. We don't encounter the problem often enough to be able to infer via trial-and-error that any particular solution actually works; we only know that a solution does not work when we experience a crash.Throughput
I have found this during my tests, but I don't have a minimal working/failing project to share yet. In my case redirecting stdout and stderr to files works around the issue, so I guess it's somehow related to console output (it's noticeably slower than usual too). After posting I noticed that you are on Linux and I'm on Windows 7 so we probably have hit different bugs.Bromberg
I solved this issue by upgrading surefire to 3.0.0-M5Cuttie
I
0

Another possibility if you're running alpine or busybox:

Our builds were failing with the error:

System.exit() or native command error interrupted process checker. java.lang.IllegalStateException: Cannot use PPID xxx process information. Going to use NOOP events. at org.apache.maven.surefire.booter.PpidChecker.checkProcessInfo(PpidChecker.java:155)

checking the code it looks like it runs ps -p:

https://github.com/apache/maven-surefire/blob/47eb1974ef2fb77c621e5cb0c47ac10ab8f4753d/surefire-booter/src/main/java/org/apache/maven/surefire/booter/PpidChecker.java#L186C59-L186C71

installing procps in alipine seems to fix it

I've raised a feature request: https://issues.apache.org/jira/browse/SUREFIRE-2187

Involution answered 12/7, 2023 at 8:23 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.