mavenmaven-surefire-pluginmaven-failsafe-plugin

Killing self fork JVM. PING timeout elapsed


EDIT: I just wanted to clarify that, in the interest of asking a question that has a known, objective answer, this question is, What does "Killing self fork JVM. PING timeout elapsed" actually mean, e.g. what is pinging what, and why failsafe decide it should exit the test process? Since this is StackOverflow, please don't respond with advice that fixes some VM exits, especially those that result in different behavior from what we see below. E.g. there is no OutOfMemoryError in the console, so I don't think the VM is running out of heap space. If you do answer that way, an SO admin may misunderstand my question as a result and lock it or close it.

We sometimes have a VM crash in a CI build, e.g.:

[INFO] Results:
[INFO] 
[WARNING] Tests run: 8152, Failures: 0, Errors: 0, Skipped: 31
...
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-failsafe-plugin:2.22.1:verify (integration-test) on project app_server: There are test failures.
[ERROR] 
[ERROR] Please refer to /builds/App/Development/App/app_server/target/surefire-reports for the individual test results.
[ERROR] Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
[ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
[ERROR] Command was /bin/sh -c cd /builds/App/Development/App/app_server && /usr/lib/jvm/java-1.8-openjdk/jre/bin/java -Xmx3g -jar /builds/App/Development/App/app_server/target/surefire/surefirebooter7662621916357034130.jar /builds/App/Development/App/app_server/target/surefire 2019-01-09T21-23-07_397-jvmRun1 surefire1770987927673067492tmp surefire_37459604808221437221tmp
[ERROR] Error occurred in starting fork, check output in log
[ERROR] Process Exit Code: 1
[ERROR] Crashed tests:
[ERROR] com.company.blah.blah.ITSomeIntegrationTests
[ERROR]     at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:669)
[ERROR]     at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:282)
[ERROR]     at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:245)
[ERROR]     at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1183)
[ERROR]     at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1011)
[ERROR]     at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:857)
[ERROR]     at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
[ERROR]     at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
[ERROR]     at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:154)
[ERROR]     at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:146)
[ERROR]     at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
[ERROR]     at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
[ERROR]     at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:56)
[ERROR]     at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
[ERROR]     at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:305)
[ERROR]     at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:192)
[ERROR]     at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:105)
[ERROR]     at org.apache.maven.cli.MavenCli.execute(MavenCli.java:954)
[ERROR]     at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
[ERROR]     at org.apache.maven.cli.MavenCli.main(MavenCli.java:192)
[ERROR]     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[ERROR]     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[ERROR]     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[ERROR]     at java.lang.reflect.Method.invoke(Method.java:498)
[ERROR]     at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
[ERROR]     at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
[ERROR]     at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
[ERROR]     at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)

(Of course, "There are test failures" is wrong, as there were no failed tests.)

What I want to know first is, what is failsafe trying to tell us? Here's some information we gather:

First, there is no stack dump or heap dump, but there are .dump files left behind by surefire and failsafe. For the failed project, there's always a .dump file like:

# Created at 2019-02-12T14:31:16.410
System.exit() or native command error interrupted process checker.
java.lang.IllegalStateException: Cannot use PPID 158 process information. Going to use NOOP events.
    at org.apache.maven.surefire.booter.PpidChecker.checkProcessInfo(PpidChecker.java:155)
    at org.apache.maven.surefire.booter.PpidChecker.isProcessAlive(PpidChecker.java:124)
    at org.apache.maven.surefire.booter.ForkedBooter$2.run(ForkedBooter.java:214)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

# Created at 2019-02-12T14:47:59.174
Killing self fork JVM. PING timeout elapsed.

(Is this a breakdown in communication between the failsafe and Maven processes?)

Also, we print the stack trace when System.exit() is called, and in every such failure it looks like:

java.lang.Exception: System.exit() or similar method called:
    at com.app.IntegrationTestSetup$1.checkPermission(IntegrationTestSetup.java:78)
    at java.lang.SecurityManager.checkExit(SecurityManager.java:761)
    at java.lang.Runtime.halt(Runtime.java:273)
    at org.apache.maven.surefire.booter.ForkedBooter.kill(ForkedBooter.java:311)
    at org.apache.maven.surefire.booter.ForkedBooter.kill(ForkedBooter.java:305)
    at org.apache.maven.surefire.booter.ForkedBooter.access$300(ForkedBooter.java:68)
    at org.apache.maven.surefire.booter.ForkedBooter$5.run(ForkedBooter.java:285)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

i.e. some surefire code (used by the failsafe plugin, in this case) killed the failsafe process (the JVM that surefire / failsafe starts for tests in a process that is a child of the Maven process).

We're using version 2.22.1 of both failsafe and surefire.

# mvn -v
Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z)
Maven home: /usr/share/java/maven-3
Java version: 1.8.0_191, vendor: Oracle Corporation, runtime: /usr/lib/jvm/java-1.8-openjdk/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.15.0-45-generic", arch: "amd64", family: "unix"

Solution

  • All of this is discussed in mailing list, but I'll summarize below.

    Background

    By default (and in our case), Maven, running in one JVM, forks a child process running another JVM to run tests. The parent gives the child commands via the child's stdin. Specifically, the parent sends a NOOP command to the child to let it know it's still alive.

    Answer

    ForkedBooter.java sets up two threads indirectly. commandReader reads commands from the parent process via stdin. listenToShutdownCommands adds a listener to commandReader to set an AtomicBoolean pingDone to true when the NOOP command is received. listenToShutdownCommands also schedules a job to run every 30s that does something like this (modified for easy reading):

    boolean hasPing = pingDone.getAndSet( false );
    if ( !hasPing ) {
        exit( 1 );
        log( "Killing self fork JVM. PING timeout elapsed." );
    

    So the error message is claiming that the child did not read a NOOP from the parent.

    Cause of the error

    You may be able to predict what the problem is from the above description alone. I added logging to see what was going on in my case, and found that sometimes there would a pause minutes long during which commandReader would not read any NOOPs (and usually pingJob would not run either). When there was time for the two to run, pingJob may run twice in a row thereafter before commandReader gets its turn.

    Simply, there is nothing in this code that ensures the OS will run the thread that reads from stdin often enough. It's possible for there to be a 3m pause in one thread because we've asked the OS to run a dozen other threads with equal priority, all with things to do--they're not sleeping, yielding, or blocking for IO. We have a heavyweight test that did indeed have a couple of 3m pauses, even on a 4-core processor.