javaspring-bootmultithreadingjava-21virtual-threads

Spring Boot application gets stuck when virtual threads are used on Java 21


I've run into an issue related to performance of Spring Boot 3 application with virtual threads enabled.

TL;DR reproduction steps:

  1. Checkout https://github.com/stsypanov/concurrency-demo
  2. Run DependencyApplication and ConcurrencyDemoApplication
  3. When both apps are up run StuckApplicationTest
  4. It will take about 1-2 minutes for the test to complete
  5. Now go to demo-service/application.yml and set spring.threads.virtual.enabled: true (by default it's false).
  6. Restart ConcurrencyDemoApplication
  7. Run StuckApplicationTest again
  8. Run YourKit profiler and connect to ConcurrencyDemoApplication, almost immediately you'll see a warning message about potential deadlock and the application itself hangs as all threads of its ForkJoinPool have Waiting status.

More detailed description:

General set-up:

To reproduce the issue you need a test doing not that many concurrent calls to a Spring Boot -based microservice, e.g. calling /actuator/health:

@Test
void run() {
    var restTemplate = new RestTemplate();
    var latch = new CountDownLatch(1);
    var executor = Executors.newVirtualThreadPerTaskExecutor();
    try (executor) {
        for (int i = 0; i < 100; i++) {
            executor.submit(() -> {
                waitOn(latch);
                assertNotNull(restTemplate.getForEntity("http://localhost:8081/actuator/health", ResponseEntity.class));
            });
        }
        latch.countDown();
    }
    assertTrue(executor.isTerminated());
}

private static void waitOn(CountDownLatch latch) {
    try {
        latch.await();
    } catch (InterruptedException e) {
        throw new RuntimeException(e);
    }
}

This test sends GET requests to, let's say, Service A. In Service A I have a pretty simple instance of HealthIndicator and Feign client:

@Component
@RequiredArgsConstructor
public class DownstreamServiceHealthIndicator implements HealthIndicator {
    private final HealthClient healthClient;

    @Override
    public Health health() {
        var response = healthClient.checkHealth();
        if (response.getStatusCode().is2xxSuccessful()) {
            return new Health.Builder().up().build();
        }
        return new Health.Builder().down().withDetails(Map.of("response", response)).build();
    }
}

@FeignClient(name = "healthClient", url = "http://localhost:8087/actuator/health", configuration = InternalFeignConfiguration.class)
public interface HealthClient {
    @GetMapping
    ResponseEntity<String> checkHealth();
}

public class InternalFeignConfiguration {
    @Bean
    public Client client() {
        return new ApacheHttpClient(HttpClients.createDefault());
    }
}

Again, the test concurrently calls /actuator/health of Service A via RestTemplate and Service A calls /actuator/health of Service B via Feign client. Service B consists of the application main class and application.yml (see the code in the repository specified above), declaring health endpoint.

When you run the system with default settings it's OK. It takes ~1,5 second for the test to complete, but everything else is fine.

However, having virtual threads on, the Service A gets stuck and if you connect to it with e.g. YourKit profiler you'll get a warning message about potential deadlock with this stack trace:

+-----------------------------------------------------------------------------------------------------------------------------+
|                                                            Name                                                             |
+-----------------------------------------------------------------------------------------------------------------------------+
|  +---Read-Updater Frozen for at least 10s <Ignore a false positive>                                                         |
|  | +---jdk.internal.misc.Unsafe.park(boolean, long) Unsafe.java (native)                                                    |
|  | +---java.util.concurrent.locks.LockSupport.park() LockSupport.java:371                                                   |
|  | +---java.util.concurrent.LinkedTransferQueue$DualNode.await(Object, long, Object, boolean) LinkedTransferQueue.java:458  |
|  | +---java.util.concurrent.LinkedTransferQueue.xfer(Object, long) LinkedTransferQueue.java:613                             |
|  | +---java.util.concurrent.LinkedTransferQueue.take() LinkedTransferQueue.java:1257                                        |
|  | +---sun.nio.ch.Poller.updateLoop() Poller.java:286                                                                       |
|  | +---sun.nio.ch.Poller$$Lambda.0x0000024081474670.run()                                                                   |
|  | +---java.lang.Thread.runWith(Object, Runnable) Thread.java:1596                                                          |
|  | +---java.lang.Thread.run() Thread.java:1583                                                                              |
|  | +---jdk.internal.misc.InnocuousThread.run() InnocuousThread.java:186                                                     |
|  |                                                                                                                          |
|  +---spring.cloud.inetutils Frozen for at least 10s <Ignore a false positive>                                               |
|  | +---java.net.Inet6AddressImpl.getHostByAddr(byte[]) Inet6AddressImpl.java (native)                                       |
|  | +---java.net.InetAddress$PlatformResolver.lookupByAddress(byte[]) InetAddress.java:1225                                  |
|  | +---java.net.InetAddress.getHostFromNameService(InetAddress, boolean) InetAddress.java:840                               |
|  | +---java.net.InetAddress.getHostName(boolean) InetAddress.java:782                                                       |
|  | +---java.net.InetAddress.getHostName() InetAddress.java:754                                                              |
|  | +---org.springframework.cloud.commons.util.InetUtils$$Lambda.0x0000024081187240.call()                                   |
|  | +---java.util.concurrent.FutureTask.run() FutureTask.java:317                                                            |
|  | +---java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) ThreadPoolExecutor.java:1144            |
|  | +---java.util.concurrent.ThreadPoolExecutor$Worker.run() ThreadPoolExecutor.java:642                                     |
|  | +---java.lang.Thread.runWith(Object, Runnable) Thread.java:1596                                                          |
|  | +---java.lang.Thread.run() Thread.java:1583                                                                              |
|  |                                                                                                                          |
|  +---Write-Updater Frozen for at least 10s <Ignore a false positive>                                                        |
|    +---jdk.internal.misc.Unsafe.park(boolean, long) Unsafe.java (native)                                                    |
|    +---java.util.concurrent.locks.LockSupport.park() LockSupport.java:371                                                   |
|    +---java.util.concurrent.LinkedTransferQueue$DualNode.await(Object, long, Object, boolean) LinkedTransferQueue.java:458  |
|    +---java.util.concurrent.LinkedTransferQueue.xfer(Object, long) LinkedTransferQueue.java:613                             |
|    +---java.util.concurrent.LinkedTransferQueue.take() LinkedTransferQueue.java:1257                                        |
|    +---sun.nio.ch.Poller.updateLoop() Poller.java:286                                                                       |
|    +---sun.nio.ch.Poller$$Lambda.0x0000024081474670.run()                                                                   |
|    +---java.lang.Thread.runWith(Object, Runnable) Thread.java:1596                                                          |
|    +---java.lang.Thread.run() Thread.java:1583                                                                              |
|    +---jdk.internal.misc.InnocuousThread.run() InnocuousThread.java:186                                                     |
+-----------------------------------------------------------------------------------------------------------------------------+

As it appears from the above, the application's bottleneck is in this method:

// class org.springframework.cloud.commons.util.InetUtils

public HostInfo convertAddress(final InetAddress address) {
    HostInfo hostInfo = new HostInfo();
    Future<String> result = this.executorService.submit(address::getHostName); // <---

    String hostname;
    try {
        hostname = result.get(this.properties.getTimeoutSeconds(), TimeUnit.SECONDS); // <---
    }
    catch (Exception e) {
        this.log.info("Cannot determine local hostname");
        hostname = "localhost";
    }
    hostInfo.setHostname(hostname);
    hostInfo.setIpAddress(address.getHostAddress());
    return hostInfo;
}

and the root cause is likely to be SingleThreadExecutor used to detect host name when doing a remote call:

public InetUtils(final InetUtilsProperties properties) {
    this.properties = properties;
    this.executorService = Executors.newSingleThreadExecutor(r -> {
        Thread thread = new Thread(r);
        thread.setName(InetUtilsProperties.PREFIX);
        thread.setDaemon(true);
        return thread;
    });
}

In practice that means only one task can be executed at a moment, i.e. when parallel threads are doing simultaneous calls they resolve the host name sequentially!

What puzzles me is the fact that if I turn the virtual threads off there'll be no hung in the application.

Can anyone explain whether my theory about the root cause of the issue is correct and why the application counter-intuitively survives on platform (heavyweight) threads and dies on virtual (i.e. lightweight) ones?


Solution

  • My first assumption was wrong, this issue is not about InetUtils. This is partially a library, partially a JDK issue.

    Running the application with -Djdk.tracePinnedThreads=full I've received a stack trace of pinned threads, it's quite long and available on https://github.com/stsypanov/concurrency-demo/blob/master/pinned-threads.txt.

    There pay attention to lines 12 and 92:

    12 org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:319)
    92 org.apache.http.pool.AbstractConnPool.getPoolEntryBlocking(AbstractConnPool.java:391)
    

    Investigating the stack trace we find out that eventually the threads get stuck in org.apache.http.pool.AbstractConnPool which is a connection pool used by org.apache.http.impl.client.InternalHttpClient.

    The pool can be instantiated only with maxTotal = 20 (maximum number of connections), to me it's a bug, but anyway, By default ForkJoinPool of Tomcat is constructed with parallelism (i.e. carrier thread count) 20, so let's verify whether the application will survive with smaller number of connections. In StuckApplicationTest we reduce the number of threads down to 20 and indeed the tests does not hang any more even having virtual threads enabled on the server side.

    This part is clear now, so if you face the same obstacle you need either to use feign's default HTTP client, or migrate to org.apache.httpcomponents.client5:httpclient5.

    But why does the same code work with platform threads? I've asked this question in loom-dev mailing list and got the answer.

    As Alan Bateman mentioned:

    Further up in the stack is AbstractConnPool.get with the synchronized block that is causing the issue here. A limitation right now is that a virtual thread cannot unmount when inside a synchronized statement or method. In this case it seems that all carriers are pinned by virtual threads that are all blocked in the same place. The other short term fix is to this to migrate this code to also use a j.u.concurrent lock.

    There is work under way to remove this limitation. There are EA builds available that we looking for help testing before the changes are proposed for the main line and a future JDK release.

    Here's the buggy snippet of AbstractConnPool:

    @Override
    public E get(final long timeout, final TimeUnit timeUnit) {
      for (;;) {
        synchronized (this) { // here we cannot unmount VT
          try {
            final E leasedEntry = getPoolEntryBlocking(); // here VT are blocked
            return leasedEntry;
          } catch (final IOException ex) {}
        }
      }
    }
    

    I've checked out the EAP build from https://jdk.java.net/loom/, run the application on it and there is no hanging now :)

    I'm not the only one facing the problems reproducible with VT (see e.g. https://mail.openjdk.org/pipermail/loom-dev/2024-July/006939.html), so I hope this will help to work similar issues around.

    P.S. As @silverfox pointed out

    Apache HttpComponents Client 5.4-beta1 is released and compatible with Java Virtual Threads and Java 21 Runtime.

    P.S.S. Another work-around is inflation of Tomcat's ForkJoinPool, see Can I configure parallelism of ForkJoinPool created for Tomcat in Spring Boot application?. Pay attention, that setting bigger number of connections won't help.

    P.S.S.S. Java 24 is likely to be released in March 2025 and this would be the final solution for this issue.