Java 21 virtual thread executor performing worse than executor with pooled OS threads?

I have just upgraded our Spring Boot applications to Java 21. As part of that, I have also made changes to use virtual threads, both when serving API requests and when doing async operations internally using executors.

For one use case, an Executor powered by virtual threads seems to perform worse than a ForkJoinPool powered by OS threads. This use case sets some MDC values and calls an external system over HTTP.

This is my pseudo-ish code:

List<...> ... = executorService.submit(
                () -> IntStream.rangeClosed(-from, to)
                        .mapToObj(i -> ...)
                        .parallel()
                        .map(... -> {
                            try {
                                service.setSomeThreadLocalString(...);
                                MDC.put(..., ...);
                                MDC.put(..., ...);

                                return service.call(...);
                            } finally {
                                service.removeSomeThreadLocalString(...);
                                MDC.remove(...);
                                MDC.remove(...);
                            }
                        })
                        .toList())
        .get();

Where executorService is either:

  1. new ForkJoinPool(30)
  2. Executors.newVirtualThreadPerTaskExecutor()

Option 1 seems to perform a lot better than option 2; sometimes it is twice as fast. I have run this test in a Java 21 environment, with 10 parallel executions. Where option 1 normally takes 800-1000 ms, option 2 takes 1500-2000 ms.

If it makes any difference, I have this property enabled in Spring Boot:

spring:
  threads:
    virtual:
      enabled: true

Any ideas why this is happening?

Ablation asked 26/2, 2024 at 12:51 Comment(9)
Yes, virtual threads do have some overhead (and by default use only as many platform threads as there are available CPU cores), and it's recommended to use platform threads for CPU-intensive computations. – Bluejacket
What happens when using ForkJoinPool without specifying a parallelism parameter? Can you try setting the virtual thread pool size to match the parallelism of the ForkJoinPool? Also, extending my previous comment: there's essentially no point in using virtual threads in code that doesn't block. – Bluejacket
@dan1stiscrying Thanks for the answer. I tried creating a ForkJoinPool without a parallelism argument (new ForkJoinPool()) and the performance was much worse: it takes about 3 seconds. Which virtual thread pool size do you mean I should match to the parallelism of the ForkJoinPool? And yes, but the HTTP call to the external service does block. – Ablation
How many CPUs does your machine have? (Or, if you are running on Kubernetes, how many CPUs do you allocate to your container?) By default, Java allocates OS threads for executing virtual threads based on the number of CPUs. Especially when running on a VM or Kubernetes, you may need to explicitly configure the number of CPUs Java must assume and/or the number of OS threads it must allocate for virtual thread execution, because the values derived from the VM quota are usually woefully inadequate. – Ozonide
@MarkRotteveel In this case I have 1 CPU for each replica (running in Kubernetes). Okay, I'll look into how to configure that. By the way, when running locally on my monster machine I get the same performance with both options above. – Ablation
@Ablation Right now, you're comparing new ForkJoinPool(30) (parallelism = 30) with a parallelism of 1 for virtual threads. It is probably worthwhile to pass -Djdk.virtualThreadScheduler.parallelism=10 and/or -XX:ActiveProcessorCount=10 (or at least something bigger than 1 for either or both) to your java command line. – Ozonide
Changing ActiveProcessorCount may also have other benefits for your application, BTW. – Ozonide
Thank you @MarkRotteveel, I will try the ActiveProcessorCount option. – Ablation
Yes, it did indeed improve performance, @MarkRotteveel. Setting only -XX:ActiveProcessorCount=10 seemed to give less stable results than setting both that and -Djdk.virtualThreadScheduler.parallelism=10. Thank you. If you write an answer to this question, I will mark it as the accepted one. I'm not sure why I am getting downvoted; I think this issue could help others in their migration to virtual threads as well. – Ablation

You are assuming that submitting a parallel stream operation as a job to another executor service will make the Stream implementation use that executor service. This is not the case.

There is an undocumented trick to make a parallel stream operation use a different Fork/Join pool by initiating it from a worker thread of that pool. But the executor service producing virtual threads is not a Fork/Join pool.
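
A minimal sketch of that trick, for illustration only. It relies on the Fork/Join behaviour that a task forked from a worker thread of a pool is queued in that same pool, which parallel streams happen to inherit; it is not a documented guarantee of the Stream API. The class and method names are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.IntStream;

public class ParallelStreamInCustomPool {
    // Runs a parallel stream from inside a task of a dedicated pool and
    // records the names of all threads that process its elements.
    static Set<String> workerNames(int parallelism) {
        Set<String> names = ConcurrentHashMap.newKeySet();
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            // The stream is started on a worker thread of 'pool', so the
            // subtasks it forks are queued in 'pool', not in the common pool.
            pool.submit(() -> IntStream.rangeClosed(0, 1_000).parallel()
                    .peek(i -> names.add(Thread.currentThread().getName()))
                    .sum())
                .join();
        } finally {
            pool.shutdown();
        }
        return names;
    }

    public static void main(String[] args) {
        // Expect names of the form ForkJoinPool-<n>-worker-<m>, none from the common pool
        workerNames(4).forEach(System.out::println);
    }
}
```

This is essentially what option 1 in the question does with new ForkJoinPool(30), which is why that variant really runs with up to 30 threads.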

So when you initiate the parallel stream operation from a virtual thread, the parallel stream will use the common pool for the operation. In other words, you are still using platform threads; the only virtual thread involved is the initiating one, which takes part because the Stream implementation also performs work in the caller thread.

So when I use the following program

import java.util.concurrent.Executors;
import java.util.stream.IntStream;

public class ParallelStreamInsideVirtualThread {
    public static void main(String[] args) throws Exception {
        // The executor starts a new virtual thread for each submitted task
        try (var executorService = Executors.newVirtualThreadPerTaskExecutor()) {
            var job = executorService.submit(() -> {
                Thread init = Thread.currentThread(); // the initiating virtual thread
                return IntStream.rangeClosed(0, 10).parallel()
                    .peek(x -> printThread(init)) // report which thread processes each element
                    .mapToObj(String::valueOf)
                    .toList();
            });
            job.get();
        }
    }

    static void printThread(Thread initial) {
        Thread t = Thread.currentThread();
        System.out.println((t.isVirtual() ? "Virtual  " : "Platform ")
            + (t == initial ? "(initiator)" : t.getName()));
    }
}

it will print something like

Virtual  (initiator)
Virtual  (initiator)
Platform ForkJoinPool.commonPool-worker-1
Platform ForkJoinPool.commonPool-worker-3
Platform ForkJoinPool.commonPool-worker-2
Platform ForkJoinPool.commonPool-worker-4
Virtual  (initiator)
Platform ForkJoinPool.commonPool-worker-1
Platform ForkJoinPool.commonPool-worker-3
Platform ForkJoinPool.commonPool-worker-5
Platform ForkJoinPool.commonPool-worker-2

In short, you are not measuring the performance of virtual threads at all.
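
If the goal is to run each blocking call on its own virtual thread, one option is to drop the parallel stream and submit one task per element to the virtual-thread executor. The sketch below assumes exactly that; Thread.sleep stands in for the HTTP call, and the CONTEXT ThreadLocal is a hypothetical stand-in for the MDC and thread-local values from the question. Class and method names are hypothetical too.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.IntStream;

public class TaskPerCall {
    // Hypothetical stand-in for a per-request value such as an MDC entry.
    static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    // Hypothetical stand-in for the blocking HTTP call.
    static String blockingCall(int i) throws InterruptedException {
        Thread.sleep(100); // blocks; the virtual thread unmounts from its carrier
        return i + ":" + CONTEXT.get() + ":" + Thread.currentThread().isVirtual();
    }

    static List<String> callAll(int from, int to) {
        List<Future<String>> futures;
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            // One task, and therefore one virtual thread, per element.
            futures = IntStream.rangeClosed(from, to)
                .mapToObj(i -> executor.submit(() -> {
                    CONTEXT.set("request-" + i); // set on the thread that does the work
                    try {
                        return blockingCall(i);
                    } finally {
                        CONTEXT.remove();
                    }
                }))
                .toList();
        } // close() waits until all submitted tasks have finished
        return futures.stream().map(Future::resultNow).toList();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        List<String> results = callAll(1, 30);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(results.size() + " calls took " + millis + " ms");
    }
}
```

Because a sleeping virtual thread releases its carrier, the 30 calls overlap even with a scheduler parallelism of 1. Whether a real HTTP client behaves the same way depends on whether it blocks in a way that lets the virtual thread unmount.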

Emulsify answered 26/2, 2024 at 15:4 Comment(6)
The carrier thread of a virtual thread comes from a dedicated ForkJoinPool. Why is this pool not used for the parallel stream instead of the common one? – Stratus
@Stratus You do not want a single parallel stream to be able to stall all virtual threads… Do not get distracted by the fact that those two pools happen to have the same type in this particular implementation. They serve entirely different purposes. – Emulsify
I have no intention of doing that. The question remains. – Stratus
@Stratus When a parallel stream uses the same pool as the virtual threads, it will block all virtual threads, as the stream’s tasks won’t release a thread while they are running. Since you “have no intention of doing that”, you also do not want the stream to use the same pool as the virtual threads. As already said, these pools serve different purposes. – Emulsify
Just thinking about the undocumented trick that was mentioned: "use a different Fork/Join pool by initiating it from a worker thread of that pool". Since the carrier thread of the virtual thread that initiates the stream originates from a non-common Fork/Join pool, why isn't that pool used? Can you please explain it again? – Stratus
When you commence a stream operation in a virtual thread, the virtual thread is not a thread of a Fork/Join pool. The carrier thread is irrelevant. The fact that this particular implementation of virtual threads uses a Fork/Join pool behind the scenes is an unimportant implementation detail. If you forget about that fact, all the confusion immediately disappears. – Emulsify
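
To make the last point concrete: the Javadoc of ForkJoinTask.fork() says a task is executed in the pool the current task is running in, or in the common pool if not inForkJoinPool(). The sketch below (class and method names hypothetical) asks ForkJoinTask.inForkJoinPool() once from a virtual thread and once from a task of an explicit ForkJoinPool.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.Future;

public class VirtualThreadIsNotAPoolWorker {
    // Asks ForkJoinTask.inForkJoinPool() from inside a virtual thread.
    static boolean inPoolFromVirtualThread() {
        Future<Boolean> result;
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            result = executor.submit(() -> ForkJoinTask.inForkJoinPool());
        } // close() waits for the task to finish
        return result.resultNow();
    }

    // Asks the same question from inside a task of an explicit ForkJoinPool.
    static boolean inPoolFromPoolWorker() {
        ForkJoinPool pool = new ForkJoinPool(2);
        try {
            return pool.submit(() -> ForkJoinTask.inForkJoinPool()).join();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("virtual thread:    " + inPoolFromVirtualThread());
        System.out.println("ForkJoinPool task: " + inPoolFromPoolWorker());
    }
}
```

On Java 21 this should report false for the virtual thread and true for the pool task: Thread.currentThread() inside a virtual thread is the virtual thread itself, not its carrier, so a parallel stream started there falls back to the common pool.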

As Holger indicates in the comment below and in their answer, the actual problem is that the parallel stream does not actually run on virtual threads, but on the common pool. So although the CPU count (and -XX:ActiveProcessorCount) will influence performance, because it also determines the parallelism of the common pool, setting jdk.virtualThreadScheduler.parallelism is unlikely to have much effect here.

I'll leave my answer for reference, but be aware that it doesn't fully apply to this specific case.


As you indicated in the comments, you're using a Kubernetes container configured with 1 CPU, so effectively you're comparing a fork-join pool with a parallelism of 30 against virtual thread execution with a parallelism of 1.

By default, Java allocates an OS thread pool (a work-stealing fork-join pool) with parallelism equal to the processor count for virtual thread execution (see also JEP 444: Virtual Threads). If you're running on Kubernetes, and the pod has 1 CPU (or less than 1 CPU), Java will assume a processor count of 1.

You can tell Java to assume a different number of processors by passing the -XX:ActiveProcessorCount=n setting, with n the number of processors Java should assume. Alternatively, or additionally, you can set the system property jdk.virtualThreadScheduler.parallelism (using -Djdk.virtualThreadScheduler.parallelism=n where n is the desired parallelism). Both of these need to be passed to the java executable on the command line.
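
A small diagnostic sketch (class name hypothetical) for checking what the JVM actually assumes inside the container. As far as I know the virtual thread scheduler's parallelism is not exposed through a public API, so this only reads the jdk.virtualThreadScheduler.parallelism property; when it is absent, JEP 444 says the scheduler defaults to the number of available processors.

```java
import java.util.concurrent.ForkJoinPool;

public class SchedulerSettings {
    // Processor count the JVM assumes (affected by container limits and -XX:ActiveProcessorCount).
    static int processors() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Parallelism of the common pool, which parallel streams use.
    static int commonPoolParallelism() {
        return ForkJoinPool.getCommonPoolParallelism();
    }

    // Explicit override for the virtual thread scheduler, or null if not set
    // (the scheduler then defaults to the processor count).
    static String virtualSchedulerParallelism() {
        return System.getProperty("jdk.virtualThreadScheduler.parallelism");
    }

    public static void main(String[] args) {
        System.out.println("availableProcessors:     " + processors());
        System.out.println("common pool parallelism: " + commonPoolParallelism());
        System.out.println("jdk.virtualThreadScheduler.parallelism: " + virtualSchedulerParallelism());
    }
}
```

Running it once as java SchedulerSettings.java and once as java -XX:ActiveProcessorCount=10 -Djdk.virtualThreadScheduler.parallelism=10 SchedulerSettings.java should show the effect of each flag. Note that the common pool, which the parallel stream in the question actually uses, follows the processor count, not the virtual thread property.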

Note that you probably shouldn't set it too high compared to your CPU quota; otherwise you'll likely just give Kubernetes reasons to throttle your pod. Reasonable values will depend on the actual load behaviour of your application.

Ozonide answered 26/2, 2024 at 14:43 Comment(2)
The CPU count surely influences the outcome. But there is an underlying wrong assumption: that the Stream API is using virtual threads. At the moment, there is no way to tell the Stream API to do so. – Emulsify
@Emulsify You're right. Too many moving parts :| – Ozonide

© 2022 - 2025 — McMap. All rights reserved.