Java ForkJoinPool with non-recursive tasks, does work-stealing work?
I want to submit Runnable tasks into ForkJoinPool via a method:

forkJoinPool.submit(Runnable task)

Note, I use JDK 7.
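For reference, a minimal sketch of what such flat submission looks like (class name and task count are made up for illustration; anonymous classes are used since JDK 7 has no lambdas):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.atomic.AtomicInteger;

public class SubmitDemo {
    // Submit n plain Runnables; each is wrapped into a ForkJoinTask internally.
    static int runTasks(int n) {
        ForkJoinPool pool = new ForkJoinPool(); // parallelism defaults to #cores
        final AtomicInteger done = new AtomicInteger();
        List<ForkJoinTask<?>> tasks = new ArrayList<ForkJoinTask<?>>();
        for (int i = 0; i < n; i++) {
            tasks.add(pool.submit(new Runnable() { // JDK 7 style: anonymous class
                public void run() { done.incrementAndGet(); }
            }));
        }
        for (ForkJoinTask<?> t : tasks) {
            t.join(); // wait for each submitted task to complete
        }
        pool.shutdown();
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println("completed: " + runTasks(100)); // prints "completed: 100"
    }
}
```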

Under the hood, they are transformed into ForkJoinTask objects. I know that ForkJoinPool is efficient when a task is split into smaller ones recursively.

Question:

Does work-stealing still work in the ForkJoinPool if there is no recursion?

Is it worth it in this case?

Update 1: Tasks are small and can be unbalanced. Even for strictly equal tasks, things like context switching, thread scheduling, parking, page misses, etc. get in the way, leading to imbalance.

Update 2: Doug Lea gave a hint on this in the Concurrency JSR-166 Interest group:

This also greatly improves throughput when all tasks are async and submitted to the pool rather than forked, which becomes a reasonable way to structure actor frameworks, as well as many plain services that you might otherwise use ThreadPoolExecutor for.

I presume that for reasonably small CPU-bound tasks, ForkJoinPool is the way to go, thanks to this optimization. The point is that these tasks are already small and need no recursive decomposition. Work stealing works regardless of task size: a free worker can grab tasks from the tail of a busy worker's deque.

Update 3: Scalability of ForkJoinPool - the Akka team's ping-pong benchmark shows great results.

That said, applying ForkJoinPool efficiently requires performance tuning.
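One tuning knob relevant to the async case Doug Lea describes is the `asyncMode` flag of the four-argument constructor (available since JDK 7): it switches local queues to FIFO order, which the javadoc recommends for event-style tasks that are never joined. A hedged sketch (the factory method name is made up for illustration):

```java
import java.util.concurrent.ForkJoinPool;

public class AsyncModePool {
    // Illustrative factory: a pool tuned for async, never-joined tasks.
    static ForkJoinPool newAsyncPool() {
        return new ForkJoinPool(
                Runtime.getRuntime().availableProcessors(),      // parallelism
                ForkJoinPool.defaultForkJoinWorkerThreadFactory, // worker factory
                null,   // no custom UncaughtExceptionHandler
                true);  // asyncMode: local queues processed in FIFO order
    }

    public static void main(String[] args) {
        ForkJoinPool pool = newAsyncPool();
        System.out.println("asyncMode = " + pool.getAsyncMode()); // prints "asyncMode = true"
        pool.shutdown();
    }
}
```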

Raynell answered 5/5, 2015 at 7:52 Comment(5)
I have wondered this myself, and I think it's what you mentioned with the parking and locking of the threads. This is why I think ring buffers (Disruptor) are even faster than an unwrapped fork/join (at least for my non-recursive event tasks). – Fascism
@Adam Gent We used Disruptor too - it's amazingly fast. And yes, the idea of a RingBuffer is applied in many modern data structures. However, they differ in use cases: 1. a Disruptor uses a 1 thread - 1 consumer model for pipelining and cannot steal work, whereas 2. FJ is a pool of workers - work is distributed across a number of threads (m:n). – Raynell
I'm familiar with the differences between FJ and ring buffers, but I think they both might be benefiting from less locking; I'm not an expert by any means. – Fascism
My guess is that FJ will at times perform faster for non-recursive parallel tasks because there is less locking than in ThreadPoolExecutor. I tried benchmarking this myself by replacing some of our parallel, CPU-bound ThreadPoolExecutor-based ExecutorServices with FJ, and there was a very negligible improvement, if any at all. – Fascism
@Adam Gent Right - ThreadPoolExecutor-based ExecutorServices use a FIFO blocking queue, which scales poorly on multicore CPUs: multiple threads take/put from the head, resulting in higher contention, unlike the deques in FJ. I think you will see the difference on more CPUs; for 2-4 cores there is no big win. Moreover, as was said, this requires appropriate tuning. – Raynell

The ForkJoinPool source code has a nice section called "Implementation Overview"; read it for the ultimate truth. The explanation below is my understanding as of JDK 8u40.

Since day one, ForkJoinPool has had a work queue per worker thread (let's call them "worker queues"). Forked tasks are pushed onto the local worker queue, ready to be popped by the worker again and executed -- in other words, it looks like a stack from the worker thread's perspective. When a worker depletes its worker queue, it goes around and tries to steal tasks from other worker queues. That is "work stealing".
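A toy model of this two-ended discipline (an assumed simplification for illustration; the real implementation uses a carefully tuned lock-free, array-based deque, not `ConcurrentLinkedDeque`):

```java
import java.util.concurrent.ConcurrentLinkedDeque;

// Toy model of a per-worker queue: the owner treats it as a stack,
// while thieves take from the opposite end.
public class WorkerQueue {
    private final ConcurrentLinkedDeque<Runnable> deque =
            new ConcurrentLinkedDeque<Runnable>();

    // Owner thread: push and pop at the same end (LIFO, stack-like).
    public void push(Runnable task) { deque.addFirst(task); }
    public Runnable pop()           { return deque.pollFirst(); }

    // Thief thread: steal from the other end -- the oldest task.
    public Runnable steal()         { return deque.pollLast(); }
}
```

With this layout, the owner keeps working on the most recently pushed (cache-warm) task, while a thief grabs the oldest one, minimizing contention between the two ends.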

Now, before (IIRC) JDK 7u12, ForkJoinPool had a single global submission queue. When worker threads ran out of local tasks, as well as tasks to steal, they went there to see whether external work was available. In this design, there is no advantage over a regular ThreadPoolExecutor backed by, say, an ArrayBlockingQueue.

That changed significantly afterwards. Once the single submission queue was identified as a serious performance bottleneck, Doug Lea et al. striped the submission queues as well. In hindsight, that is an obvious idea: you can reuse most of the mechanics already available for worker queues. You can even loosely associate these submission queues with worker threads. Now an external submission goes into one of the submission queues. Workers that have no work to munch on can first look into the submission queue associated with them, and then wander around looking into the submission queues of others. One can call that "work stealing" too.

I have seen many workloads benefit from this. This design advantage of ForkJoinPool, even for plain non-recursive tasks, was recognized long ago. Many users on concurrency-interest@ asked for a simple work-stealing executor without all the ForkJoinPool arcana. That is one of the reasons we have Executors.newWorkStealingPool() from JDK 8 onward -- currently delegating to ForkJoinPool, but open to a simpler implementation.

Halsy answered 6/5, 2015 at 18:54 Comment(1)
Can you please elaborate on how Executors.newWorkStealingPool() differs from ForkJoinPool and/or ForkJoinPool.commonPool()? – Lukas
