Java fork/join framework logic

This came up as a "side effect" to an answer on another question today. It's more about curiosity than about an actual problem.

Java SE 7 offers what Oracle calls "the fork/join framework". It is a presumably superior way to schedule work to multiple processors. While I understand how it's supposed to work, I fail to understand the bit where it is superior and the claims made about work stealing.

Maybe someone else has more insight on why this approach would be desirable (other than because it has a fancy name).

The underlying primitives of fork/join are ForkJoinTasks, which are Futures. The idea is to either perform work immediately [sic] below a certain threshold (the wording is misleading, since "immediately" implies that it happens synchronously in the main thread, while in reality it happens inside a Future), or to divide the work into two tasks recursively until the threshold is reached.

A future is a concept that encapsulates a task running asynchronously into an object, in an opaque and unspecified way. You have a function that lets you check whether a result is available, and a function that lets you (wait for, and) retrieve the result.
Strictly speaking, you do not even know whether a future runs asynchronously; it could just as well execute inside get(). The implementation could in theory spawn a thread for every future, or use a thread pool.
In practice, Java implements futures as tasks on a task queue, with a thread pool attached (the same is true for the entire fork/join framework).
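For instance, this task-queue-plus-pool model is what a plain ExecutorService exposes (the class name FutureDemo is of course made up for the example):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureDemo {
    public static void main(String[] args) throws Exception {
        // The pool owns a task queue; submit() enqueues the Callable
        // and hands back a Future as the opaque handle to the result.
        ExecutorService pool = Executors.newFixedThreadPool(2);

        Future<Integer> answer = pool.submit(new Callable<Integer>() {
            public Integer call() {
                return 6 * 7; // runs on a pool thread, not in the caller
            }
        });

        // isDone() checks availability without blocking...
        System.out.println(answer.isDone());
        // ...get() waits for, and retrieves, the result.
        System.out.println(answer.get()); // prints 42

        pool.shutdown();
    }
}
```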

The fork/join documentation gives this concrete usage example:

protected void compute() {
    if (mLength < sThreshold) {
        computeDirectly();
        return;
    }

    int split = mLength / 2;

    invokeAll(new ForkBlur(mSource, mStart, split, mDestination),
              new ForkBlur(mSource, mStart + split, mLength - split,
                           mDestination));
}

This submits tasks to the underlying thread pool's task queue in a manner identical to how mergesort would traverse them (thanks to recursion).
Say, for example, that we have an array of 32 "items" to process, a threshold of 4, and an even split; this produces 8 tasks with 4 "items" each, and looks like this:

00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
                                               .
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15|16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
                       .                       .                       .
00 01 02 03 04 05 06 07|08 09 10 11 12 13 14 15|16 17 18 19 20 21 22 23|24 25 26 27 28 29 30 31
           .           .           .           .           .           .           .
00 01 02 03|04 05 06 07|08 09 10 11|12 13 14 15|16 17 18 19|20 21 22 23|24 25 26 27|28 29 30 31
------------------------------------------------------------------------------------------------
     1     |     2     |     3     |     4     |     5     |     6     |     7     |     8     | 

On a single-core processor, this will submit/execute (in a very complicated way) task groups 1-2-3-4-5-6-7-8 in order.
On a dual-core processor, this will submit/execute (1,3)-(2,4)-(5,7)-(6,8) [1].
On a quad-core processor, this will submit/execute (1,3,5,7)-(2,4,6,8).
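To make the Oracle snippet above concrete, here is a self-contained variant with the same structure: a hypothetical ForkSum (standing in for ForkBlur) that sums 32 items with a threshold of 4, producing exactly the 8 leaf tasks drawn above.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical ForkSum, mirroring ForkBlur's structure: it sums a
// range of an array, splitting in half until at most 4 items remain.
public class ForkSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 4;
    private final int[] source;
    private final int start, length;

    ForkSum(int[] source, int start, int length) {
        this.source = source;
        this.start = start;
        this.length = length;
    }

    @Override
    protected Long compute() {
        if (length <= THRESHOLD) {       // a "leaf" task: do real work
            long sum = 0;
            for (int i = start; i < start + length; i++) {
                sum += source[i];
            }
            return sum;
        }
        int split = length / 2;          // otherwise: divide in two
        ForkSum left = new ForkSum(source, start, split);
        ForkSum right = new ForkSum(source, start + split, length - split);
        invokeAll(left, right);          // submit both halves, wait for both
        return left.join() + right.join();
    }

    public static void main(String[] args) {
        int[] items = new int[32];
        for (int i = 0; i < items.length; i++) {
            items[i] = i;
        }
        long total = new ForkJoinPool().invoke(new ForkSum(items, 0, items.length));
        System.out.println(total); // 0 + 1 + ... + 31 = 496
    }
}
```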

In comparison, a naive implementation without all the superior magic would just submit the tasks 1-2-3-4-5-6-7-8 to the task queue right away. Always.

On a single-core processor, this would submit/execute 1-2-3-4-5-6-7-8.
On a dual-core processor, this would submit/execute (1,2)-(3,4)-(5,6)-(7,8).
On a quad-core processor, this would submit/execute (1,2,3,4)-(5,6,7,8).
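The naive variant described above might look like the following sketch, assuming a plain fixed-size thread pool (the counter merely stands in for whatever the real per-item work would be):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class NaiveSubmit {
    public static void main(String[] args) throws InterruptedException {
        final AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        // Submit tasks 1..8 up front, each covering 4 consecutive "items".
        for (int t = 0; t < 8; t++) {
            final int start = t * 4;
            pool.submit(new Runnable() {
                public void run() {
                    for (int i = start; i < start + 4; i++) {
                        processed.incrementAndGet(); // stand-in for real work
                    }
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println(processed.get()); // prints 32
    }
}
```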

Questions:

  1. Instead of simply cramming sThreshold consecutive items into one task and submitting one task after another to the thread pool's task queue, a tree-like recursion hierarchy is generated. This involves constructing, referencing, and destroying roughly 2N objects for N sub-tasks, the intermediate ones of which factually do nothing. Why is this superior?

  2. No locality of reference is preserved. Neither processor caches nor virtual memory likes being treated like that. Why is this superior?

  3. Except on a uni-processor system, tasks are guaranteed not to be scheduled in an order that is anywhere close to their original order. This may be no issue if it really does not matter, but it makes things like a fence or barrier pretty much infeasible. The only way of having something like a fence is to wait for the root object to complete and only submit new tasks after that. This is equivalent to a complete pipeline stall (which is exactly what you never want to happen).

  4. The Oracle documentation claims that this approach implements work stealing and is therefore better than a thread pool. I don't see this happening. All I can see is a very complicated way of submitting tasks to a plain normal thread pool. How is this supposed to magically implement work stealing?
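To illustrate the fence problem from point 3, the only barrier-like construct I can see is blocking on the root task before submitting the next phase (the Phase class below is a hypothetical stand-in for real work):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Hypothetical sketch of the "fence" from point 3: block on the root
// task before submitting the next phase, stalling the whole pipeline.
public class PhaseBarrier {
    static class Phase extends RecursiveAction {
        private final String name;
        Phase(String name) { this.name = name; }
        @Override
        protected void compute() {
            System.out.println(name + " done"); // real work would go here
        }
    }

    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool();
        pool.invoke(new Phase("phase 1")); // blocks: complete pipeline stall
        pool.invoke(new Phase("phase 2")); // only now may phase 2 start
    }
}
```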


[1] Let's not make it too complicated and assume that worker threads don't outrun each other and that all tasks take the same time to process. Otherwise, execution might of course happen in a different order, though submission would be the same.

Neaten answered 31/8, 2012 at 16:1 Comment(0)

When you use an ExecutorService, you decide how many threads will be in the thread pool, and there is no distinction between the tasks you schedule and the subtasks these tasks create.
The ForkJoinPool class, by contrast, manages threads based on (1) the number of available processors and (2) task demand.
In this case, the subtasks created by active tasks are scheduled by different methods than the external tasks.
We typically have one fork/join pool for an entire application (unlike with an ExecutorService, where it is typical to have more than one in any non-trivial application), and there is no need for shutdown.
I haven't reviewed the internals to give you a more low-level explanation, but if you look here, there is a presentation and a benchmark showing measurements that demonstrate the promised parallelism.
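The sizing difference can be seen directly; a no-arg ForkJoinPool targets the number of available processors, whereas a fixed ExecutorService needs an explicit thread count:

```java
import java.util.concurrent.ForkJoinPool;

public class PoolSizing {
    public static void main(String[] args) {
        // A no-arg ForkJoinPool sizes its parallelism to the number
        // of available processors by default.
        ForkJoinPool pool = new ForkJoinPool();
        System.out.println(pool.getParallelism());
        System.out.println(Runtime.getRuntime().availableProcessors());
    }
}
```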

Update:
This framework addresses a specific kind of problem (ExecutorService works better for tasks that mix CPU and I/O activity).
The basic idea here is to use a recursive, divide-and-conquer approach in order to keep the CPUs constantly busy. New tasks are created (fork) and the current task is suspended until the new tasks complete (join), but without creating new threads and without a shared work queue.
So the fork/join framework is implemented using work stealing, by creating a limited number of worker threads (as many as there are cores). Each worker thread maintains a private double-ended work queue (deque).
When forking, a worker pushes the new task onto the head of its deque. When waiting or idle, a worker pops a task off the head of its own deque and executes it instead of sleeping.
If a worker's deque is empty, it steals a task off the tail of the deque of another, randomly chosen worker.
I would recommend reading Data Parallelism in Java and also doing some benchmarks yourself to be convinced. Theory is good only up to a point; after that, take measurements to see whether there is a significant performance edge.
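The deque discipline described above can be sketched with a plain ArrayDeque (an illustrative toy only; the real ForkJoinPool uses a far more elaborate lock-free implementation, and all names here are made up):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class WorkStealingSketch {
    public static void main(String[] args) {
        Deque<String> ownerDeque = new ArrayDeque<String>();

        // Forking: the owning worker pushes onto the head of its deque.
        ownerDeque.addFirst("task-1");
        ownerDeque.addFirst("task-2");
        ownerDeque.addFirst("task-3");

        // The owner pops from the head (LIFO): the newest, most
        // cache-warm task.
        System.out.println(ownerDeque.pollFirst()); // prints task-3

        // An idle thief steals from the tail (FIFO): the oldest and
        // typically largest remaining task.
        System.out.println(ownerDeque.pollLast());  // prints task-1
    }
}
```

Owner and thief operating on opposite ends is what keeps contention between them low.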

Tally answered 31/8, 2012 at 16:21 Comment(4)
So basically it all boils down to "automatically chooses the number of workers according to the number of cores", and the rest is just a fancy way of writing the submission code plus marketing blah blah? Or, worded a little more positively, it's meant to avoid "too naive" user code with over-threading (i.e. several ExecutorServices with a lot more threads than there are cores). - Neaten
Your description of work stealing (popping tasks off the tail of another thread's queue) is exactly correct. My issue is that I don't see how this relates to recursively subdividing in any way. It would work equally well with the most naive ways of submitting tasks (say, a simple round-robin assignment, or even a shared queue, where it would not do much but reduce lock contention), or for Futures in general. - Neaten
The presentation you linked to suggests it's just as I thought: "Clunky code => people won't bother with it". In other words, the divide-and-conquer method adds nothing in terms of performance or work stealing (that happens independently in the Future implementation). Instead, this approach is a fancy programming model so people actually want to use it. Thank you for the reference and explanation. - Neaten
Quite an old post, but I just stumbled upon this discussion: I don't think that a "fancy programming model" covers it all. (1) The F/J framework provides tasks, which are light-weight at user level and cheaper (less overhead) than the OS scheduling threads on CPUs; (2) work stealing is a pull-based load-balancing mechanism (idle workers steal tasks) and hence superior to work sharing, which can suffer from underutilized workers; and (3) fork/join is a common concept for concurrency management, as we can see with POSIX threads too. It's similar to Intel's TBB, which is a good read BTW. - Subkingdom

Let me start with an article [yes, I wrote it] that critiques the framework: A Java Fork-Join Calamity.

Now to your questions:

  1. It's not. The framework wants to process a DAG. That's the design structure.

  2. It's not. As the article mentions, Java applications know nothing about caches, memory etc. so the assumptions are erroneous.

  3. Yes. That is exactly what happens. Stalls are so common that the framework needs to create "continuation threads" to keep moving. The article references a question here where over 700 continuation threads were needed.

  4. I certainly agree that the code is complex. Scatter-gather works much better than work-stealing for applications. As for documentation, what documentation? There are no details from Oracle. It's all a push to use the framework.

There are alternatives.

Cullis answered 31/8, 2012 at 17:45 Comment(3)
This article and the API docs make the above claims about work stealing. I just don't get how this is supposed to happen. - Neaten
Not sure what you mean by "this"? Work-stealing works well in operating systems. Work-sharing works well in applications. The problem with the F/J framework is that it is an application trying to resemble an operating system, or a pseudo operating system, but without task control, error recovery, documentation, etc. - Cullis
I don't doubt that work stealing works fine, and there isn't even anything fundamentally wrong with implementing work stealing in an application (though it's not what I'd preferably use). However, the docs state that the FW is "distinct because it uses a work-stealing algorithm", which is not something that subdividing recursively can provide (although they make it sound like that). It's an implementation detail of the thread pool implementation (at least, I don't see how it could be any different). - Neaten
