This came up as a "side effect" of an answer to another question today. It's more about curiosity than about an actual problem.
Java SE 7 offers what Oracle calls "the fork/join framework". It is a presumably superior way to schedule work onto multiple processors. While I understand how it is supposed to work, I fail to understand where it is superior, and I fail to understand the claims made about work stealing.
Maybe someone else has more insight on why this approach would be desirable (other than because it has a fancy name).
The underlying primitives of fork/join are ForkJoinTasks, which are Futures. The idea is to either perform work immediately [sic] below a certain threshold, or to divide the work into two tasks recursively until the threshold is reached. (The wording is misleading, since "immediately" implies that the work happens synchronously in the main thread; in reality it happens inside a Future.)
A future is a concept that encapsulates a task running asynchronously into an object, in an opaque and unspecified way. You have a function that lets you verify whether a result is available, and a function that lets you (wait for, and) retrieve the result.
Strictly speaking, you do not even know whether a future runs asynchronously at all; it could execute inside get(). The implementation could, in theory, just as well spawn a thread for every future, or use a thread pool.
In practice, Java implements futures as tasks on a task queue, with a thread pool attached (the same is true for the entire fork/join framework).
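To make that concrete, here is a minimal sketch of plain futures via the standard java.util.concurrent API (ordinary executor code for illustration, not fork/join):

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FutureDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // submit() places the task on the pool's queue and returns a Future.
        Future<Integer> result = pool.submit(new Callable<Integer>() {
            @Override
            public Integer call() {
                return 1 + 1; // stand-in for real work
            }
        });

        // isDone() checks availability without blocking;
        // get() blocks until the result is available.
        System.out.println(result.isDone());
        System.out.println(result.get());

        pool.shutdown();
    }
}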
The fork/join documentation gives this concrete usage example:
protected void compute() {
    if (mLength < sThreshold) {
        computeDirectly();
        return;
    }

    int split = mLength / 2;

    invokeAll(new ForkBlur(mSource, mStart, split, mDestination),
              new ForkBlur(mSource, mStart + split, mLength - split,
                           mDestination));
}
This submits tasks to the underlying thread pool's task queue in a manner identical to how Mergesort would traverse them (thanks to recursion).
Say, for example, that we have an array of 32 "items" to process, a threshold of 4, and an even split. That produces 8 tasks with 4 "items" each, and looks like this:
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
                                               .
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15|16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
                       .                       .                       .
00 01 02 03 04 05 06 07|08 09 10 11 12 13 14 15|16 17 18 19 20 21 22 23|24 25 26 27 28 29 30 31
           .           .           .           .           .           .           .
00 01 02 03|04 05 06 07|08 09 10 11|12 13 14 15|16 17 18 19|20 21 22 23|24 25 26 27|28 29 30 31
-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------
     1     |     2     |     3     |     4     |     5     |     6     |     7     |     8
On a single-core processor, this will submit/execute (in a very complicated way) task groups 1-2-3-4-5-6-7-8 in order.
On a dual-core processor, this will submit/execute (1,3)-(2,4)-(5,7)-(6,8) [1].
On a quad-core processor, this will submit/execute (1,3,5,7)-(2,4,6,8).
In comparison, a naive implementation without all the superior magic would just submit tasks 1-2-3-4-5-6-7-8 to the task queue right away. Always. (A sketch of this naive version follows below.)
On a single-core processor, this would submit/execute 1-2-3-4-5-6-7-8.
On a dual-core processor, this would submit/execute (1,2)-(3,4)-(5,6)-(7,8).
On a quad-core processor, this would submit/execute (1,2,3,4)-(5,6,7,8).
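For reference, this is roughly the naive version I have in mind, as a minimal sketch against a plain ExecutorService. The helper blurDirectly and the exact signatures are made up for illustration; they merely mirror the ForkBlur example above:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class NaiveBlur {
    static final int sThreshold = 4;

    // Hypothetical stand-in for ForkBlur's computeDirectly().
    static void blurDirectly(int[] src, int start, int length, int[] dst) {
        for (int i = start; i < start + length; i++) {
            dst[i] = src[i]; // placeholder for the actual per-item work
        }
    }

    static void blur(final int[] src, final int[] dst) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        // Chunk the input into sThreshold-sized tasks and enqueue them
        // in their natural order: 1-2-3-4-5-6-7-8 for 32 items.
        List<Future<?>> pending = new ArrayList<Future<?>>();
        for (int start = 0; start < src.length; start += sThreshold) {
            final int s = start;
            final int len = Math.min(sThreshold, src.length - s);
            pending.add(pool.submit(new Runnable() {
                @Override
                public void run() {
                    blurDirectly(src, s, len, dst);
                }
            }));
        }

        // Wait for all chunks to finish.
        for (Future<?> f : pending) {
            f.get();
        }
        pool.shutdown();
    }
}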
Questions:
Instead of simply cramming sThreshold consecutive items into one task and submitting one task after another to the thread pool's task queue (as in the naive sketch above), a tree-like recursion hierarchy is generated. This involves constructing, referencing, and destroying on the order of 2N objects for N sub-tasks, of which the roughly N-1 intermediate tasks factually do nothing. Why is this superior?
No locality of reference is preserved. Neither processor caches nor virtual memory take kindly to being treated like that. Why is this superior?
Except on a uni-processor system, tasks are guaranteed not to be scheduled in an order anywhere close to their original order. This may be no issue if it really does not matter, but it makes things such as a fence or barrier pretty much infeasible. The only way of having something like a fence is to wait for the root object to complete and to submit new tasks only after that (see the sketch at the end of this post). This is equivalent to a complete pipeline stall (which is exactly what you don't ever want to happen).
The Oracle documentation claims that this approach implements work stealing and is therefore better than a thread pool. I don't see this happening. All I can see is a very complicated way of submitting tasks to a plain normal thread pool. How is this supposed to magically implement work stealing?
[1] Let's not make it too complicated; let's assume that worker threads don't outrun each other and that all tasks take the same time to process. Otherwise, execution might of course happen in a different order, though submission would be the same.
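For illustration, this is what that root-object "fence" looks like in practice. PhaseOneTask and PhaseTwoTask are hypothetical RecursiveActions standing in for real work; only ForkJoinPool.invoke() itself is real API:

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Hypothetical tasks standing in for real work, split as in ForkBlur.
class PhaseOneTask extends RecursiveAction {
    @Override
    protected void compute() { /* split below threshold, or compute directly */ }
}

class PhaseTwoTask extends RecursiveAction {
    @Override
    protected void compute() { /* must not start before phase one has finished */ }
}

class FenceDemo {
    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool();

        // invoke() blocks until the entire task tree of phase one is done;
        // every worker drains its queue and then idles -- a complete stall.
        pool.invoke(new PhaseOneTask());

        // Only now is it safe to submit phase two.
        pool.invoke(new PhaseTwoTask());
    }
}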