How to multithread "tail call" recursion using TBB

I am trying to use TBB to multithread an existing recursive algorithm. The single-threaded version uses tail-call recursion; structurally, it looks something like this:

void my_func() {
    my_recusive_func (0);
}

bool doSomeWork (int i, int& a, int& b, int& c) {
    // do some work on item i, set a, b and c to the next items to visit,
    // and return true if there is more work to recurse into
}

void my_recusive_func (int i) {
    int a, b, c;
    bool notDone = doSomeWork (i, a, b, c);
    if (notDone) {
        my_recusive_func (a);
        my_recusive_func (b);
        my_recusive_func (c);
    }
}

I am a TBB novice, so my first attempt used the parallel_invoke function:

void my_recusive_func (int i) {
    int a, b, c;
    bool notDone = doSomeWork (i, a, b, c);
    if (notDone) {
        tbb::parallel_invoke (
                [a]{my_recusive_func (a);},
                [b]{my_recusive_func (b);},
                [c]{my_recusive_func (c);});
    }
}

This does work, and it runs faster than the single-threaded version, but it doesn't seem to scale well with the number of cores. The machine I'm targeting has 16 cores (32 hyper-threads), so scalability is very important for this project, but this version only gets about an 8x speedup at best on that machine, and many cores seem idle while the algorithm is running.

My theory is that TBB is waiting for the child tasks to complete after the parallel_invoke, so there may be many tasks sitting around idle, waiting unnecessarily. Would this explain the idle cores? Is there any way to get the parent task to return without waiting for its children? I was thinking of something like this, but I don't know enough about the scheduler yet to know whether it's OK or not:

void my_func()
{
    tbb::task_group g;
    my_recusive_func (0, g);
    g.wait();
}

void my_recusive_func (int i, tbb::task_group& g) {
    int a, b, c;
    bool notDone = doSomeWork (i, a, b, c);
    if (notDone) {
        g.run([a,&g]{my_recusive_func(a, g);});
        g.run([b,&g]{my_recusive_func(b, g);});
        my_recusive_func (c, g);
    }
}

My first question is: is tbb::task_group::run() thread-safe? I couldn't figure that out from the documentation. Also, is there a better way to go about this? Perhaps I should be using the low-level scheduler calls instead?

(I typed this code without compiling so please forgive typos.)

Thermit answered 24/5, 2014 at 13:31 Comment(2)
Your two versions don't do the same thing. The non-parallel version effectively calls my_recusive_func(i+1); my_recusive_func(i+2); my_recusive_func(i+3); The parallel version calls my_recusive_func(i+1) three times since it captures i by value, and shouldn't even compile because you can't modify captured values in the lambda body unless the lambda is declared mutable.Ioves
task_group::run is thread safe.Seductress

There are really two questions here:

  1. Is the TBB implementation of task_group::run thread-safe? Yes. (We should document this more clearly).
  2. Is having many threads invoke method run() on the same task_group scalable? No. (I believe the Microsoft documentation mentions this somewhere.) The reason is that the task_group becomes a centralized point of contention. It's just a fetch-and-add in the implementation, but that's still ultimately unscalable since the affected cache line has to bounce around.

It's generally best to spawn a small number of tasks from a task_group. If using recursive parallelism, give each level its own task_group, though the performance will likely not be any better than using parallel_invoke.
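
For concreteness, a per-level version might look something like the sketch below (reusing doSomeWork from the question; I have not benchmarked this). Each level still waits, but on a task_group that is local to that level, so no single object becomes a point of contention:

#include <tbb/task_group.h>

void my_recusive_func (int i) {
    int a, b, c;
    bool notDone = doSomeWork (i, a, b, c);
    if (notDone) {
        tbb::task_group g;                    // local to this level, so run() is uncontended
        g.run ([a]{ my_recusive_func (a); });
        g.run ([b]{ my_recusive_func (b); });
        my_recusive_func (c);                 // keep one branch on the calling thread
        g.wait ();                            // waits only for this level's two spawned children
    }
}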

The low-level tbb::task interface is the best bet. You can even code the tail recursion in it, using the trick where task::execute returns a pointer to the tail-call task.

But I'm a bit concerned about the idling threads. I'm wondering whether there is enough work to keep the threads busy. Consider doing a work-span analysis first. If you are using the Intel compiler (or gcc 4.9) you might try experimenting with a Cilk version first. If that doesn't speed things up, then even the low-level tbb::task interface is unlikely to help, and the higher-level issues (work and span) need to be examined.
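
A Cilk version is close to a transliteration of the serial code; a sketch (again reusing doSomeWork from the question, and requiring a compiler with Cilk Plus support) might look like:

#include <cilk/cilk.h>

void my_recusive_func (int i) {
    int a, b, c;
    bool notDone = doSomeWork (i, a, b, c);
    if (notDone) {
        cilk_spawn my_recusive_func (a);   // may run in parallel with the rest of this function
        cilk_spawn my_recusive_func (b);
        my_recusive_func (c);
        cilk_sync;                         // implicit at function return; shown for clarity
    }
}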

Apogeotropism answered 27/5, 2014 at 15:20 Comment(6)
Very helpful, thank you. Wouldn't using a task_group at each level require a wait at each level too? I'm trying to avoid that because the recursive function doesn't return anything, so the wait isn't necessary. I would think it would cause a lot of idle tasks, i.e. parents waiting for their children to complete unnecessarily. Maybe I'm wrong to be worried about that? I know the "tail call" optimization made my serial version much faster.Thermit
Would using tbb::combinable<tbb::task_group> (instead of just one task_group) help with the contention problem?Thermit
The tbb::combinable<tbb::task_group> is an interesting suggestion. I don't have any experience with that combination.Apogeotropism
The thread that calls task_group::wait does not necessarily remain idle while there are child tasks. The caller helps out with running any unstarted children, and if there are no unstarted children, starts stealing tasks from other threads. Where a temporary jam can occur is if the children finish while that thread is processing other tasks.Apogeotropism
The low-level tbb::task interface, when used in "continuation passing style", can avoid even the "temporary jam" mentioned above, because it allows writing code that never waits while holding a call stack.Apogeotropism
Thanks so much, by using separate structured_task_groups at each level I'm seeing ~16x speed up on 16 cores vs a single thread. It appears the bottleneck was elsewhere in the code. As you suggested, using one shared task_group did not scale well at all and combinable<task_group> wasn't much better. When I have time I'll try with the task class directly but for now I have to move on to other bottlenecks.Thermit

I'm fairly sure tbb::task_group::run() is thread-safe. I can't find a mention in the documentation, which is quite surprising.

However,

  • This 2008 blog post contains a primitive implementation of task_group, whose run() method is clearly noted to be thread-safe. The current implementation is pretty similar.
  • The testing code for tbb::task_group (in src/test/test_task_group.cpp) comes with a test designed to test the thread-safety of task_group (it spawns a bunch of threads, each of which calls run() a thousand times or more on the same task_group object).
  • The sudoku example code (in examples/task_group/sudoku/sudoku.cpp) that comes with TBB also calls task_group::run from multiple threads in a recursive function, essentially the same way your proposed code is doing.
  • task_group is one of the features shared between TBB and Microsoft's PPL, whose task_group is thread-safe. While the TBB documentation cautions that the behavior can still differ between the TBB and the PPL versions, it would be quite surprising if something as fundamental as thread-safety (and hence the need for external synchronization) is different.
  • tbb::structured_task_group (described as "like a task_group, but has only a subset of the functionality") has an explicit restriction that "Methods run, run_and_wait, cancel, and wait should be called only by the thread that created the structured_task_group".
Ioves answered 24/5, 2014 at 15:10 Comment(1)
I just noticed this in the docs: "If creating more than a small number of concurrent tasks...structure the spawning as a recursive tree." I think that is exactly what I'm doing, which implies it must be thread-safe.Thermit

You could alternatively implement this as follows:

#include <cstdio>
#include <tbb/task.h>

constexpr int END = 10;                  // recursion depth for this self-contained example
constexpr int PARALLEL_LIMIT = END - 4;  // spawn tasks only while i < PARALLEL_LIMIT; deeper levels run serially
static void do_work(int i, int j) {
    printf("%d, %d\n", i, j);
}

static void serial_recusive_func(int i, int j) {
    // DO WORK HERE
    // ...
    do_work(i,j);
    if (i < END) {
        serial_recusive_func(i+1, 0);
        serial_recusive_func(i+1, 1);
        serial_recusive_func(i+1, 2);
    }
}

class RecursiveTask : public tbb::task {
    int i;
    int j;
public:
    RecursiveTask(int i, int j) :
        tbb::task(),
        i(i), j(j)
    {}
    task *execute() override {
        //DO WORK HERE
        //...
        do_work(i,j);
        if (i >= END) return nullptr;
        if (i < PARALLEL_LIMIT) {
            // Continuation-passing style: an empty continuation task stands in
            // for this task and waits on three children, so no thread blocks here.
            auto &c = *new (allocate_continuation()) tbb::empty_task();
            c.set_ref_count(3);   // two spawned children plus this task, recycled below
            spawn(*new(c.allocate_child()) RecursiveTask(i+1, 0));
            spawn(*new(c.allocate_child()) RecursiveTask(i+1, 1));
            // Recycle this task as the third child and return it, so the scheduler
            // executes it immediately (the "return the tail-call task" trick).
            recycle_as_child_of(c);
            i = i+1; j = 2;
            return this;
        } else {
            serial_recusive_func(i+1, 0);
            serial_recusive_func(i+1, 1);
            serial_recusive_func(i+1, 2);
        }
        return nullptr;
    }
};
static void my_func()
{
    tbb::task::spawn_root_and_wait(
        *new(tbb::task::allocate_root()) RecursiveTask(0, 0));
}
int main() {
    my_func();
}

Your question didn't include much information about the "do work here" part, so my implementation doesn't give do_work much opportunity to return a value or to affect the recursion. If you need that, please edit your question to say what sort of effect "do work here" is expected to have on the overall computation.

Synsepalous answered 24/5, 2014 at 15:0 Comment(4)
My real code is a broad-phase collision detection algorithm. "do work here" will be calling a narrow-phase collision detection and response function passed into the algorithm, so it will vary, but it will always be thread-safe and will never return anything.Thermit
This isn't very different from a parallel_invoke. The parent tasks are still waiting for the children before returning.Ioves
Using tbb::task_group seems so much simpler; is this more efficient? Or maybe what I proposed is not allowed? Do you mind explaining the practical differences?Thermit
@atb: task_group is simpler, but it is missing documentation regarding its thread-safety. It is probably ok, but my answer provides an alternative in case you are still worried. The other advantages of my answer are that (since my edit) it does not block at any stage (since execute() returns immediately), and that it does not make large numbers of calls into a top-level function (parallel_invoke must always create/find the relevant contextual information before doing anything).Synsepalous
