How to make a thread pool in C++ with TBB?

I might not be measuring this correctly, but I have some simple code I'm playing with. I don't think it's using a thread pool, because if I make the work unit very large the CPU goes to 190-199% (I have a dual core), but if I make the work units smaller while running many more of them, the CPU only reaches 140-160%.

What I think is going on is that threads are not being pooled but are being destroyed and created on demand, which makes the program slower when the workloads are smaller. I'm using tbb::task_scheduler_init to control the number of threads, but I don't know how to tell TBB to keep the threads alive.

Here's some code to illustrate the problem:

#include <iostream>
#include <tbb/task.h>
#include <tbb/task_group.h>
#include <tbb/task_scheduler_init.h>
#include <boost/thread.hpp>

using namespace tbb;


long fib(long a)
{
    if (a < 2) return 1;
    
    return fib(a - 1) + fib(a - 2);
}

class PrintTask
{
public:
    void operator()()
    {
        //std::cout << "hi world!: " <<  boost::this_thread::get_id() << std::endl;
        
        fib(10);
    }
};

int main()
{
    tbb::task_scheduler_init init(4); // limit the scheduler to 4 threads
    task_group group;
    
    
    for (int i = 0; i < 2000000; ++i)
    {
        group.run(PrintTask());
        //std::cout << i << std::endl;
    }
    
    std::cout << "done" << std::endl;
    group.wait();
    
    return(0);
}

If you change fib(10) to fib(40-45), the work per task becomes large enough that it hits full speed, but with the current setup each job is tiny and there are very many of them.

Note: something I noticed that may be related is that in the above case the program consumes all my memory (4 GB free). Could the slowdown be related to that? And why would this program take all that memory? What is it storing? If I'm just submitting work, isn't there just a queue recording how many times to run, or is an entire task object being kept in memory for each run?

I read the tutorial but I'm still confused by this behavior (although I do get the intended outcome).

Chandelier answered 15/5, 2012 at 14:2 Comment(5)
I'm not quite sure what the question is. It is unlikely that TBB is creating/destroying threads; if you think it might be, check the thread id. – Windle
@Windle Ahhh, you're right... then maybe I have a deeper misunderstanding of threads, because why does smaller work make it run slower? I'm not sharing any data between the threads, so locking shouldn't be a problem. The only thing I could think of was that the threads were taking a while to start (but you're right, the thread ids are the same). – Chandelier
When I try the Java equivalent with a thread pool and executor.submit to send work, it seems to run constantly at near 190%+. – Chandelier
Depending on your code, Java's garbage collector could happily use up to a CPU or so's worth of processing on top of any real work your app is getting done, so comparing CPU utilisation between Java and C++ isn't really telling you anything useful. The only thing which matters with parallelism is: does it get what you want done faster, by elapsed wall-clock time? – Consistence
Having void operator()() non-const will end up with error C3848: "expression having type 'const PrintTask' would lose some const-volatile qualifiers in order to call 'void PrintTask::operator ()(void)'" in MSVC. – Silicify

You shouldn't be worried so much about CPU utilization; rather, look at execution time and speedup vs. the sequential version.

There are a few things that you need to understand about TBB:

You can think of the overhead of scheduling a task as basically constant (it isn't quite, but it's close enough). The smaller the amount of work you schedule, the closer the work per task comes to that constant, and once it gets close you won't see speedups.

Also, threads go idle when they can't find work. I'm guessing that when you call fib(10), the cost of calling run and the necessary task allocations approaches the cost of actually executing the work.

More specifically, if you really do have 2000000 items with the same task you should probably be calling parallel_for or parallel_for_each.

-Rick

Windle answered 15/5, 2012 at 17:34 Comment(1)
Thanks so much, Rick. I'm really unsure how to approach my problem then. I asked a question about how to deal with changing workloads (i.e. when you have thread pools consuming a queue); in the comments you suggested looking into task_group, which seemed to do what I wanted (except for the overhead issue), because parallel_for (I believe) needs to know how much work there is ahead of time. My workload is very similar to this question (lots of small work items producing/consuming the same queue). Using TBB, is there a better way to set it up? – Chandelier

Just to emphasise Rick's last point, the docs for task_group state:

CAUTION: Creating a large number of tasks for a single task_group is not scalable, because task creation becomes a serial bottleneck. If creating more than a small number of concurrent tasks, consider using parallel_for (4.4) or parallel_invoke (4.12) instead, or structure the spawning as a recursive tree.

I'd actually add parallel_do to that list, as it is likely applicable to your don't-know-the-number-of-tasks-up-front use case. Its arguments include a "feeder" which can add more tasks; but again, note the comments in the docs about not dealing with tasks one at a time. Selective quote:

Design your algorithm such that the body often adds more than one piece of work. ... To achieve speedup, the grainsize of B::operator() needs to be on the order of at least ~100,000 clock cycles. Otherwise, the internal overheads of parallel_do swamp the useful work.

If you're looking for examples of non-trivial TBB (beyond parallel_for and the tutorial etc), the TBB patterns manual is very good.

Consistence answered 15/5, 2012 at 19:37 Comment(2)
Thanks so much, timday, you've got me interested in parallel_do. I'm a bit confused about how to run it, though. Do I first give it a range of data to process (from a queue/list) and then use the feeder to continuously add more work? – Chandelier
parallel_do's body is handed individual items from the range passed to parallel_do, rather than a subrange like parallel_for's body. This has some implications for how large the individual tasks should be compute-wise, else overheads will dominate. Anton's ping-ponging parallel_fors on your other question https://mcmap.net/q/2033699/-tbb-for-a-workload-that-keeps-changing might be a better solution in the case of many tiny tasks, from the point of view of minimising overhead per task. – Consistence