Using C++11 thread_local with other parallel libraries
I have a simple question: can C++11 thread_local be used with other parallel programming models?

For example, can I use it within a function while using OpenMP or Intel TBB to parallelize the tasks?

Most such parallel programming models hide hardware threads behind a higher-level API. My instinct is that they all have to map their task schedulers onto hardware threads. Can I expect that C++11 thread_local will have the expected effect?

A simple example is,

void func ()
{
    static thread_local int some_var = init_val;
#pragma omp parallel for [... clauses ...]
    for (int i = 0; i < N; ++i) {
        // access some_var somewhere within the loop
    }
}

Can I expect that each OpenMP thread will access its own copy of some_var?
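For comparison, the behaviour I expect — each thread seeing its own copy — can be demonstrated with plain C++11 std::thread (onto which GCC's libgomp also maps OpenMP threads); whether an OpenMP runtime gives the same result is exactly what is in question. A minimal sketch (the names worker and run are just for illustration):

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

static thread_local int some_var = 0;  // each thread gets its own zero-initialised copy

// Each worker bumps its own copy 1000 times and reports the final value.
void worker(std::vector<int>& results, int idx)
{
    for (int i = 0; i < 1000; ++i)
        ++some_var;                    // no data race: some_var is per-thread
    results[idx] = some_var;           // 1000 if the copy is truly thread-local
}

std::vector<int> run(int nthreads)
{
    std::vector<int> results(nthreads, 0);
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t)
        pool.emplace_back(worker, std::ref(results), t);
    for (auto& th : pool)
        th.join();
    return results;
}
```

If some_var were a plain static instead, the increments would race and the per-thread results would be unpredictable.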

I know that most parallel programming models have their own constructs for thread-local storage. However, having the ability to use C++11 thread_local (or compiler specific keyword) is nice. For example, consider the situation

// may actually be implemented as a class with operator()
void func ()
{
     static thread_local int some_var;
     // a quite complex function
}

void func_omp (int N)
{
#pragma omp parallel for [... clauses ...]
    for (int i = 0; i < N; ++i)
        func();
}

void func_tbb (int N)
{
      tbb::parallel_for(tbb::blocked_range<int>(0, N),
                        [](const tbb::blocked_range<int> &r) {
                            for (int i = r.begin(); i != r.end(); ++i)
                                func();
                        });
}

void func_select (int N)
{
     // At runtime or at compile time, based on which programming model is available,
     // select func_omp or func_tbb
}

The basic idea here is that func may be quite complex, and I want to support multiple parallel programming models. If I use model-specific thread-local constructs, then I have to implement different versions of func, or at least parts of it. However, if I can freely use C++11 thread_local, then in addition to func I only need to implement a few very simple driver functions. For a larger project, things can be simplified further by using templates to write more generic versions of func_omp and func_tbb. However, I am not quite sure it is safe to do so.
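To make the dispatch idea concrete, the selection layer can be sketched as a template that accepts any callable. run_range and run_range_threads below are hypothetical names, not part of OpenMP or TBB; only a plain std::thread fallback backend is shown:

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical fallback backend: a strided parallel-for over [0, n)
// using plain C++11 threads.
template <typename Body>
void run_range_threads(int n, Body body)
{
    unsigned hw = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < hw; ++t)
        pool.emplace_back([=] {
            // each thread handles indices t, t+hw, t+2*hw, ...
            for (int i = static_cast<int>(t); i < n; i += static_cast<int>(hw))
                body(i);
        });
    for (auto& th : pool)
        th.join();
}

// Hypothetical dispatcher: a real build would select an OpenMP or TBB
// backend here (e.g. based on _OPENMP or a configuration macro); only
// the std::thread fallback is wired up in this sketch.
template <typename Body>
void run_range(int n, Body body)
{
    run_range_threads(n, body);
}
```

If func uses thread_local internally, each worker thread driven through run_range would then see its own copy — which is exactly the property whose portability across runtimes is in question.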

Maduro answered 27/1, 2014 at 4:27 Comment(2)
If you would like to use tbb, why not use combinable provided by tbb itself? – Theodore
@Jagnnath I believe I laid out the reason within the question. – Maduro

On the one hand, the OpenMP specification intentionally omits any provisions concerning interoperability with other programming paradigms, so any mixing of C++11 threading with OpenMP is non-standard and vendor-specific. On the other hand, compilers (at least GCC) tend to use the same underlying TLS mechanism to implement OpenMP's #pragma omp threadprivate, C++11's thread_local, and the various compiler-specific storage classes like __thread.

For example, GCC implements its OpenMP runtime (libgomp) entirely on top of the POSIX threads API and implements OpenMP threadprivate by placing the variables in ELF TLS storage. This interoperates with GNU's C++11 implementation, which also uses POSIX threads and places thread_local variables in ELF TLS storage. Ultimately, this interoperates with code that uses the __thread keyword to specify thread-local storage class and with explicit POSIX threads API calls. For example, the following code:

int foo;
#pragma omp threadprivate(foo)

__thread int bar;

thread_local int baz;

int func(void)
{
   return foo + bar + baz;
}

compiles into:

    .globl  foo
    .section        .tbss,"awT",@nobits
    .align 4
    .type   foo, @object
    .size   foo, 4
foo:
    .zero   4
    .globl  bar
    .align 4
    .type   bar, @object
    .size   bar, 4
bar:
    .zero   4
    .globl  baz
    .align 4
    .type   baz, @object
    .size   baz, 4
baz:
    .zero   4

    movl    %fs:foo@tpoff, %edx
    movl    %fs:bar@tpoff, %eax
    addl    %eax, %edx
    movl    %fs:baz@tpoff, %eax

Here the .tbss ELF section is the thread-local BSS (uninitialised data). All three variables are created and accessed in the same way.

Interoperability with other compilers is of less concern right now: Intel's compiler does not implement thread_local, while Clang still lacks OpenMP support.

Stereoisomer answered 27/1, 2014 at 9:11 Comment(4)
Just to be sure that I got the idea right: in general I cannot assume that thread_local (or __thread or similar keywords in other compilers) can be combined with other programming models. In short, the way the compiler implements thread_local variables may or may not be compatible with the way the library creates threads. Now that I think of it, I believe Apple's GCD is not built on top of POSIX threads, yet it can still be used with GCC. – Maduro
What I state is that there is no guarantee that the OpenMP runtime creates threads the same way C++11 does, or that the TLS storage is implemented the same way. For example, the Intel C/C++ compiler implements __thread and #pragma omp threadprivate differently by default. OpenMP threads can still access thread-local variables created with __thread, but the semantics might differ. Apple's GCD is built on top of POSIX threads too - see the source code. – Stereoisomer
So in short, because the semantics might differ, the behaviour might sometimes be surprising? – Maduro
I'm only saying that it is very compiler-dependent and might result in non-portable programs if utilised. Expect it to work with GCC, though. – Stereoisomer

My answer is restricted to TBB and Cilk Plus. For TBB, using thread_local will not cause any surprises since TBB is designed to be a library-based solution that uses the platform's threading.

For Cilk Plus, thread_local should be avoided, because it can cause surprises due to "continuation stealing" and "greedy scheduling"; see N3872 for what those terms mean. See also N3487 for why thread-local variables are problematic for parallelism. Instead, in Cilk Plus, use reducers, which are different from thread-local storage.
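The reducer/combinable idea — a private value per worker, merged explicitly at the end — can be sketched portably. CombinableSum below is a hypothetical illustration of the concept, not the tbb::combinable or Cilk reducer API, built from thread_local plus explicit slot registration:

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical combinable-style accumulator: each thread lazily registers
// its own slot via thread_local, and combine() merges all slots under a lock.
// (A real implementation would key slots per instance; this sketch supports
// a single accumulator for simplicity.)
class CombinableSum {
public:
    void add(long v)
    {
        long*& slot = local_slot();
        if (!slot) {                       // first touch from this thread
            std::lock_guard<std::mutex> lock(mtx_);
            slots_.push_back(new long(0));
            slot = slots_.back();
        }
        *slot += v;                        // no lock needed: the slot is per-thread
    }

    long combine() const                   // merge every thread's contribution
    {
        std::lock_guard<std::mutex> lock(mtx_);
        long total = 0;
        for (long* s : slots_)
            total += *s;
        return total;
    }

    ~CombinableSum()
    {
        for (long* s : slots_)
            delete s;
    }

private:
    static long*& local_slot()
    {
        static thread_local long* slot = nullptr;
        return slot;
    }
    mutable std::mutex mtx_;
    std::vector<long*> slots_;
};
```

The crucial difference from raw thread-local storage is that the merge step is explicit, so the result does not depend on which worker thread executed which portion of the computation — the property that continuation-stealing schedulers break for plain thread_local.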

Hydrochloride answered 27/1, 2014 at 15:50 Comment(0)
