Thread-safe lazy initialization: static vs std::call_once vs double-checked locking

For thread-safe lazy initialization, should one prefer a static variable inside a function, std::call_once, or explicit double-checked locking? Are there any meaningful differences?

All three can be seen in this question.

Double-Checked Lock Singleton in C++11

Two treatments of double-checked locking in C++11 turn up in a Google search.

Anthony Williams shows both double-checked locking with explicit memory ordering and std::call_once. He doesn't mention static, but that article may have been written before C++11 compilers were available.

Jeff Preshing, in an extensive writeup, describes several variations of double-checked locking. He does mention using a static variable as an option, and he even shows that compilers will generate double-checked-locking code to initialize a static variable. It's not clear to me whether he concludes that one way is better than the other.

I get the sense that both articles are meant to be pedagogical and that there's no reason to write double-checked locking by hand: the compiler or the library will do it for you if you use a static variable or std::call_once.
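
For concreteness, here is a minimal sketch of the two "let it be done for you" forms (the names X, instance_static, and instance_call_once are illustrative, not from the cited articles):

#include <mutex>

struct X { /* expensive-to-construct state */ };

// 1. Function-local static: the compiler emits the synchronization
// (thread-safe since C++11).
X& instance_static() {
    static X x;
    return x;
}

// 2. std::call_once: the standard library provides the synchronization.
X& instance_call_once() {
    static X* p = nullptr;       // constant-initialized, no guard needed
    static std::once_flag flag;
    std::call_once(flag, [] { p = new X; });
    return *p;
}

A hand-written double-checked lock, the third option, is sketched in the answer below.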

Gaucho answered 24/9, 2014 at 9:43 Comment(3)
Be forewarned that VC++ is running behind on thread-safe function-local statics. They are not in VS2013 but are reported to be in VS2014: blogs.msdn.com/b/vcblog/archive/2014/06/11/…Arabeila
On the other hand, GCC can make local statics faster than call_once or double-checked locking because it can use platform-specific tricks to avoid any atomic operations at all.Individually
@CortAmmon If you post that as an answer with some evidence I'd accept.Gaucho

GCC uses platform-specific tricks to avoid atomic operations entirely on the fast path, leveraging the fact that it can analyze a static better than it can analyze call_once or hand-written double-checking.

Because double-checking uses atomics as its method of avoiding races, it has to pay the price of an acquire every time. It's not a high price, but it's a price.

It has to pay this because atomics have to remain atomic in all cases, even for difficult operations like compare-exchange. That makes them very hard to optimize out. Generally speaking, the compiler has to leave the atomic in, just in case you use the variable for more than the double-checked lock. It has no easy way of proving that you never apply one of the more complicated operations to your atomic.
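
For reference, a minimal sketch of the pattern under discussion, showing the per-call acquire (the names are illustrative):

#include <atomic>
#include <mutex>

struct X { /* ... */ };

std::atomic<X*> g_instance{nullptr};
std::mutex g_mutex;

X* get_instance() {
    // This acquire-load is paid on every call, initialized or not.
    X* p = g_instance.load(std::memory_order_acquire);
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(g_mutex);
        p = g_instance.load(std::memory_order_relaxed);
        if (p == nullptr) {
            p = new X;
            g_instance.store(p, std::memory_order_release);
        }
    }
    return p;
}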

On the other hand, static is highly specialized, and part of the language. It was designed, from the start, so that its initialization is easy to prove correct. Accordingly, the compiler can take shortcuts that are not available to the more generic constructs. The compiler actually emits the following code for a static:

a simple function:

void foo() {
    static X x;
}

is rewritten inside GCC to (in pseudocode):

void foo() {
    static X x;                    // storage reserved; not yet constructed
    static guard x_is_initialized;
    if ( __cxa_guard_acquire(&x_is_initialized) ) {
        X::X(&x);                              // run the constructor in place
        __cxa_guard_release(&x_is_initialized); // marks the guard as set
    }
}

Which looks a lot like a double-checked lock. However, the compiler gets to cheat a little here. It knows the user can never use a __cxa_guard directly. It knows the guard is only used in the special circumstances where the compiler chooses to emit it. Thus, with that extra information, it can save some time. The __cxa_guard specifications, scattered across platform ABIs as they are, all share a common rule: __cxa_guard_acquire will never modify the first byte of the guard, and __cxa_guard_release will set it to non-zero.

This means each guard is monotonic, and the specification pins down exactly which operations may change it. Accordingly, the implementation can take advantage of existing race protections within the host platform. On x86, for instance, the LL/SS ordering guaranteed by that strongly synchronized CPU family turns out to be enough for this acquire/release pattern, so the emitted code can do a raw read of that first byte when it does its double-checking, rather than an acquire-read in the C++ atomics sense. This is only possible because GCC isn't using the C++ atomic API to do its double-checking -- it is using a platform-specific approach.
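
A hedged sketch of that fast path, assuming a strongly ordered target where plain single-byte loads and stores do not tear (the names guard_acquire/guard_release and the mutex-based slow path are illustrative; the real libstdc++ implementation differs):

#include <mutex>

namespace {
std::mutex g_guard_mutex;   // real implementations block on a futex/condvar
}

// Hypothetical stand-in for __cxa_guard_acquire, not the real symbol.
int guard_acquire(char* guard) {
    // Fast path: a plain, non-atomic read of the first byte. This is a
    // data race by the letter of the C++ memory model, but the compiler
    // knows release is the sole writer of this byte and that the target
    // orders the initializing stores before it.
    if (*guard != 0)
        return 0;                      // already initialized
    g_guard_mutex.lock();
    if (*guard != 0) {                 // re-check under the lock
        g_guard_mutex.unlock();
        return 0;                      // another thread got there first
    }
    return 1;                          // caller must run the constructor
}

// Hypothetical stand-in for __cxa_guard_release.
void guard_release(char* guard) {
    *guard = 1;                        // publish "initialized"
    g_guard_mutex.unlock();
}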

GCC cannot optimize out the atomic in the general case. On architectures which are designed to be less synchronized (such as those designed for 1024+ cores), GCC doesn't get to rely on the architecture to provide LL/SS ordering for it, so it is forced to actually emit the atomic. But on common platforms such as x86 and x64, the static guard can be faster.

call_once can match the efficiency of GCC's statics, because it similarly limits the set of operations that may be applied to a once_flag to a small fraction of what can be applied to an atomic. The tradeoff is that statics are far more convenient to use when they are applicable, but call_once works in many cases where statics are insufficient, such as a once_flag owned by a dynamically allocated object (sketched below).
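
A brief sketch of that case, one once_flag per object rather than per function, which a function-local static cannot express (the class and member names are illustrative):

#include <mutex>

class Connection {
    std::once_flag init_flag_;    // one flag per object, not per function

    void do_handshake() { /* expensive one-time setup */ }

public:
    void ensure_ready() {
        // Runs do_handshake exactly once per Connection instance, no
        // matter how many threads call ensure_ready concurrently.
        std::call_once(init_flag_, &Connection::do_handshake, this);
    }
};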

There is a slight difference in performance between static and call_once on those more weakly ordered platforms. Many of them, while not offering LL/SS, will at least offer non-tearing reads of an integer. Such platforms can use this, plus a thread-specific pointer, to do per-thread epoch counting and avoid atomics. This is sufficient for static or call_once, but it depends on the counter not rolling over. If you do not have a tearing-free 64-bit integer, call_once has to worry about rollover. The implementation may or may not worry about this: if it ignores the issue, it can be as fast as statics; if it pays attention to it, it has to be as slow as atomics. The compiler knows at compile time how many static variables/blocks there are, so it can prove at compile time that there is no rollover (or at least be darn confident!).
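
To make the epoch trick concrete, a heavily hedged sketch, assuming plain loads of a uint64_t never tear on the target (this is not portable C++, and run_once, Once, and the counters are invented names, not any library's API):

#include <cstdint>
#include <mutex>

uint64_t g_epoch = 0;                 // bumped under g_mu whenever any
std::mutex g_mu;                      // once-initialization completes
thread_local uint64_t t_seen = 0;     // last epoch this thread synced to

struct Once { uint64_t done_epoch = 0; };   // 0 = not yet run

template <class F>
void run_once(Once& once, F f) {
    // Fast path, no atomics: if the initialization completed at an epoch
    // this thread has already synchronized past (t_seen was written under
    // the lock), its effects are already visible to us. A 64-bit counter
    // makes rollover a non-issue, which is the point made above.
    uint64_t e = once.done_epoch;     // plain, assumed non-tearing read
    if (e != 0 && e <= t_seen)
        return;
    std::lock_guard<std::mutex> lock(g_mu);  // slow path: full synchronization
    if (once.done_epoch == 0) {
        f();
        once.done_epoch = ++g_epoch;
    }
    t_seen = g_epoch;
}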

Individually answered 29/11, 2014 at 20:16 Comment(10)
I realize I'm several years late to this excellent answer. For the uninitiated, what do you mean by "LL/SS protection"?Parang
@Parang LL/SS is "LoadLoad/StoreStore". It states that two loads of a memory address are guaranteed to occur in order, and two stores to a memory address are guaranteed to occur in order. This guarantee, however, does not say anything about the ordering of loads and stores with respect to each other.Individually
Thanks for the clarification. Another question: I tried loading your foo example above into godbolt.org, and under GCC I am seeing the call to acquire happening before the comparison. Is this expected? godbolt.org/g/oV65PLParang
@Parang That does line up with what the code should do. If I'm reading the assembly output of godbolt the same as you are, the comparison you mention is on the return value of the acquire. Re-reading my wording, I don't think the paragraph after the code lines up quite as well as it should. What I intended to say was within the call to acquire, gcc gets to do a native read (rather than an atomic read) of the value on many platforms. In the proper C++ atomic memory model that is not safe, but gcc knows which platform you are on and which platforms it is safe to use that trick.Individually
Thanks for the clarification, I really appreciate it.Parang
I'm kind of confused regarding this answer. On x86, we always get a simple mov instruction + branch, regardless of whether the load is raw or acquire, right? Yet you're claiming a raw read would be faster? Why? Do you say that because the compiler could theoretically somehow re-order any post-initialization loads to before the initialization? Wouldn't the effect of that be absolutely minuscule if it did that, especially given branch prediction would let the CPU keep executing past the check anyway? I feel like I'd have to see a demonstration of this to believe it.Elayneelazaro
@Elayneelazaro A raw read is a mov. An acquire read is a lock mov, which takes more time because it has to look at the caches of the other cores. My own micro-benchmarks peg the cost somewhere on the order of 10 cycles (on one machine, with a particular test). Other sites I've looked at agree with that general estimate.Individually
@CortAmmon: Confused, what do you mean by lock mov? There is no such instruction emitted on x86 for an acquire read. Example: godbolt.org/z/nxxhTfdqeElayneelazaro
@Elayneelazaro Okay, now I see what you're looking at. You're right, there's no lock prefix. Which actually surprises me greatly; I need to stew on that. Even with memory_order_seq_cst, it seems like the x86 memory-ordering rules are considered sufficient without locking. I was able to force it to do the lock by doing a += 5, but that's a different operation altogether. Now I need to ponder why my microbenchmarks might be so slow.Individually
@CortAmmon: If you can share your benchmarks I could help look at them, but yeah - I'm not sure what you were measuring otherwise.Elayneelazaro
