Why doesn't the libc++ shared_ptr implementation split acq_rel fetch_sub into release fetch_sub and acquire fence?
For simplicity, the libc++ shared_ptr implementation's release() can be depicted as:

void release()
{
    if (ref_count.fetch_sub(1, std::memory_order_acq_rel) == 1)
    {
        delete this;
    }
}

Why doesn't libc++ split it into a release decrement and an acquire fence?

void release()
{
    if (ref_count.fetch_sub(1, std::memory_order_release) == 1)
    {
        std::atomic_thread_fence(std::memory_order_acquire);
        delete this;
    }
}

as Boost recommends, which looks superior because it doesn't impose acquire memory order on any decrement except the last one.

Selena answered 17/6, 2022 at 14:42 Comment(9)
Related: #72607395. Comments suggest that in some cases, a relaxed load + acquire fence can be more expensive than an acquire load. So it may depend on which is considered to be the more common case. I do imagine there is a lot of code that uses shared_ptr but never actually shares it. – Younger
Thanks @NateEldredge, this may be a reason. I'm more interested in whether it's correct to use a release decrement + acquire fence for the shared_ptr's reference controller. I'm working with a shared_ptr implementation that uses this, and TSAN reports a race between the T1 fetch_sub and the T2 destruction of the reference controller. – Selena
The link is to an archived old version of libc++. The current one is at github.com/llvm/llvm-project/blob/main/libcxx/include/__memory/…, but nothing has changed about the approach. – Colloid
It's possibly related to std::atomic_thread_fence being a more restrictive fence than one tied to an operation. – Pantry
@DrewDormann theoretically, or is there any platform where it's faster than two acquire loads? "Two" comes from the assumption that there are usually at least two refs. – Selena
@NateEldredge 'I do imagine there is a lot of code that uses shared_ptr but never actually shares it.' – well, I really hope this didn't have any influence on the decision! Using a tool wrongly, even if the majority does, should never punish those using it correctly... – Klement
@Aconcagua: That's true, but in a program where some shared_ptr objects are actually shared, but many others aren't, the non-shared case is still the common one. The code-gen would still have to be correct for actually-shared objects, though, so you can't just avoid the work entirely. – Durable
Both versions are certainly correct, and you could write out a proof if you are concerned. TSAN doesn't properly recognize the effect of fences (see #70543493), so it's not surprising that it would report a false positive data race. – Younger
But if one were going to choose based on performance, it would have to be a heuristic. On a machine where an acquire fence is slower than an acquire load, one would have to trade it off against the relative frequency of the two cases, and I think it's very plausible that the unconditional acquire load could be faster on average than the conditional acquire fence. And of course, on other machines it makes no difference at all: on x86 every atomic RMW is automatically seq_cst and std::atomic_thread_fence(std::memory_order_acquire) is a no-op. – Younger
