Why doesn't the libc++ shared_ptr implementation split acq_rel fetch_sub into release fetch_sub and acquire fence?
For simplicity, the libc++ shared_ptr implementation's release() can be depicted as:

void release()
{
    if (ref_count.fetch_sub(1, std::memory_order_acq_rel) == 1)
    {
        delete this;
    }
}

Why doesn't libc++ split it into a release decrement and an acquire fence?

void release()
{
    if (ref_count.fetch_sub(1, std::memory_order_release) == 1)
    {
        std::atomic_thread_fence(std::memory_order_acquire);
        delete this;
    }
}

as Boost recommends, which looks superior because it doesn't impose acquire memory order on any decrement except the last one.

Selena answered 17/6, 2022 at 14:42 Comment(9)
Related: #72607395. Comments suggest that in some cases, a relaxed load + acquire fence can be more expensive than an acquire load. So it may depend on which is considered to be the more common case. I do imagine there is a lot of code that uses shared_ptr but never actually shares it. – Younger
Thanks @NateEldredge, this may be a reason. I'm more interested in whether it's correct to use a release decrement + acquire fence for the shared_ptr's reference controller. I'm working with a shared_ptr implementation that uses this, and TSAN reports a race between the T1 fetch_sub and the T2 destruction of the reference controller. – Selena
The link is to an archived old version of libc++. The current one is at github.com/llvm/llvm-project/blob/main/libcxx/include/__memory/…, but nothing has changed about the approach. – Colloid
It's possibly related to std::atomic_thread_fence being a more restrictive fence than one tied to an operation. – Pantry
@DrewDormann theoretically, or is there any platform where it's faster than two acquire loads? "Two" comes from the assumption that there are usually at least two refs. – Selena
@NateEldredge 'I do imagine there is a lot of code that uses shared_ptr but never actually shares it.' – well, I really hope this didn't have any influence on the decision! Using a tool wrongly, even if the majority does, should never punish those using it correctly... – Klement
@Aconcagua: That's true, but in a program where some shared_ptr objects are actually shared, but many others aren't, the non-shared case is still the common one. The code-gen would still have to be correct for actually-shared objects, though, so you can't just avoid the work entirely. – Durable
Both versions are certainly correct, and you could write out a proof if you are concerned. TSAN doesn't properly recognize the effect of fences (see #70543493), so it's not surprising that it would report a false positive data race. – Younger
But if one were going to choose based on performance, it would have to be a heuristic. On a machine where an acquire fence is slower than an acquire load, one would have to trade it off against the relative frequency of the two cases, and I think it's very plausible that the unconditional acquire load could be faster on average than the conditional acquire fence. And of course, on other machines it makes no difference at all: on x86 every atomic RMW is automatically seq_cst and std::atomic_thread_fence(std::memory_order_acquire) is a no-op. – Younger
