Comparing 2 methods of C++ ref-counting

To write a ref-counted class, I have seen 2 different approaches:

Approach 1:

#include <atomic>

struct RefCounting1 {
  void ref_up() {
    m_ref.fetch_add(1, std::memory_order_relaxed);
  }

  void release() {
    if (m_ref.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      delete this;
    }
  }

private:
  std::atomic_int m_ref{1};
};

Approach 2:

struct RefCounting2 {
  void ref_up() {
    m_ref.fetch_add(1, std::memory_order_relaxed);
  }

  void release() {
    int refcnt = m_ref.fetch_sub(1, std::memory_order_relaxed) - 1;
    if (refcnt == 0)
      std::atomic_thread_fence(std::memory_order_acquire);

    if (refcnt > 0)
      return;

    delete this;
  }

private:
  std::atomic_int m_ref{1};
};

As you can see, the first one uses acquire-release memory order to decrement the reference counter, while the second one uses a relaxed decrement plus an acquire fence to protect the non-atomic data. I wanted to know whether these 2 methods have the same effect or not. Is there any benefit of one over the other?

Headman answered 27/2, 2023 at 20:46 Comment(17)
std::shared_ptr is already reference-counted, so I don't see a compelling reason to roll your own.Bureaucratize
@Bureaucratize Where I'm using this class, everything is passed via pointers. A shared_ptr's reference count is increased on copy. As a result, shared_ptr logic cannot be used unless I use tricks like keeping a shared pointer to self inside the class, which I hate. So I need to add a small ref-counting mechanism.Headman
That sounds suspiciously like what std::weak_ptr is for ( en.cppreference.com/w/cpp/memory/weak_ptr ), but perhaps I don't understand your design well enough.Bureaucratize
It depends on the CPU, but to me approach 2 is slightly faster on average. On Intel x86, with its strong memory model, it doesn't matter, but on ARM, for example, the atomic operation itself is quick while the memory synchronization is quite slow. In the second approach, the synchronization is done only when really needed, before deletion.Weis
@Bureaucratize my design is somewhat stuck to a C library and that's why I have some limitations. I have used weak_ptr before, but I could not find a way to resolve my problem with it here. The only approach I found was to keep a copy of a shared_ptr inside my class to prevent it from being deleted, and when I need to delete, set this copy to nullptr. But this is not an approach that I like. That's why I decided to add ref-counting to my class.Headman
You need to consider that an atomic operation with a memory order is not the same as relaxed memory ordering followed by an explicit fence. Make sure to read preshing.com/20131125/…. [Note: I don't know if your design is wrong, or whether this'll mean you need to change your design, but make sure you consider this]Outgo
On my code base, I see that the release method does a decrement with memory_order_release, and then there is a std::atomic_thread_fence(std::memory_order_acquire); if the retain count reaches 0. I am unsure it will work with memory_order_relaxed in release; at least my conclusion when I wrote that part was no.Weis
@Weis I'm pretty sure it should work, because my 2nd code is based on OpenSSL reference counting. You can see CRYPTO_UP_REF and CRYPTO_DOWN_REF here: github.com/openssl/openssl/blob/master/include/internal/…Headman
A different way to avoid using a fence [fences are in general much harder to reason about than operations] in the second example is, when refcnt == 0, to do another explicit m_ref.load(acquire) to create an artificial atomic operation.Outgo
@MikeVine Optimizers may remove m_ref.load(acquire), won't they? Because if the result is not used, it may be a candidate for optimization....Headman
@Headman I hope OpenSSL is correct. To be honest, I didn't try to guess that at the time, and looked at several shared_ptr implementations (I don't remember precisely which ones now). And I saw that memory_order_release followed by memory_order_acquire was a common pattern. I wish I had the definitive answer to that question.Weis
@Bureaucratize "std::shared_ptr is already reference-counted" 1) The Q doesn't even mention shared smart ptrs, or memory management. There are many uses of RC for non-memory resources, and where we wouldn't even know what "ptr semantics" would mean, so using a memory management device/owning smart ptr wouldn't make sense (like a file handle). 2) It's important to understand what memory ordering really does and to be able to reimplement classical, basic tools, at least at a simple level (not w/ all the std classes' functions).Swob
@curiousguy, The code he's using is doing reference counting and deleting objects when the reference count goes to zero. That's exactly what std::shared_ptr is for. You can also use std::shared_ptr to manage resources other than memory by using a custom deleter. As for your file handle example, std::fstream does in fact close the associated file in its destructor, when the std::fstream object goes out of scope (another example of managing a non-memory resource). Yes, understanding how things work is important. The Q didn't mention doing this for learning purposes.Bureaucratize
@Bureaucratize I'm sure you are not exhibiting iostream as an example of sound design? What if you had to hide a Unix fileno with a "smart object" (with RC)? Would you give it pointer-like semantics with operator* and operator->?Swob
@MikeVine "atomic operation with a memory order is not the same as relaxed memory ordering followed by an explicit fence" I have trouble following your comment. Are you (a) saying that you can't compensate for the change of any acq/rel/acq+rel MO to a relaxed MO with additional fences, or (b) that the use of a fence by the OP doesn't compensate for the relaxed MO of the atomic operation?Swob
Control structures style comment: if (refcnt == 0) blah; if (refcnt > 0) return; bloh is really ugly. I am not saying stick to SESE (single entry single exit), but I had to read it back multiple times!Swob
@curiousguy: To me, most of what makes the branching ugly is that the if (refcnt == 0) fence could just be an unconditional fence after the early-out if()return, or it could be an if/else. It's just weird that both the fence and delete this run in the same case (assuming refcnt can't ever be negative, in which case you'd delete without having done a fence), but are on opposite sides of something else, and are controlled in different ways.Pergrim

As you can see, the first one uses acquire-release memory order to decrement the reference counter, while the second one uses a relaxed decrement plus an acquire fence to protect the non-atomic data.

You can't have acquire semantics without release semantics: you can only acquire (obtain a secure view of) what others have released. Release memory order is like "publish finished memory operations of the current thread" and acquire memory order is like "get the latest view of finished memory operations by another thread".
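
A minimal sketch of that publish/observe pairing (the payload/ready names are invented for the example, not taken from the question):

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                  // non-atomic data to publish
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // ordinary write
    ready.store(true, std::memory_order_release);  // publish it
}

void consumer() {
    if (ready.load(std::memory_order_acquire)) {   // pairs with the release store
        assert(payload == 42);                     // guaranteed to see the write
    }
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}

If the consumer's load reads true, it synchronizes-with the producer's store, so the write to payload is visible; with relaxed ordering on both sides the assert could fire.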

[Why do you call your reset function release in a discussion of memory orders? Not nice!]

I wanted to know whether these 2 methods have the same effect or not. Is there any benefit of one over the other?

They don't have the same effect: the first one, the classical RC implementation (RefCounting1), is valid, and your optimized alternative (RefCounting2) is broken.

You have to express the minimum synchronization needed, the axioms of the reference-counted shared resource. The resource must be deallocated when the last manager class destructor (or reset member function) runs, that is, after all the other destructors. There is an implied ordering: the event "RC reaches zero" must come after all the others. That doesn't mean anyone cares that each RC atomic operation is well ordered WRT all the others, just that the last one is the last one.

RefCounting1 makes sure that is the case by ordering all the reset operations, which is overkill.

The following should create the necessary ordering:

void reset() { // not "release", please
    int refcnt = m_ref.fetch_sub(1, std::memory_order_release) - 1;
    if (refcnt == 0) {
        std::atomic_thread_fence(std::memory_order_acquire);
        delete this;
    }
}

Here the last acquire is paired with all previous releases.
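
For concreteness, a hypothetical end-to-end sketch built around the reset() above (the RefCounted/data names are invented for the example): whichever thread performs the final decrement is guaranteed to observe the other thread's write before the destructor runs.

#include <atomic>
#include <thread>

struct RefCounted {
    int data[2] = {0, 0};  // non-atomic payload, one slot per thread

    void ref_up() { m_ref.fetch_add(1, std::memory_order_relaxed); }

    void reset() {
        int refcnt = m_ref.fetch_sub(1, std::memory_order_release) - 1;
        if (refcnt == 0) {
            std::atomic_thread_fence(std::memory_order_acquire);
            delete this;  // ordered after both threads' writes to data
        }
    }

private:
    std::atomic_int m_ref{1};
};

int main() {
    RefCounted* p = new RefCounted;  // count == 1, held by main
    p->ref_up();                     // count == 2: one reference per worker thread
    std::thread t1([p] { p->data[0] = 1; p->reset(); });
    std::thread t2([p] { p->data[1] = 2; p->reset(); });
    t1.join();
    t2.join();  // whichever reset() ran last has deleted p
}

The acquire fence in the deleting thread pairs with the release fetch_sub of the other thread, so neither write to data can be lost or torn by the destruction.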

Note: lack of ordering on the count increase can make the exact multiplicity of the value of use_count compiler- and machine-dependent in MT programs if one thread makes many local copies (and the compiler can accurately track these with escape analysis); in an extreme case, the compiler could remove redundant thread-local shared_ptr instances that do nothing but change the count, transforming these spread-out actions:

m_ref.fetch_add(1, std::memory_order_relaxed); // N times
m_ref.fetch_sub(1, std::memory_order_release); // N times

to:

/* m_ref.fetch_add(1, std::memory_order_relaxed); // 0 times */
m_ref.fetch_sub(0, std::memory_order_release); // 1 time

Assuming no operation with a memory order in between.

Note:

  • m_ref.fetch_sub(0, std::memory_order_release) can be compiled as a simple memory fence, but one might want to keep the explicit operation on a specific object in the intermediate code for as long as possible, until all optimizing phases involving those operations are finished;
  • the compiler can try to push m_ref.fetch_sub(0, std::memory_order_release) as late as possible in program order, until it reaches a point where a release fence must be emitted anyway, and then simply remove the fetch_sub operation.

The optimization is trivially sound and clearly a win; the difficulty is mostly in following all the functions called to see that there is nothing "in between" that breaks the optimization.

Note: To avoid breaking progress bars and similar, and, even more critically, time-dependent programs (think: heartbeat), such optimizations should be avoided in functions doing heavy computations, anything that runs long enough for the reordering to be noticeable.

The possible optimization makes the value of use_count less precise, but not totally random and unreliable. There is a hard lower bound on use_count in any shared (between threads) shared_ptr family (those that are copies of each other and share a control block), if the program is correctly synchronized.

Contrapositive: if you can't prove there is a lower bound on such a weakly synchronized reference count in an MT program, your program may lack synchronization on shared_ptr objects.

In other words: your program must contain synchronization to share shared_ptr families between threads, because the only way to do that is to share the value of one particular shared_ptr instance between threads.

Swob answered 28/2, 2023 at 2:52 Comment(8)
From "your optimized alternative (RefCounting2) is broken.", do you mean OpenSSL is broken? Because as I mentioned in the comments, the 2nd one is based on the OpenSSL implementation of ref counting, unless I messed up something badly, which I don't think I did.Headman
Do you confirm that ^F memory_order_release in OpenSSL brings nothing?Swob
Why do you think it brings nothing? Because if that ref counting does not work correctly, releasing objects as you mentioned should be faulty, shouldn't it? For example, the RSA object is ref-counted. You can check the source code here: github.com/openssl/openssl/blob/master/crypto/rsa/rsa_lib.c (methods: RSA_free and RSA_up_ref). As you see, it uses those methods for ref counting. If CRYPTO_DOWN_REF does not work correctly, releasing the BN objects in the RSA struct will create problems, won't it?Headman
^F is control-F, i.e. search; when I wrote "brings nothing", I meant: does "memory_order_release" appear in the source code? Sorry for the misunderstanding.Swob
Yes, there is no memory_order_release. But there is a comment there explaining why: github.com/openssl/openssl/blob/…Headman
I'm baffled. "Changes to shared structure other than reference counter have to be serialized" They work under the assumption that there is a shared structure whose modifications are serialized... I have no idea what that structure is; the pointed-to structure will extremely often be immutable by design, and even if it's protected by a mutex, the accesses that are serialized have nothing to do with the reset()/destruction of distinct shared_ptr instances. It's utterly illogical. If it (what?) is serialized, why even bother with atomic modifications? The programmer writing that wasn't thinking.Swob
"Since it's last use of the object, destructor programmer might reason that access to mutable members doesn't have to be serialized anymore" That next comment is nonsense. 1) there is no reason to expect a non trivial dtor; data structure could have a trivial dtor and still expect destruction of referent by shared_ptr to be ordered after all uses of that data structure. 2) even if a) there was an internal mutex in the data structure; b) all members incl dtor lock that mutex, ...Swob
... the mutex's lifetime ends at the end of the dtor, so the compiler should be able to omit locking altogether! Any mutex op that comes immediately before the end of the mutex's lifetime ("immediately" meaning no thread-blocking operation occurs in between) is useless and can be removed; atomic ops just before the std::atomic object's destruction can be turned into non-atomic ops. That's because if destruction occurs immediately after, other threads can't determine that it has not already happened. (For the same reason, sometimes a non-atomic op can be reordered before a load-acquire.)Swob

The first implementation is correct, while the second one is most likely incorrect. "Most likely" because theoretically there could be some synchronization code that is not shown here. But even then it would be bad design, since we would rely on external synchronization instead of encapsulating it. It would still be highly unusual and confusing.

In general, acquire and release always come in pairs. An acquire-fence by itself does not achieve anything; it requires a release-operation or release-fence to synchronize-with in order to establish a happens-before relation.
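
A minimal sketch of such a pairing, with invented names (flag, payload): the acquire-fence below only achieves something because a release-fence precedes the store whose value the relaxed load observes. Remove the release-fence and the assert may fire.

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;
std::atomic<bool> flag{false};

void writer() {
    payload = 42;
    std::atomic_thread_fence(std::memory_order_release);  // orders the write above...
    flag.store(true, std::memory_order_relaxed);          // ...before this store
}

void reader() {
    if (flag.load(std::memory_order_relaxed)) {
        std::atomic_thread_fence(std::memory_order_acquire);  // synchronizes-with the release-fence
        assert(payload == 42);  // happens-before is established
    }
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}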

To understand why and where we need acquire/release semantics, we need to understand our requirements: which operations do we need to order?

A shared pointer allows multiple references to a shared object to be held, potentially by multiple different threads (otherwise it would not make sense to make the ref counting thread-safe). All these threads may modify the object. Obviously these modifications must be synchronized somehow, but we don't care about the details here. However, destruction cannot be covered by the same synchronization method, since it is handled by the shared pointer itself.

So what we have to ensure is that all modifications (by all threads) happen-before the destruction of the object. This can be achieved by using acquire/release operations on the ref count. The one thread that performs the last decrement, the one which causes the ref count to drop to zero, has to synchronize-with all previous decrements. So that last thread has to use acquire, while all other threads have to use release. Since we do not know beforehand whether we will be the last thread, the simplest solution is to use memory_order_acq_rel for all. Alternatively, one can make this more explicit:

    if (m_ref.fetch_sub(1, std::memory_order_release) == 1) {
      m_ref.load(std::memory_order_acquire); // synchronize with all previous fetch_sub operations
      delete this;
    }

One often overlooked but nevertheless important detail is that this only works because we have a release sequence.

A release sequence headed by a release operation A on an atomic object M is a maximal contiguous sub-sequence of side effects in the modification order of M, where the first operation is A, and every subsequent operation is an atomic read-modify-write operation.

Usually an acquire-load synchronizes-with the release-store from which it takes its value, i.e., this is a synchronization between two threads. In this case, however, we need to synchronize with all threads which at some point held a reference.

Consider the following modification order of our ref count:

T1: store(1)
T2: fetch_add(1, relaxed)
T1: fetch_sub(1, release) <- start of release sequence 1
T3: fetch_add(1, relaxed)
T2: fetch_sub(1, release) <- start of release sequence 2
T3: fetch_sub(1, release) <- start of release sequence 3

Thread T3 performs the last decrement, which causes the ref count to drop to zero. Since all modifications except the initial store are done via read-modify-write operations, every release operation is the head of a release sequence that extends to the end of the modification order. In this example we have three active release sequences. When thread T3 then performs the acquire-load, it synchronizes-with the heads of all active release sequences, i.e., with all threads that previously performed a release-fetch_sub.

I am not a fan of reasoning about reorderings or other potential compiler optimizations. The standard does not say anything about such optimizations, but instead defines an abstract machine with defined (observable) behavior. With respect to concurrent execution, the formal concept that defines this behavior is the happens-before relation. Virtually all optimizations the compiler may perform are based on what is commonly referred to as the "as-if rule", which basically says that the compiler is free to perform all kinds of optimizations or code transformations, as long as it does not change the observable behavior.
In other words, if you correctly identify your required happens-before relations and ensure that they are established correctly, then you don't need to care about compiler optimizations at all.

If thinking about these optimizations helps with your mental model, that is fine. Just be aware that ultimately what counts is that the necessary happens-before relations are in place, because this is also the model that the compiler will use to analyze the code and decide which optimizations it may perform.

In the comments it is mentioned that the second implementation is used in OpenSSL, but there we have the following comment:

/*
 * Changes to shared structure other than reference counter have to be
 * serialized. And any kind of serialization implies a release fence. This
 * means that by the time reference counter is decremented all other
 * changes are visible on all processors. Hence decrement itself can be
 * relaxed. In case it hits zero, object will be destructed. Since it's
 * last use of the object, destructor programmer might reason that access
 * to mutable members doesn't have to be serialized anymore, which would
 * otherwise imply an acquire fence. Hence conditional acquire fence...
 */

A shared_ptr allows multiple threads to keep a reference to a shared resource, but modifications of that resource must still be synchronized. My understanding of this comment is that they rely on the release-semantic of this synchronization, so that the acquire-fence can synchronize with it. I find this rather suspect TBH. I haven't looked at the rest of the code, so it might actually be correct, but it is at least highly unusual!

Plimsoll answered 14/3, 2023 at 14:23 Comment(0)
