Read-write thread-safe smart pointer in C++, x86-64
I am developing a lock-free data structure and have run into the following problem.

I have a writer thread that creates objects on the heap and wraps them in a smart pointer with a reference counter. I also have many reader threads that work with these objects. The code can look like this:

SmartPtr ptr;

class Reader : public Thread {
    virtual void Run() {
        for (;;) {
            SmartPtr local(ptr);
            // do something with the object
        }
    }
};

class Writer : public Thread {
    virtual void Run() {
        for (;;) {
            SmartPtr newPtr(new Object);
            ptr = newPtr;
        }
    }
};

int main() {
    Pool* pool = SystemThreadPool();
    pool->Run(new Reader());
    pool->Run(new Writer());
    for (;;)
        ; // wait for crash :(
}
}

When I create a thread-local copy of ptr, it involves at least two steps:

  1. Read the address.
  2. Increment the reference counter.

I can't perform these two operations atomically, so sometimes my readers end up working with a deleted object.

The question is: what kind of smart pointer should I use to make read-write access from several threads, with correct memory management, possible? A solution should exist, since Java programmers don't even have to care about such a problem, relying on the fact that all objects are references and are deleted only when nobody uses them.

For PowerPC I found http://drdobbs.com/184401888, which looks nice, but it uses the Load-Linked and Store-Conditional instructions, which we don't have on x86.

As far as I understand, boost pointers provide such functionality only by using locks. I need a lock-free solution.

Ezara answered 4/11, 2011 at 9:30 Comment(9)
std::shared_ptr? (Or, boost::shared_ptr if your implementation doesn't support it yet.)Content
Is there something wrong with boost::shared_ptr?Paisano
I'm pretty sure JVMs use locks. Is execution speed the reason you want to avoid locks? Check earlier questions on this topic; but I wonder if you don't have to select either a lock-based solution, one dependent on specific instructions or C++11 language support, or an unsafe solution such as the one you are working with.Ca
Why lock-free? Is it the bottle-neck of your app?Delineation
I don't think std::shared_ptr has support for atomic assignments.Colombi
@PontusGagge: The JVM does not use locks, and does not need to. It does no reference counting, so copying a reference is a single atomic instruction (on the other hand, GC usually stops the whole world...)Maury
@ronag: The operations both in boost::shared_ptr and std::shared_ptr are thread safe to some extent (i.e. as long as only one thread is modifying a particular shared_ptr, even if many threads access the same reference counted object through different shared_ptrs)Maury
@David: I'm aware that shared_ptr has some thread safety. However, that only covers the reference counting, right? An assignment as in the question is not thread-safe as far as I know.Colombi
@PontusGagge, execution speed is important; besides, a lock-free algorithm is desirable.Ezara

boost::shared_ptr has atomic_store, which uses a "lock-free" spinlock that should be fast enough for 99% of possible cases.

boost::shared_ptr<Object> ptr;

class Reader : public Thread {
    virtual void Run() {
        for (;;) {
            boost::shared_ptr<Object> local(boost::atomic_load(&ptr));
            // do something with the object
        }
    }
};

class Writer : public Thread {
    virtual void Run() {
        for (;;) {
            boost::shared_ptr<Object> newPtr(new Object);
            boost::atomic_store(&ptr, newPtr);
        }
    }
};

int main() {
    Pool* pool = SystemThreadPool();
    pool->Run(new Reader());
    pool->Run(new Writer());
    for (;;)
        ;
}
}

EDIT:

In response to the comment below, the implementation is in "boost/shared_ptr.hpp":

template<class T> void atomic_store( shared_ptr<T> * p, shared_ptr<T> r )
{
    boost::detail::spinlock_pool<2>::scoped_lock lock( p );
    p->swap( r );
}

template<class T> shared_ptr<T> atomic_exchange( shared_ptr<T> * p, shared_ptr<T> r )
{
    boost::detail::spinlock & sp = boost::detail::spinlock_pool<2>::spinlock_for( p );

    sp.lock();
    p->swap( r );
    sp.unlock();

    return r; // return std::move( r )
}
Colombi answered 4/11, 2011 at 9:42 Comment(14)
Yep, show us an implementation of atomic_swap and your answer will be complete.Jaleesa
The implementation is in boost/shared_ptr.hpp. It's a simple spin-lock as I mentioned in my answer.Colombi
+1 Thanks. I've suggested a lock-free implementation in another answer, but a spinlock would be simpler. Just as good too, provided that it doesn't fall back to a mutex lock after a certain number of spins.Jaleesa
Just make sure you're running the threads on different cores, otherwise a spin-lock is pretty counter-productive, and a mutex fallback can actually be very beneficial!Lajoie
Don't you also need the corresponding atomic_load in the reader thread?Babineaux
Well, thanks, it's not the worst solution (especially if we have more than 41 spinlocks in the pool, as we have in boost). But it's not lock-free; I would prefer a lock-free one.Ezara
My advice is that you keep it simple; you will NOT see any difference between this spinlock and a lock-free solution unless you have a lot of threads contending (8 or more). Remember that a CAS instruction is not free, so if you have low contention, a spinlock might actually be faster. That's my two cents on the matter. Good luck with either solution; RobH's solution will be lock-free.Colombi
@Ronag I deal with 16 threads. Yeah, CAS is expensive, but spinlock is also implemented with CAS.Ezara
@Rizar: You are right that the spinlock uses CAS, but it's only a 4-byte CAS compared to a 16-byte one. If you deal with that many threads, there might be enough contention to bother with a more complicated lock-free solution.Colombi
@zennehoy: Why does it matter if it runs on the same core or not? a spinlock will almost immediately yield to another thread.Colombi
@ronag: That's precisely what a spinlock will not do. A spinlock is a busy wait, effectively a while loop that loops until the lock is released. This is great when synchronizing with a thread running on a different core (since no context switch is needed), but awful if both threads are running on the same core, since a context switch would be necessary and the spin-lock delays the context switch until one is forced by the OS.Lajoie
@zennehoy: I think it depends on the implementation. I've seen plenty of spin-locks that yield to other threads, both TBB and boost spin-locks will yield.Colombi
@ronag: As soon as the spin-lock yields (causes a context switch), it ceases to be a true spin-lock. Note that I'm not arguing that yielding isn't a good idea - it is, specifically in the case where the thread holding the lock is not currently running.Lajoie
"a "lock-free" spinlock" Huh?Stagner

With some jiggery-pokery you should be able to accomplish this using InterlockedCompareExchange128. Store the reference count and pointer in a two-element __int64 array. If the reference count is in array[0] and the pointer in array[1], the atomic update would look like this:

while(true)
{
    __int64 comparand[2];
    comparand[0] = refCount;   // the values we last observed
    comparand[1] = pointer;
    if(1 == InterlockedCompareExchange128(
        array,         // destination; must be 16-byte aligned
        pointer,       // exchange high part: the (unchanged) pointer
        refCount + 1,  // exchange low part: the incremented count
        comparand))
    {
        break; // Pointer is ready for use. Exit the while loop.
    }
    // On failure, comparand receives the current contents of array;
    // update refCount and pointer from it and retry.
    refCount = comparand[0];
    pointer  = comparand[1];
}

If an InterlockedCompareExchange128 intrinsic function isn't available for your compiler then you may use the underlying CMPXCHG16B instruction instead, if you don't mind mucking around in assembly language.

Jaleesa answered 4/11, 2011 at 10:4 Comment(12)
Shouldn't InterlockedCompareExchange64 suffice for x86?Lajoie
I see that the CMPXCHG16B instruction is available in the x86-64 instruction set, so whatever compiler intrinsic function uses that will work. Or the OP could dive into an __asm section :-oJaleesa
@Lajoie Yes, it would. OP is in 64 bit though, so InterlockedCompareExchange128 is necessary to deal with both a pointer and a reference count atomically.Jaleesa
Erm, -1? Does somebody have a problem with this answer?Jaleesa
x86-64 == x64, so InterlockedCompareExchange128 would be available. I'm just saying that the solution will work on both x86 and x64, either with I-C-E-64 or -128 respectively. No idea why this got a down-vote, have an upvote.Lajoie
@RobH: I misread the OP's question. Please edit your question (add a whitespace) so that I can upvote it.Colombi
@Ezara Usually a smart pointer implementation will involve a reference-count-and-pointer object which is itself shared between smart pointer instances. The code above will help with updating that shared object atomically. Copying the pointer would involve updating the shared object (to bump the reference count and check whether the underlying object has been deleted already) then initialising a new smart pointer with a pointer to that shared object.Jaleesa
@RobH: Undeleting my answer again, since this is exactly the problem I had and solved by delaying the delete.Lajoie
I'm sure it is a solvable problem, as boost::shared_ptr manages to do it. Presumably that doesn't use a global mega-lock (somebody earlier says that it uses locks), so that lock will be part of the shared object. So, it must be possible to control the lifetime of that shared object safely.Jaleesa
@RobH: Thinking about this some more, I think shared_ptr doesn't have this problem. When copying a shared_ptr, you always need a second shared_ptr (the one you're copying from) that will remain valid throughout the copy. Thus the reference count can never go down to 0 during the copy, and nothing can get deleted.Lajoie
I was thinking along those lines myself. It occurred to me though that the shared_ptr might be reset() from another thread.Jaleesa
@Jaleesa "that the shared_ptr might be reset() from another thread." this is not correct usage.Stagner

The solution proposed by RobH doesn't work. It has the same problem as the original question: by the time you access the reference-count object, it might already have been deleted.

The only way I see of solving the problem without a global lock (as in boost::atomic_store) or conditional read/write instructions is to somehow delay the destruction of the object (or of the shared reference-count object, if one is used). So zennehoy has a good idea, but his method is too unsafe.

The way I might do it is by keeping copies of all the pointers in the writer thread so that the writer can control the destruction of the objects:

class Writer : public Thread {
    virtual void Run() {
        list<SmartPtr> ptrs; // list that holds all the old ptr values

        for (;;) {
            SmartPtr newPtr(new Object);
            if (ptr)
                ptrs.push_back(ptr); // push previous pointer into the list
            ptr = newPtr;

            // Periodically go through the list and destroy objects that
            // are not referenced by other threads
            for (auto it = ptrs.begin(); it != ptrs.end(); )
                if (it->refCount() == 1)
                    it = ptrs.erase(it);
                else
                    ++it;
        }
    }
};

However, there are still requirements for the smart pointer class. This doesn't work with shared_ptr, as its reads and writes are not atomic. It almost works with boost::intrusive_ptr. Assignment on intrusive_ptr is implemented like this (pseudocode):

//create temporary from rhs
tmp.ptr = rhs.ptr;
if(tmp.ptr)
    intrusive_ptr_add_ref(tmp.ptr);

//swap(tmp,lhs)
T* x = lhs.ptr;
lhs.ptr = tmp.ptr;
tmp.ptr = x;

//destroy temporary
if(tmp.ptr)
    intrusive_ptr_release(tmp.ptr);

As far as I understand, the only thing missing here is a compiler-level memory fence before lhs.ptr = tmp.ptr;. With that added, both reading rhs and writing lhs would be thread-safe under strict conditions: 1) x86 or x64 architecture, 2) atomic reference counting, 3) the rhs refcount must not go to zero during the assignment (guaranteed by the Writer code above), and 4) only one thread writing to lhs (using CAS you could have several writers).

Anyway, you could create your own smart pointer class based on intrusive_ptr with the necessary changes. That's definitely easier than re-implementing shared_ptr. And besides, if you want performance, intrusive pointers are the way to go.

Babineaux answered 4/11, 2011 at 21:11 Comment(1)
Thanks, your answer is much like what I was thinking.Ezara

The reason this works so much more easily in Java is garbage collection. In C++, you have to manually ensure that a value is not just starting to be used by a different thread when you want to delete it.

A solution I've used in a similar situation is to simply delay the deletion of the value. I create a separate thread that iterates through a list of things to be deleted. When I want to delete something, I add it to this list with a timestamp. The deleting thread waits until some fixed time after this timestamp before actually deleting the value. You just have to make sure that the delay is large enough to guarantee that any temporary use of the value has completed.

100 milliseconds would have been enough in my case; I chose a few seconds to be safe.

Lajoie answered 4/11, 2011 at 9:52 Comment(3)
Sounds to me like a rather complicated and risky solution.Colombi
It was the only solution I could come up with for the deletion of nodes in a lock-free linked list. The problem was that by the time a node was marked for deletion, another thread could have started using it (e.g. by iterating to it). If you know of any alternatives, I'd love to hear about them!Lajoie
Both mine and RobH's are better alternatives.Colombi
