Can someone help spot the errors in my low lock list?
I've written a low lock list in C++ on Windows 32-bit. I'm getting BIG improvements over using critical sections, but I'd like someone to sanity-check that what I'm doing is correct and that there aren't any errors in it:

#ifndef __LOW_LOCK_STACK_H_
#define __LOW_LOCK_STACK_H_

template< class T > class LowLockStack
{
protected:
    struct Entry
    {
        Entry*  pNext;
        T*      pData;
    };
    union Header
    {
        __int64         m_XChg;
        struct
        {
                Entry*  m_pNext;
                __int16 m_Depth;
                __int16 m_Counter;
        };
    };

    Header      m_Header;
public:
    LowLockStack()
    {
        m_Header.m_pNext   = NULL;
        m_Header.m_Depth   = 0;
        m_Header.m_Counter = 0;
    }

    ~LowLockStack()
    {
    }

    void PushEntry( T* pData )
    {
        Entry* pEntry   = new Entry;
        pEntry->pData   = pData;

        Header header;
        Header xchg;
        do
        {
            xchg.m_XChg   = m_Header.m_XChg;

            header.m_pNext  = pEntry;
            header.m_Depth  = xchg.m_Depth + 1;
            header.m_Counter = xchg.m_Counter + 1;
            pEntry->pNext  = xchg.m_pNext;
        } while( _InterlockedCompareExchange64( &m_Header.m_XChg, header.m_XChg, xchg.m_XChg ) != xchg.m_XChg );
    }

    T* PopEntry()
    {
        Entry* pEntry   = NULL;
        Header header;
        Header xchg;
        do
        {
            xchg.m_XChg    = m_Header.m_XChg;

            pEntry                  = xchg.m_pNext;
            if ( pEntry == NULL )
            {
                 return NULL;
            }

            header.m_pNext  = pEntry->pNext;
            header.m_Depth  = xchg.m_Depth - 1;

        } while( _InterlockedCompareExchange64( &m_Header.m_XChg, header.m_XChg, xchg.m_XChg ) != xchg.m_XChg );

        T* pRet = pEntry->pData;
        delete pEntry;

        return pRet;
    }

    __int32 GetDepth()
    {
        return m_Header.m_Depth;
    }
};

#endif

If there are no errors (which I doubt ;)) then think of it as a reference implementation :D

Edit: I've updated the code taking into account a number of criticisms.

Meteoroid answered 11/12, 2009 at 16:24 Comment(3)
Can you quantify the perf gain you see over critical sections? How contended is access to the list? There may be ways to get even better performance than lock-free (for instance, per-thread queues).Ramification
Quantify the performance gain? I've not done enough testing to quantify it. What I CAN say is that using CriticalSections performance DECREASED as I threw more threads at the problem. In fact a single thread gave by far the best performance. With the new system a threaded section of the code dropped from taking ~0.6 seconds to ~0.2 when split across 4 threads. Not perfect but a hell of an improvement over the old system, I'm sure you'll agree.Meteoroid
Of course the contention issues are pronounced due to the speed that each task goes. I had no idea Log testing 16384 MFCCs of 12 features could be quite as fast as it was. The old code spent the vast majority of the time contending the locks. Obviously with slower running tasks there would be less contention ...Meteoroid
You do not synchronize access to the list header member. This is bad on at least two levels:

  • assigning values to the list header may not be as atomic as you think, which means that an unsynchronized read operation can potentially get a corrupted (torn) value.

  • another, more probable issue with this is that if your box has multiple cores, every one of them can have its own copy of the value in its processor cache. To make them synchronize the values you need a memory barrier.

Impersonality answered 11/12, 2009 at 16:42 Comment(11)
InterlockedCompareExchange64 provides a memory barrier.Ramification
Then this is even worse than it seems - a memory barrier is an expensive operation, and this implementation will have O(n) of them for a list with n elementsImpersonality
There's typically no escaping memory barriers. An implementation using locks, even in the case of no contention, would still have memory barriers (if locks didn't provide barriers, they'd be useless.)Ramification
Absolutely - but you do not have to have this many of them. My guess is that one per call should be enoughImpersonality
Ideally it is one per call. The retry is only done if the head node gets changed after fetching the data but before the exchange.Ramification
@Ramification BTW in one particular case I found a way to escape the locks (sort of) check this out: bit.ly/7NzATxImpersonality
Well, in answer to point 1 .. surely this isn't a major problem, as it will simply cause the CompareExchange to fail and it will spin and try again? As for 2 ... yeah, CompareExchange IS a memory barrier so I'm not worried about that ...Meteoroid
@Meteoroid - it may just throw an exceptionImpersonality
How could it throw an exception? I think I'm confused as to what you mean then ...Meteoroid
This is a pointer to some memory location. If you replace one valid pointer with another, both represent a valid memory location, but if you try to use the pointer value while only the first 2 bytes have been replaced with the bytes of the new value - there is no guarantee the result is a valid pointerImpersonality
@mfeingol: massively late response, but that doesn't happen. The compare-exchange is an atomic operation, so you can guarantee that the whole pointer gets swapped across in one move (and no other processor can stop that atomicity).Meteoroid
As you discovered, lock-free programming is very difficult to get right.

Windows already has support for lock-free singly linked lists (http://msdn.microsoft.com/en-us/library/ms684121(VS.85).aspx); you should try using that rather than rolling your own.

Ramification answered 11/12, 2009 at 16:46 Comment(1)
No, I'm fully aware of it. But the reason I've done this is so that I can easily re-implement it for other platforms ... It's more of a learning exercise than anything :)Meteoroid
The most obvious error is the name you've given it. Regardless of the fact that you've implemented it as a linked list, what you've implemented is a stack.

Dickey answered 11/12, 2009 at 16:49 Comment(3)
What's wrong with using cmpxchg16b instead? That would solve the 64-bit issues ...Meteoroid
@Goz: cmpxchg16b is fine, except that VC++ doesn't provide a direct front-end for it and not all processors include it. From VC++, you get InterlockedCompare64Exchange128 instead. I think that's enough for what you're doing right now, but you need to be careful that you only need the comparison on 64 bits, even though you're exchanging 128.Dickey
@Goz: Oops -- somehow or other I missed that. Thanks for the pointer.Dickey
Consider what would happen in the following sequence of events when there are two items, A and B, in the list (stack really) like head -> A -> B, count is 2:

  1. Thread 1 starts the pop() call but gets preempted right before _InterlockedCompareExchange64()
  2. Thread 2 removes both items, A and B, from the stack and then puts two new items back onto the stack and the top item happens to be allocated at the same address as A, so we have, say head -> A -> D. Note that the count is back to 2
  3. Thread 1 resumes and successfully performs the CAS ( _InterlockedCompareExchange64()). Now head points to B, which was deallocated (bad) and D is lost (bad).

This is a classic ABA problem. You should be using the second word as a generation number instead of an item count, i.e. never decrement it.

There's a mailing list discussion going on right now about experimental boost::lockfree library.

Also take a look at Herb Sutter's lock-free queue - this is a different approach where a dummy node keeps producer and consumer from stepping on each other.

Goldie answered 11/12, 2009 at 17:43 Comment(4)
OK. That's a good point. I assume I could easily get round this by making the count only 16-bit and then incrementing a number in the high 16 bits each time a pop is made? That way I only run into the problem if 64K items are added while the other thread is spinning, which seems a tad unlikely ...Meteoroid
Yes, but you are just reducing the probability :)Goldie
Point taken ... btw I've implemented Herb Sutter's approach. I found the critical-section-based system outperformed it ... I was rather unimpressed, ESPECIALLY given how impressed I usually am with Mr Sutter ...Meteoroid
Looking at Microsoft's SList .. they appear to be doing what I suggested above ... I should properly reverse engineer their code to see what they do, though ...Meteoroid
