We are currently stuck on a fairly old version of Boost, and until we upgrade I need an atomic CAS (compare-and-swap) operation in C++ for my code. (We are not using C++0x yet either.)
I wrote the following CAS function:
inline uint32_t CAS(volatile uint32_t *mem, uint32_t with, uint32_t cmp)
{
    uint32_t prev = cmp;
    // This version by Mans Rullgard of Pathscale
    __asm__ __volatile__ ( "lock\n\t"
                           "cmpxchg %2,%0"
                           : "+m"(*mem), "+a"(prev)
                           : "r"(with)
                           : "cc");
    return prev;
}
The code that uses this function looks roughly like this:
void myFunc(uint32_t &masterDeserialize )
{
    std::ostringstream debugStream;
    unsigned int tid = pthread_self();
    debugStream << "myFunc, threadId: " << tid << " masterDeserialize= " << masterDeserialize << " masterAddress = " << &masterDeserialize << std::endl;
    // memory fence
    __asm__ __volatile__ ("" ::: "memory");
    uint32_t retMaster = CAS(&masterDeserialize, 1, 0);
    debugStream << "After cas, threadid = " << tid << " retMaster = " << retMaster << " MasterDeserialize = " << masterDeserialize << " masterAddress = " << &masterDeserialize << std::endl;
    if(retMaster != 0) // not master deserializer.
    {
        debugStream << "getConfigurationRowField, threadId: " << tid << " NOT master. retMaster = " << retMaster << std::endl;
        // DO SOMETHING...
    }
    else
    {
        debugStream << "getConfigurationRowField, threadId: " << tid << " MASTER. retMaster = " << retMaster << std::endl;
        // DO SOME LOGIC
        // Signal we're done deserializing.
        masterDeserialize = 0;
    }
    std::cout << debugStream.str();
}
My test of this code spawns 10 threads, and signals all of them to call the function with the same masterDeserialize variable.
This works well most of the time, but once every few thousand to few million test iterations, two threads both take the path that acquires the MASTER lock.
I'm not sure how this is possible, or how to avoid it.
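To tell whether two threads are really inside the MASTER path at the same time, or whether one simply acquires the lock after the other has already reset it, a stripped-down test along these lines may help. This is only a sketch: `g_lock`, `g_inside`, `g_overlaps`, `worker` and `runOverlapTest` are hypothetical names that are not part of the real code, and it assumes GCC's `__sync` builtins and pthreads are available.

```cpp
#include <pthread.h>
#include <stdint.h>
#include <cassert>

// Hypothetical globals for the sketch; none of these exist in the real code.
static volatile uint32_t g_lock = 0;      // plays the role of masterDeserialize
static volatile uint32_t g_inside = 0;    // threads currently in the master path
static volatile uint32_t g_overlaps = 0;  // times a second thread was seen inside

static void *worker(void *arg)
{
    const int iterations = *static_cast<int *>(arg);
    for (int i = 0; i < iterations; ++i) {
        // Try to become master: swap 0 -> 1, old value 0 means we won.
        if (__sync_val_compare_and_swap(&g_lock, 0u, 1u) == 0u) {
            // Count ourselves in; any value other than 1 means real overlap.
            if (__sync_add_and_fetch(&g_inside, 1u) != 1u)
                __sync_add_and_fetch(&g_overlaps, 1u);
            __sync_sub_and_fetch(&g_inside, 1u);
            // Reset with a plain volatile store, as in the original code.
            g_lock = 0;
        }
    }
    return 0;
}

// Runs `threads` workers and returns how many overlapping entries were seen.
uint32_t runOverlapTest(int threads, int iterations)
{
    pthread_t ids[64];
    assert(threads > 0 && threads <= 64);
    g_lock = 0;
    g_inside = 0;
    g_overlaps = 0;
    for (int t = 0; t < threads; ++t)
        pthread_create(&ids[t], 0, worker, &iterations);
    for (int t = 0; t < threads; ++t)
        pthread_join(ids[t], 0);
    return g_overlaps;
}
```

If `runOverlapTest(10, 1000000)` keeps returning zero, the "two masters" seen in the log are more likely sequential acquisitions (one thread finishing and resetting the flag before the next one runs its CAS) than a broken CAS.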
I tried adding a memory fence before resetting masterDeserialize, thinking that out-of-order execution on the CPU might have an effect, but it made no difference to the result.
The test runs on a machine with many cores, and the code is compiled in debug mode, so GCC should not be reordering operations as an optimization.
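The text does not say which kind of fence was tried, so for clarity here is a sketch of the two kinds that are easy to confuse. The helper names `compilerBarrier`, `hardwareFence` and `resetWithFence` are hypothetical; the empty `asm` with a `"memory"` clobber and `__sync_synchronize()` are the GCC facilities assumed.

```cpp
#include <stdint.h>
#include <cassert>

// Compiler-only barrier: emits no instruction, it only stops GCC from
// moving memory accesses across this point. The CPU is not affected.
inline void compilerBarrier()
{
    __asm__ __volatile__ ("" ::: "memory");
}

// Full hardware barrier: GCC's builtin that issues a full memory fence.
inline void hardwareFence()
{
    __sync_synchronize();
}

// Hypothetical reset helper: fence first, then clear the flag, so that
// earlier writes are ordered before the store of 0.
inline void resetWithFence(volatile uint32_t *flag)
{
    hardwareFence();
    *flag = 0;
}
```

The `__asm__ __volatile__ ("" ::: "memory")` line in myFunc is of the first kind, so by itself it constrains only the compiler.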
Any suggestions as to what is wrong with the above?
EDIT: I tried using the GCC builtin instead of the assembly code and got the same result:
inline uint32_t CAS(volatile uint32_t *mem, uint32_t with, uint32_t cmp)
{
    return __sync_val_compare_and_swap(mem, cmp, with);
}
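For comparison, the builtin-based CAS above could be paired with a builtin-based release instead of the plain `masterDeserialize = 0` store. This is a minimal sketch and not the real code: `tryBecomeMaster` and `releaseMaster` are hypothetical names, and it assumes GCC's `__sync_bool_compare_and_swap` and `__sync_lock_release`, the latter being documented as normally writing 0 to the target with release semantics.

```cpp
#include <stdint.h>
#include <cassert>

// Hypothetical helper: returns true only for the caller that flips 0 -> 1.
inline bool tryBecomeMaster(volatile uint32_t *flag)
{
    return __sync_bool_compare_and_swap(flag, 0u, 1u);
}

// Hypothetical helper: clears the flag with a release barrier, so writes
// made while holding the flag are visible before the flag reads as 0.
inline void releaseMaster(volatile uint32_t *flag)
{
    __sync_lock_release(flag);
}
```

In myFunc this would mean calling `tryBecomeMaster(&masterDeserialize)` where the CAS is now and `releaseMaster(&masterDeserialize)` where the flag is reset.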
I am running on a multi-core, multi-CPU machine, but it is a virtual machine. Is it possible that the VM is somehow causing this behavior?
masterDeserialize should be an atomic operation as well. – Correspondent