Seems to me that if your interference app were using new/delete (malloc/free), then the interference app would slow the non-recycle test more. But I don't know how your interference test is implemented.
Depending on how you recycle (i.e. if you use pthread mutexes, god forbid), your recycle code could be slow (gcc atomic ops would be 40x faster at implementing recycle).
Malloc has been aware of threads for a long time, in some variation, on at least some platforms. Use the compiler switches on gcc to be sure you get it. Newer algorithms maintain pools of small memory chunks for each thread, so there is no or little blocking if your thread has the small item available. I have oversimplified this and it depends on what malloc your system is using. Plus, if you go and allocate millions of items to do a test... well, then you won't see that effect, because the small item pools are limited in size. Or maybe you will. I don't know.

If you freed the item right after allocating, you would be more likely to see it. Freed small items go back into the small item lists rather than the shared heap. Although "what happens when thread B frees an item allocated by thread A" is a problem that may or may not be dealt with on your version of malloc, and may not be dealt with in a non-blocking manner. For sure, if you didn't immediately free during a large test, then the thread would have to refill its small item list many times. That can block if more than one thread tries. Finally, at some point your process' heap will ask the system for heap memory, which can obviously block.
So are you using small memory items? For your malloc I don't know what small would be, but anything < 1k is for sure small. Are you allocating and freeing one after the other, or allocating thousands of nodes and then freeing thousands of nodes? Was your interference app allocating? All of these things will affect the results.
How to recycle with atomic ops (CAS = compare and swap):
First, add a pNextFreeNode to your node object. I used void*; you can use your type. This code is for 32-bit pointers, but works for 64-bit as well. Then make a global recycle pile:
void *_pRecycleHead; // global head of recycle list.
Add to recycle pile:
void *Old;
while (1) { // concurrency loop
  Old = _pRecycleHead; // copy the state of the world. We operate on the copy
  pFreedNode->pNextFreeNode = Old; // chain the new node to the current head of recycled items
  if (CAS(&_pRecycleHead, Old, pFreedNode)) // switch head of recycled items to new node
    break; // success
}
remove from pile:
void *Old;
while ((Old = _pRecycleHead) != NULL) { // concurrency loop, only look for recycled items if the head ain't null
  if (CAS(&_pRecycleHead, Old, ((TNode*)Old)->pNextFreeNode)) // switch head to head->next (cast to your node type; Old is a void*)
    break; // success
}
pNodeYouCanUseNow = Old;
Using CAS means the operation will succeed only if the item you are changing is still the Old value you pass in. If there is a race and another thread got there first, the old value will be different. In real life this race happens very, very rarely. CAS is only slightly slower than actually setting a value, so compared to mutexes... it rocks.
The remove from pile, above, has a race condition (the classic ABA problem) if you add and remove the same item rapidly. We solve that by adding a version # to the CAS'able data. If you update the version # at the same time as the pointer to the head of the recycle pile, you win. Use a union. It costs nothing extra to CAS 64 bits.
union TRecycle {
  struct {
    int iVersion;
    void *pRecycleHead;
  }; // we can set these. Note, I didn't name this struct. You may have to if you want ANSI
  unsigned long long n64; // we CAS this
};
Note: you will have to go to a 128-bit struct for a 64-bit OS. So the global recycle pile looks like this now:
TRecycle _RecycleHead;
Add to recycle pile:
while (1) { // concurrency loop
  TRecycle New, Old;
  Old.n64 = _RecycleHead.n64; // copy state
  New.n64 = Old.n64; // new state starts as a copy
  pFreedNode->pNextFreeNode = Old.pRecycleHead; // link item to be recycled into recycle pile
  New.pRecycleHead = pFreedNode; // make the new state
  New.iVersion++; // adding an item to the list increments the version
  if (CAS(&_RecycleHead.n64, Old.n64, New.n64)) // now if the version changed... we fail
    break; // success
}
remove from pile:
while (1) { // concurrency loop
  TRecycle New, Old;
  Old.n64 = _RecycleHead.n64; // copy state
  if (Old.pRecycleHead == NULL)
    break; // pile is empty, nothing to grab
  New.n64 = Old.n64; // new state starts as a copy
  New.pRecycleHead = ((TNode*)Old.pRecycleHead)->pNextFreeNode; // new skips over the first item in the recycle list so we can have that item (cast to your node type)
  New.iVersion++; // taking an item off the list increments the version
  if (CAS(&_RecycleHead.n64, Old.n64, New.n64)) // we fail if the version is different
    break; // success
}
pNodeYouCanUseNow = Old.pRecycleHead;
I bet if you recycle this way you will see a perf increase.
mmap and other system calls to acquire memory can take quite some time. There might not be a single answer, though; I imagine some implementations may not block when one thread reuses memory while another performs a mmap, but do block if both must perform a system call, etc... – Cu