The problem
I want to allocate memory in one thread, and safely "lend" the pointer to another thread so it can read that memory.
I'm using a high-level language that translates to C. The high-level language has threads (with an unspecified threading API, since it's cross-platform -- see below) and supports standard C multithreading primitives, like atomic compare-exchange, but these are not really documented (no usage examples). The constraints of this high-level language are:
- Each thread executes an event-processing infinite loop.
- Each thread has its own local heap, managed by a custom allocator.
- Each thread has one "input" message queue, which can contain messages from any number of other threads.
- The message passing queues are:
- For fixed-type messages
- Using copying
This is impractical for large messages (I don't want the copy) or variable-sized messages (I think the array size is part of the type). I want to send such messages, and here's an outline of how I plan to achieve it:
- A message (either a request or a reply) can either store the "payload" inline (copied, fixed limit on total values size), or a pointer to data in the sender's heap
- The message contents (data in sender's heap) is owned by the sending thread (allocate and free)
- The receiving thread sends an ack to the sending thread when it is done with the message contents
- The sending thread must not modify the message contents after sending them, until it receives that ack
- Memory that is being written to must never be read concurrently before the writing is done; this should be guaranteed by the message-queue workflow
I need to know how to ensure that this works without data races. My understanding is that I need to use memory fences, but I'm not entirely sure which one (ATOMIC_RELEASE, ...) and where in the loop (or if I need any at all).
Portability considerations
Because my high-level language needs to be cross-platform, I need the answer to work on:
- Linux, MacOS, and optionally Android and iOS
- using pthreads primitives to lock message queues: pthread_mutex_init, pthread_mutex_lock, and pthread_mutex_unlock
- Windows
- using Critical Section Objects to lock message queues: InitializeCriticalSection, EnterCriticalSection, and LeaveCriticalSection
If it helps, I'm assuming the following architectures:
- Intel/AMD PC architecture for Windows/Linux/MacOS(?).
- unknown (ARM?) for iOS and Android
And using the following compilers (you can assume a "recent" version of all of them):
- MSVC on Windows
- clang on Linux
- Xcode on macOS/iOS
- CodeWorks for Android on Android
I've only built on Windows so far, but when the app is done, I want to port it to the other platforms with minimal work. Therefore I'm trying to ensure cross-platform compatibility from the start.
Attempted Solution
Here is my assumed work-flow:
- Read all the messages from the queue, until it's empty (only block if it was totally empty).
- Call some "memory fence" here?
- Read the messages contents (target of pointers in messages), and process the messages.
- If the message is a "request", it can be processed, and new messages buffered as "replies".
- If the message is a "reply", the message content of the original "request" can be freed (implicit request "ack").
- If the message is a "reply", and it itself contains a pointer to "reply content" (instead of an "inline reply"), then a "reply-ack" must be sent too.
- Call some "memory fence" here?
- Send all the buffered messages into the appropriate message queues.
Real code is too large to post. Here is simplified (just enough to show how the shared memory is accessed) pseudocode using a mutex (like the message queues):
static pointer p = null
static mutex m = ...
static thread_A_buffer = malloc(...)

Thread-A:
  do:
    // Send pointer to data
    int index = findFreeIndex(thread_A_buffer)
    // Assume a different value (not 42) every time
    thread_A_buffer[index] = 42
    // Call some "memory fence" here (after writing, before sending)?
    lock(m)
    p = &(thread_A_buffer[index])
    signal()
    unlock(m)
    // Wait for processing
    // (in reality, would wait for a second signal...)
    pointer p_a = null
    do:
      // sleep
      lock(m)
      p_a = p
      unlock(m)
    while (p_a != null)
    // Free data
    thread_A_buffer[index] = 0
    freeIndex(thread_A_buffer, index)
  while true

Thread-B:
  while true:
    // Wait for data
    pointer p_b = null
    while (p_b == null)
      lock(m)
      wait()
      p_b = p
      unlock(m)
    // Call some "memory fence" here (after receiving, before reading)?
    // Process data
    print *p_b
    // Say we are done
    lock(m)
    p = null
    // (in reality, would send a second signal...)
    unlock(m)
Would this solution work? Reformulating the question: does Thread-B print "42"? Always, on all the considered platforms and OSes (pthreads and Windows CS)? Or do I need to add other threading primitives, such as memory fences?
Research
I've spent hours looking at many related SO questions and have read some articles, but I'm still not totally sure. Based on @Art's comment, I probably don't need to do anything. I believe this is based on this statement from the POSIX standard, 4.12 Memory Synchronization:
[...] using functions that synchronize thread execution and also synchronize memory with respect to other threads. The following functions synchronize memory with respect to other threads.
My problem is that this sentence doesn't clearly specify whether it means "all of the accessed memory" or "only the memory accessed between the lock and the unlock". I have read people arguing for both interpretations, and even some implying it was written imprecisely on purpose, to give compiler implementers more leeway!
Furthermore, this applies to pthreads, but I need to know how it applies to Windows threading as well.
I'll choose any answer that, based on quotes/links from either a standard documentation, or some other highly reliable source, either proves that I don't need fences or shows which fences I need, under the aforementioned platform configurations, at least for the Windows/Linux/MacOS case. If the Windows threads behave like the pthreads in this case, I'd like a link/quote for that too.
The following are some (of the best) related questions/links I read, but the presence of conflicting information causes me to doubt my understanding.
- Does pthread_mutex_lock contains memory fence instruction?
- Memory Fences - Need help to understand
- Problem with pThread sync issue
- Memory Visibility Through pthread Library?
- clarifications on full memory barriers involved by pthread mutexes
- Memory model spec in pthreads
- http://www.hpl.hp.com/techreports/2005/HPL-2005-217R1.html
- http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap04.html#tag_04_11
- https://msdn.microsoft.com/en-us/library/windows/desktop/ms684208(v=vs.85).aspx
Comments:
- foo = malloc(x); foo->bar = 17; pthread_mutex_lock(mtx); global = foo; pthread_mutex_unlock(mtx); and expecting the 17 to be visible in another thread through global->bar after that thread locks mtx is not some mystical once-in-a-lifetime "I need standard references" type of question, it's Tuesday. – Mehta
- while (p_a != null) near the end of the Thread-A code should be while (p_a == null). The pseudo-code is confusing me. Code is not normally written with thread labels. It's just code that may have multiple threads executing it. Think of it as sender code segment and receiver code segment, then reason about execution overlays on those. – Heins
- InterlockedExchangePointer (it's much better than using CS, which is often overkill). Also, on Windows, CS and MemoryBarriers exist but are different concepts. IMHO, you really should read the whole "Synchronization" chapter msdn.microsoft.com/en-us/library/windows/desktop/ms686353.aspx before implementing anything. – Cramer