Safely "lend" memory block to another thread in C, assuming no "concurrent access"

The problem

I want to allocate memory in one thread, and safely "lend" the pointer to another thread so it can read that memory.

I'm using a high-level language that translates to C. The high-level language has threads (with an unspecified threading API, since it's cross-platform -- see below) and supports standard C multi-threading primitives, like atomic compare-exchange, but this support isn't really documented (there are no usage examples). The constraints of this high-level language are:

  • Each thread executes an event-processing infinite loop.
  • Each thread has its own local heap, managed by some custom allocator.
  • Each thread has one "input" message queue, that can contain messages from any number of different other threads.
  • The message passing queues are:
    1. For fixed-type messages
    2. Using copying

Now this is impractical for large messages (I don't want the copy) or variable-sized messages (I think the array size is part of the type). I want to send such messages, and here's an outline of how I want to achieve it (a sketch of the message layout follows the list):

  • A message (either a request or a reply) can either store the "payload" inline (copied, with a fixed limit on the total payload size), or a pointer to data in the sender's heap
  • The message contents (data in the sender's heap) are owned by the sending thread (it allocates and frees them)
  • The receiving thread sends an ack to the sending thread when it is done with the message content
  • The sending thread must not modify the message contents after sending them, until it receives the ack
  • There should never be concurrent read access to memory that is still being written to; this should be guaranteed by the message queues' workflow
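
For concreteness, here is a sketch of what I imagine such a message could look like in C (all names and sizes here are illustrative, not from the actual implementation):

#include <stddef.h>
#include <stdint.h>

#define INLINE_PAYLOAD_MAX 64   /* fixed limit on the inline payload size */

typedef enum { MSG_REQUEST, MSG_REPLY, MSG_ACK } msg_kind;

typedef struct {
    msg_kind kind;
    int      sender_id;              /* so the receiver knows whom to ack */
    size_t   size;                   /* payload size in bytes */
    /* Either the payload is copied inline (small messages)... */
    uint8_t  inline_payload[INLINE_PAYLOAD_MAX];
    /* ...or this points into the sender's heap (large/variable messages).
       The sender owns that memory and must not touch it until acked. */
    const void *payload;             /* NULL when the payload is inline */
} message;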

I need to know how to ensure that this works without data races. My understanding is that I need memory fences, but I'm not entirely sure which kind (e.g. memory_order_release, ...) and where in the loop they belong (or whether I need any at all).


Portability considerations

Because my high-level language needs to be cross-platform, I need the answer to work on the platforms below (a sketch of a wrapper over both locking APIs follows the list):

  • Linux, MacOS, and optionally Android and iOS
    • using pthreads primitives to lock message queues: pthread_mutex_init and pthread_mutex_lock + pthread_mutex_unlock
  • Windows
    • using Critical Section Objects to lock message queues: InitializeCriticalSection, and EnterCriticalSection + LeaveCriticalSection
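
For illustration, a minimal sketch of how I picture hiding these two locking APIs behind one interface (all names are mine, not from any existing library):

#ifdef _WIN32
#include <windows.h>
typedef CRITICAL_SECTION queue_lock;
static void queue_lock_init(queue_lock *l)    { InitializeCriticalSection(l); }
static void queue_lock_acquire(queue_lock *l) { EnterCriticalSection(l); }
static void queue_lock_release(queue_lock *l) { LeaveCriticalSection(l); }
#else
#include <pthread.h>
typedef pthread_mutex_t queue_lock;
static void queue_lock_init(queue_lock *l)    { pthread_mutex_init(l, NULL); }
static void queue_lock_acquire(queue_lock *l) { pthread_mutex_lock(l); }
static void queue_lock_release(queue_lock *l) { pthread_mutex_unlock(l); }
#endif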

If it helps, I'm assuming the following architectures:

  • Intel/AMD PC architecture for Windows/Linux/MacOS(?).
  • unknown (ARM?) for iOS and Android

And using the following compilers (you can assume a "recent" version of all of them):

  • MSVC on Windows
  • clang on Linux
  • clang (via Xcode) on MacOS/iOS
  • CodeWorks for Android on Android

I've only built on Windows so far, but when the app is done, I want to port it to the other platforms with minimal work. Therefore I'm trying to ensure cross-platform compatibility from the start.


Attempted Solution

Here is my assumed work-flow:

  1. Read all the messages from the queue, until it's empty (only block if it was totally empty).
  2. Call some "memory fence" here?
  3. Read the messages contents (target of pointers in messages), and process the messages.
    • If the message is a "request", it can be processed, and new messages buffered as "replies".
    • If the message is a "reply", the message content of the original "request" can be freed (implicit request "ack").
    • If the message is a "reply", and it itself contains a pointer to "reply content" (instead of an "inline reply"), then a "reply-ack" must be sent too.
  4. Call some "memory fence" here?
  5. Send all the buffered messages into the appropriate message queues.

The real code is too large to post. Here is simplified pseudocode (just enough to show how the shared memory is accessed), using a mutex the way the message queues do:

static pointer p = null
static mutex m = ...
static thread_A_buffer = malloc(...)

Thread-A:
  do:
    // Send pointer to data
    int index = findFreeIndex(thread_A_buffer)
    // Assume different value (not 42) every time
    thread_A_buffer[index] = 42
    // Call some "memory fence" here (after writing, before sending)?
    lock(m)
    p = &(thread_A_buffer[index])
    signal()
    unlock(m)
    // wait for processing
    // in reality, would wait for a second signal...
    pointer p_a = null
    do:
      // sleep
      lock(m)
      p_a = p
      unlock(m)
    while (p_a != null)
    // Free data
    thread_A_buffer[index] = 0
    freeIndex(thread_A_buffer, index)
  while true

Thread-B:
  while true:
    // wait for data
    pointer p_b = null
    while (p_b == null)
      lock(m)
      wait()
      p_b = p
      unlock(m)
    // Call some "memory fence" here (after receiving, before reading)?
    // process data
    print *p_b
    // say we are done
    lock(m)
    p = null
    // in reality, would send a second signal...
    unlock(m)
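
For reference, here is roughly what a runnable pthreads version of the pseudocode above could look like (a sketch only: findFreeIndex is replaced by a fixed index, signal()/wait() are mapped to a pthread condition variable, and the condition waits are made predicate-based, since a bare wait() can miss a signal sent before the wait begins, or wake spuriously):

#include <pthread.h>
#include <stdio.h>

static int *p = NULL;                        /* shared "message" pointer */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static int thread_A_buffer[256];             /* stands in for the sender's heap */

static void *thread_A(void *arg) {
    (void)arg;
    int index = 0;                           /* stands in for findFreeIndex() */
    thread_A_buffer[index] = 42;             /* write the data before publishing */
    /* The question: is an explicit fence needed here, or do the mutex
       operations below synchronize memory on their own? */
    pthread_mutex_lock(&m);
    p = &thread_A_buffer[index];             /* publish the pointer */
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&m);
    /* Wait for B's "ack" (B sets p back to NULL when done). */
    pthread_mutex_lock(&m);
    while (p != NULL)
        pthread_cond_wait(&cv, &m);
    pthread_mutex_unlock(&m);
    thread_A_buffer[index] = 0;              /* safe to reuse the slot now */
    return NULL;
}

static void *thread_B(void *arg) {
    (void)arg;
    int *p_b;
    pthread_mutex_lock(&m);
    while ((p_b = p) == NULL)                /* predicate re-checked on wakeup */
        pthread_cond_wait(&cv, &m);
    pthread_mutex_unlock(&m);
    printf("%d\n", *p_b);                    /* should print 42 */
    pthread_mutex_lock(&m);
    p = NULL;                                /* "ack": hand the memory back to A */
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, thread_A, NULL);
    pthread_create(&b, NULL, thread_B, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}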

Would this solution work? Reformulating the question: does Thread-B print "42"? Always, on all the considered platforms and OSes (pthreads and Windows CS)? Or do I need to add other threading primitives, such as memory fences?


Research

I've spent hours looking at many related SO questions and read some articles, but I'm still not totally sure. Based on @Art's comment, I probably don't need to do anything. I believe this rests on the following statement from the POSIX standard, 4.12 Memory Synchronization:

[...] using functions that synchronize thread execution and also synchronize memory with respect to other threads. The following functions synchronize memory with respect to other threads.

My problem is that this sentence doesn't clearly specify whether they mean "all the accessed memory" or "only the memory accessed between lock and unlock." I have read people arguing for both cases, and even some implying it was written imprecisely on purpose, to give implementers more leeway!

Furthermore, this applies to pthreads, but I need to know how it applies to Windows threading as well.

I'll accept any answer that, based on quotes/links from either standard documentation or some other highly reliable source, either proves that I don't need fences or shows which fences I need, under the aforementioned platform configurations, at least for the Windows/Linux/MacOS case. If Windows threads behave like pthreads in this case, I'd like a link/quote for that too.

The following are some (of the best) related questions/links I read, but the presence of conflicting information causes me to doubt my understanding.

Deckert answered 3/11, 2017 at 7:41 Comment(23)
How do you signal between the threads that there are messages to be read? In general that mechanism should probably end up being the synchronisation point. You probably need something like "lock(queue); insert(queue, msg); unlock(queue);". In pthreads the various lock functions guarantee that any memory writes by one thread done before it unlocks a lock are visible in another thread after it locks the same lock (it's defined differently by the standard, but this is the safest way to look at it).Mehta
I'd love to know if there is a standard way of doing this in C, but I'm not sure there is. You just have to be disciplined and structure your code such that nothing uses the data in the pointer once it has been passed between threads.Unbraid
The message queue does some fancy stuff, to implement arbitrary object type copy automatically, but to actually transfer the data, they use a lock. Since the reader should block on empty queue, it's the most obvious design.Deckert
Such inter-thread comms, passing buffer/struct/whatever pointers over producer-consumer queues, is fine, and very common. I'm less enamoured with the idea that the message contents be permanently 'owned' by anything. I much prefer that any message stand alone and can be passed around as required, ie. the originator loses all 'ownership' rights as soon as it posts it off. It may get the message back later, but while it's on leave, it should be able to travel anywhere it wants :).Freiburg
The language provides a thread-local allocator, so I cannot 'give away' the memory to the receiving thread, because it would not be able to safely 'free' it. I forgot to mention that I plan to use a fixed-size thread pool, so I could build an N (N*N?) matrix of atomic variables, if this helps.Deckert
What is the mystery "high-level language"?Meghannmegiddo
Why can't you allocate memory in one thread, then pass the pointer using the provided message queues, and never look at it again in the first thread?Meghannmegiddo
@Meghannmegiddo It's Nim-lang. But since it transpiles to C (and C++ and JS), and I had a hard time getting answers about this kind of topic on their forum (limited community), I thought it was safer to leave this out and concentrate on the produced C code. Now for your second comment: that is what I hope to do, but I can't just guess/assume that it will work. I want to actually know that what I'm doing is correct.Deckert
What's the difference between a "credible source" and an SO user with an "opinion"? I don't think the way you ask for help really invites it. Anyway, speaking for Windows, my humble opinion is that your question is not very clear. A CS is the big, safe, single-process synchronization hammer for protecting a resource from simultaneous access. It doesn't say anything about operation reordering. So the answer still depends precisely on the final code. You will need a memory barrier if your code needs it. Merely using a CS does not make you safe. E.g.: show us some real code.Cramer
Maybe it's "opinion" for you, but for anyone who has done any threads, doing something like foo = malloc(x); foo->bar = 17; pthread_mutex_lock(mtx); global = foo; pthread_mutex_unlock(mtx); and expecting the 17 to be visible in another thread through global->bar after that thread locks mtx is not some mystical once in a lifetime "I need standard references" type of question, it's Tuesday.Mehta
@Art, I meant no offence; I'm asking for 'references' not because I don't trust you specifically, but because it seems to me many answers related to this topic are "opinions" in the sense that they don't usually 'point at the docs'. Coming from the JVM, where such things are explained in excruciating detail in the docs, one can always point to the passage where an answer comes from. My entire codebase/design depends on this code being correct. And AFAIK, you can't prove the code is OK with a test; it might just succeed because you're 'lucky', therefore I'm 'paranoid' about it.Deckert
@SebastienDiot The only official reference for pthreads is POSIX. The bit of POSIX you link to is pretty clear. If you don't do X, then bad things may happen. You do X, therefore bad things won't happen. The JVM gets into excruciating detail because it actually defines a full memory model and there are ways to avoid locks there. There is no way to avoid locks in pthreads (in practice you can mix stdatomic.h and pthreads, but AFAIK no standard defines it). POSIX defines a much weaker memory model so that non-portable, platform-specific optimizations aren't excluded.Mehta
There's no magic "between lock and unlock" state in any major extant architecture. The pthreads memory model may or may not prohibit it, but in practice you can be quite sure it does not exist.Urticaceous
@SebastienDiot I did a pretty in-depth restructuring of your question, I hope I got everything right. If you don't like it, you can roll it back, but I'd encourage you to try and keep something rather structured instead of the question + edits format. There is something I'm not sure I quite understood. Are you implementing the "big/variable message" language feature, so writing the C loops, or do you want to know if you should add primitives inside the high-level code? Can you maybe also point to what language this high-level language is?Elly
I removed my erroneous and incomplete answer. A friend/mentor pointed out that on ARM, PPC, IA64 and some others, global total order is not preserved across cores/sockets without a fence.Heins
@Elly I like the new version, but one or two things needed to be fixed. Basically, I want to "wrap" the existing message queues into something more flexible. I will do this in the "high level language" (Nim-lang), because other Nim users can then use it, even if they translate to Objective-C, instead of C, for example.Deckert
while (p_a != null) near the end of the Thread-A code should be while (p_a == null). The pseudo-code is confusing me. Code is not normally written with thread labels. It's just code that may have multiple threads executing it. Think of it as sender code segment and receiver code segment, then reason about execution overlays on those.Heins
@Art, your reference to Tuesday only applies on x86/64 platforms where a write to a memory address sets the dirty bit for that cache line and when that bit is set, all other cores and sockets always refresh any cache lines referencing the same memory prior to allowing a read from that address. To the extent that you can ignore time of flight for inter-core and inter-socket signalling, x86/64 platforms maintain global ordering between threads (ARM, PPC don't). Intra-thread reordering in the instruction pipelines or by compiler optimizations is still possible and must be accounted for.Heins
Only a strong fence will work on all platforms, but that may be overkill in this case on x86/64 systems. The problem is more intractable than I initially believed. Getting it always right requires deep knowledge of each platform's memory model (not explicitly defined in most cases). On top of that, different tool chains provide varying levels of explicit control. It's relatively simple when the shared data can be squeezed into a single atomic operation, but you still have to beware of compiler optimizations making incorrect assumptions, particularly where undefined behavior comes into play.Heins
@Heins No, it applies to all platforms that implement pthreads because the POSIX standard says so. Yes, I have worked with alphas, I have worked with sparcs configured with no memory ordering constraints whatsoever (and cache aliasing to make things more exciting), etc. I have even implemented locking schemes in a kernel myself. It doesn't matter what the CPU does, pthreads says that memory is synchronised by the lock functions, so the memory is synchronised by the lock functions. Btw. x86 does not maintain global ordering, never has guaranteed it and definitely hasn't since Core 2.Mehta
@jwdonahue I have reviewed the pseudocode again. The loops are as I wanted them to be. I suspect you do not yet understand the desired workflow. You have to imagine Thread-A and -B as 2 different modules communicating using global data. A sends and B receives. A needs to wait for B to say it read the last value, by setting p to null, before sending a new one.Deckert
If you just want to exchange pointers in a safe way on Windows, you can use InterlockedExchangePointer (it's much better than using a CS, which is often overkill). Also on Windows, CSes and memory barriers both exist but are different concepts. IMHO, you really should read the whole "Synchronization" chapter msdn.microsoft.com/en-us/library/windows/desktop/ms686353.aspx before implementing anything.Cramer
I'll have a look at it. But I can only use it if I can build a queue API that 'looks' identical on pthreads.Deckert

My review of the C++11 documentation, and of the similar wording in C11 (n1570), led me to the following understanding.

Data is safely usable between threads if some form of cooperative synchronization is performed between them. For example, if a queue pops an item under a mutex, and items were pushed onto the queue while that same mutex was held, then the memory readable in the second thread is exactly the memory that was written by the first thread.

That is because the compiler and the underlying CPU are not allowed to reorder side effects across that synchronization.
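
As a minimal sketch (assuming a fixed-capacity ring buffer; all names here are illustrative), such a mutex-protected queue could look like this:

#include <pthread.h>
#include <stdbool.h>

#define QUEUE_CAP 64

typedef struct {
    void           *items[QUEUE_CAP];
    int             head, tail, count;
    pthread_mutex_t mutex;
} queue;

static void queue_init(queue *q) {
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->mutex, NULL);
}

/* Push and pop take the same mutex, so everything the pushing thread
   wrote before queue_push() is visible to the thread that pops the item. */
static bool queue_push(queue *q, void *item) {
    pthread_mutex_lock(&q->mutex);
    bool ok = q->count < QUEUE_CAP;
    if (ok) {
        q->items[q->tail] = item;
        q->tail = (q->tail + 1) % QUEUE_CAP;
        q->count++;
    }
    pthread_mutex_unlock(&q->mutex);
    return ok;
}

static bool queue_pop(queue *q, void **out) {
    pthread_mutex_lock(&q->mutex);
    bool ok = q->count > 0;
    if (ok) {
        *out = q->items[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
    }
    pthread_mutex_unlock(&q->mutex);
    return ok;
}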

From n1570

An evaluation A inter-thread happens before an evaluation B if A synchronizes with B, A is dependency-ordered before B, or, for some evaluation X:

— A synchronizes with X and X is sequenced before B,

— A is sequenced before X and X inter-thread happens before B, or

— A inter-thread happens before X and X inter-thread happens before B

So to ensure the memory visible in the other thread is consistent, either of the following would guarantee the result:

  • A mutex protecting the shared access
  • An interlocked (atomic) write at the producer plus an interlocked read at the consumer

The interlocked (release) write causes all the preceding writes in thread A to become visible to thread B before B's (acquire) read observes the stored value.
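
A minimal sketch of this interlocked variant, using C11 atomics with release/acquire ordering (acquire rather than consume, as the portable choice in practice):

#include <stdatomic.h>

static int payload;                     /* plain data, written before publishing */
static _Atomic(int *) shared_p = NULL;  /* the "interlocked" pointer */

/* Producer (thread A): everything written before the release store... */
void produce(void) {
    payload = 42;
    atomic_store_explicit(&shared_p, &payload, memory_order_release);
}

/* Consumer (thread B): ...is visible after an acquire load that sees it. */
int *consume(void) {
    int *p = atomic_load_explicit(&shared_p, memory_order_acquire);
    /* If p is non-NULL here, reading *p is guaranteed to yield 42. */
    return p;
}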

After the data is written to the queue for the other thread to process, the first thread can't safely (unlocked) modify or read any of the memory in the object, until it knows (through some synchronization mechanism) that the other thread is no longer accessing the data.

Both the C++ and C standards are intended to formalize the existing behavior of compilers and CPUs. So although there are fewer formal guarantees with pthreads and C99, the behavior would be expected to be consistent.

From your example

Thread A

int index = findFreeIndex(thread_A_buffer)

This line is problematic, as it doesn't show any synchronization primitives. If the mechanism for findFreeIndex only relies on memory which is written by thread A, then this will work. If thread B, or any other thread modifies the memory, then a further lock is necessary.

lock(m)
p = &(thread_A_buffer[index])
signal()
unlock(m)

This is covered by:

15 An evaluation A is dependency-ordered before an evaluation B if

— A performs a release operation on an atomic object M, and, in another thread, B performs a consume operation on M and reads a value written by any side effect in the release sequence headed by A, or

— for some evaluation X, A is dependency-ordered before X and X carries a dependency to B.

and

18 An evaluation A happens before an evaluation B if A is sequenced before B or A inter-thread happens before B.

The operations before the synchronization "happen before" the synchronization, and are guaranteed to be visible after the synchronization in the other thread.

The lock (an acquire) and unlock (a release) ensure a strict ordering: the writes in thread A are completed, and then visible to B.

thread_A_buffer[index] = 42;      // happens before 

At this moment the memory thread_A_buffer is visible to A, but reading it from B would cause undefined behavior.

lock(m);  // acquire

Although the acquire is needed to pair with the release, I can't see any effect resulting from the acquire itself here.

p = &thread_A_buffer[index];
unlock(m);

All of A's preceding writes are now visible to B (due to their synchronization on m).

thread_A_buffer[index] = 42;  << This happens before and ...
p = &thread_A_buffer[index];  << carries a dependency into p
unlock(m);

Everything done in A is now visible to B because

An evaluation A inter-thread happens before an evaluation B if A synchronizes with B, A is dependency-ordered before B, or, for some evaluation X

— A synchronizes with X and X is sequenced before B,

— A is sequenced before X and X inter-thread happens before B, or

— A inter-thread happens before X and X inter-thread happens before B.

pointer p_a = null
do:
  // sleep
  lock(m)
  p_a = p
  unlock(m)
while (p_a != null)

This code is completely safe: the value read into p_a is ordered with respect to the other thread's accesses, and becomes null after the synchronized write in thread B, which lets the loop exit. Again, the lock/unlock pair causes a strict ordering, which ensures that the read value will be the written value.

All of thread B's interactions are within a lock, so again completely safe.

If A were to modify the object after it had given the object to B, then it would not work, unless there was some further synchronization.

Cuthbertson answered 21/1, 2018 at 10:19 Comment(1)
If you can modify your answer to include a copy of your quote from n1570, replacing A/B/X with the appropriate parts of my pseudocode, so as to make it (convincingly) clear which sequence of "happens before" relations leads to Thread-B seeing "thread_A_buffer[index] = 42" (basically, "applying" the quote to the pseudocode), I'll accept your answer.Deckert

I also use Nim for personal projects. Nim has a garbage collector, and you must avoid it for your threads' memory-handling routines by using its C invocation:

https://nim-lang.org/docs/backends.html

On Linux, malloc internally uses mutexes to avoid corruption from concurrent access. I think Windows does the same. You can use the memory freely, but you need to avoid double 'free's and access collisions (you must guarantee that only one thread is using the memory and may 'free' it).

You mentioned that you use a custom heap implementation. This heap is probably accessible from other threads, but you must check that this library will not do a 'free' on a pointer that is being handled by another thread. If this custom heap implementation is Nim's garbage collector, then you must avoid it at all costs, write a custom C implementation of the memory accesses, and use Nim's C invocation for malloc and free.

Arbogast answered 14/12, 2017 at 12:34 Comment(2)
Nim's thread-local heap is not automatically garbage-collected, AFAIK. It is only garbage-collected when you use "refs", but since you can use "ptr" too, and those are not tracked (otherwise they would be the same as refs), you can "safely" allocate memory manually in the thread-local heap, and know that it won't be deleted unless you delete it yourself (which, obviously, must happen in the same thread).Deckert
According to this page: nim-lang.org/faq.html, "Each thread has its own GC". Nim's GC collects the thread-local heap, so as I said before, it's better to manage thread memory that is shared using its own handling procedures.Arbogast

If you want platform independence, then you need to combine several OS and C concepts:

  1. Use mutex lock and unlock for synchronization.
  2. Use a condition variable to signal the other thread.
  3. Use heap memory with a reference count: increment it when handing the pointer to another thread, and decrement it once that thread's access is over. This avoids an invalid free. (A sketch follows this list.)
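
A minimal sketch of point 3 (illustrative only; note that with the question's thread-local heaps, the final release would have to send the block back to the owning thread rather than call free directly):

#include <stdatomic.h>
#include <stdlib.h>

typedef struct {
    atomic_int refcount;
    char       data[];      /* flexible array member for the payload */
} shared_block;

shared_block *block_new(size_t size) {
    shared_block *b = malloc(sizeof *b + size);
    if (b) atomic_init(&b->refcount, 1);
    return b;
}

void block_retain(shared_block *b) {
    atomic_fetch_add_explicit(&b->refcount, 1, memory_order_relaxed);
}

void block_release(shared_block *b) {
    /* acq_rel so the last user's accesses happen before the free */
    if (atomic_fetch_sub_explicit(&b->refcount, 1, memory_order_acq_rel) == 1)
        free(b);
}
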
Norvell answered 14/12, 2017 at 14:38 Comment(0)
