Stack allocation for C++ green threads

Asked 1/2, 2016 at 4:59 Answered 23/11, 2023 at 5:34

c++ memory-management green-threads boost-coroutine

I'm doing some research in C++ green threads, mostly boost::coroutine2 and similar POSIX functions like makecontext()/swapcontext(), and planning to implement a C++ green thread library on top of boost::coroutine2. Both require the user code to allocate a stack for every new function/coroutine.

My target platform is x64/Linux. I want my green thread library to be suitable for general use, so the stacks should expand as required (a reasonable upper limit is fine, e.g. 10MB). It would be great if the stacks could also shrink when too much memory is unused (not required). I haven't figured out an appropriate algorithm to allocate stacks.

After some googling, I figured out a few options myself:

1. Use the split stacks implemented by the compiler (gcc -fsplit-stack). However, split stacks have a performance overhead, and Go has already moved away from them for performance reasons.
2. Allocate a large chunk of memory with mmap() and hope the kernel is smart enough to leave the physical memory unallocated until the stacks are actually accessed. In this case, we are at the mercy of the kernel.
3. Reserve a large memory space with mmap(PROT_NONE) and set up a SIGSEGV signal handler. In the signal handler, when the SIGSEGV is caused by a stack access (the accessed address is inside the reserved memory space), allocate the needed memory with mmap(PROT_READ | PROT_WRITE). The problem with this approach is that mmap() isn't async-signal-safe, so it cannot be called inside a signal handler. It can still be implemented, but it is very tricky: create another thread during program startup for memory allocation, and use pipe() + read()/write() to send memory allocation requests from the signal handler to that thread.
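
For illustration, option 3 could be sketched roughly as follows. This is only a sketch under assumptions, not a tested design: it calls mprotect() directly from the handler (POSIX does not list it as async-signal-safe, so this relies on Linux behaviour), it handles a single reserved region, and the names on_segv, demo_grow_on_fault and the g_* globals are made up for the example. The handler runs on an alternate signal stack, since a fault on the green thread's own stack leaves no room for the signal frame there.

```cpp
#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// Hypothetical globals describing one reserved region (illustrative names).
static char* g_base = nullptr;
static std::size_t g_reserve = 0;
static std::size_t g_page = 0;
static volatile sig_atomic_t g_faults = 0;

static void on_segv(int, siginfo_t* info, void*) {
    char* addr = static_cast<char*>(info->si_addr);
    if (addr < g_base || addr >= g_base + g_reserve) {
        // Not our region: restore the default action; the faulting
        // instruction is retried on return and terminates the process.
        signal(SIGSEGV, SIG_DFL);
        return;
    }
    // Make the page containing the faulting address accessible.
    std::size_t off = static_cast<std::size_t>(addr - g_base);
    char* page = g_base + off / g_page * g_page;
    if (mprotect(page, g_page, PROT_READ | PROT_WRITE) != 0) _exit(1);
    g_faults = g_faults + 1;
}

// Returns the number of faults handled, or -1 on setup failure.
int demo_grow_on_fault() {
    static char alt_stack[64 * 1024];  // alternate stack for the handler
    stack_t ss{};
    ss.ss_sp = alt_stack;
    ss.ss_size = sizeof alt_stack;
    if (sigaltstack(&ss, nullptr) != 0) return -1;

    struct sigaction sa{};
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO | SA_ONSTACK;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, nullptr) != 0) return -1;

    g_faults = 0;
    g_page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    g_reserve = 256 * g_page;
    void* p = mmap(nullptr, g_reserve, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    g_base = static_cast<char*>(p);

    // First store faults, the handler commits the page, the store is retried.
    volatile char* top = g_base + g_reserve - 1;
    *top = 42;
    int result = (*top == 42) ? static_cast<int>(g_faults) : -1;

    munmap(p, g_reserve);
    g_base = nullptr;
    g_reserve = 0;
    return result;
}
```

Calling demo_grow_on_fault() touches the top byte of the reserved region once, so a single fault is expected to be handled.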

A few more questions about option 3:

I'm not sure about the performance overhead of this approach. How well or badly does the kernel/CPU perform when the memory space is extremely fragmented due to thousands of mmap() calls?
Is this approach correct if the unallocated memory is accessed in kernel space, e.g. when read() is called?

Are there any other (better) options for stack allocation for green threads? How are green thread stacks allocated in other implementations, e.g. Go/Java?

Hedveh answered 1/2, 2016 at 4:59 Comment(15)

While mmap is not async safe according to POSIX, it is actually async safe in Linux and pretty much every reasonable, usable UNIX variant out there. – Brutus 1/2, 2016 at 5:5

@ChrisDodd Can I ask why mmap can be good for green threads? I'm not an expert but I wanted to know. – Pachyderm 1/2, 2016 at 5:7

@ChrisDodd I haven't found any man page/link on this. Could you please give me a link? – Hedveh 1/2, 2016 at 5:38

FWIW, I do not know if Linux shared memory fits your needs. But I used that for a high-performance backend to a google maps application a few years ago, and the performance was very good. – Finegan 1/2, 2016 at 8:37

@ErikAlapää Green threads all run on the same kernel thread, so they share the same address space. en.wikipedia.org/wiki/Green_threads – Hedveh 1/2, 2016 at 9:35

@Hedveh Yeah right, then you obviously will not need shared mem. – Finegan 1/2, 2016 at 10:17

why not use stackless coroutines? – Jinny 15/2, 2016 at 6:38

@Jinny Stackless coroutines can only support very limited suspend/resume operations, but my goal is to develop a library suitable for general use, I don't want such limitations on my library. – Hedveh 15/2, 2016 at 7:9

I believe in most cases where you'd want to use coroutines/green threads (i.e. asynchronous I/O) the stackless suspend/resume is sufficient. What's the use case for your library? – Jinny 15/2, 2016 at 7:55

@Jinny Actually ATM it's exactly asynchronous I/O that I want to get rid of, mainly because asynchronous code tends to be more difficult to understand (I know it's more efficient). As I said in the main question, I'm doing some research, so use cases aren't that important. – Hedveh 15/2, 2016 at 8:3

hmm but ASIO implemented via coroutines (i.e. boost::asio) is kinda the way to go imho. anyway - why not base it on boost::asio, it has already a spawn() method which allocates the stack. also you could use one of the C libraries - libtask / libconcurrency - both allocate stack for you. – Jinny 15/2, 2016 at 8:10

How well do C++ exceptions propagate under libtask/libconcurrency? I did some brief googling and couldn't even find out whether it's supported. – Hedveh 15/2, 2016 at 8:25

I don't think it's explicitly supported but as long as the compiler doesn't do some weird OS-specific stuff, all data necessary for correct exception handling (stack unwinding) should be on the stack anyway. One special case might be Windows and SEH as I've heard exception handling is partly implemented using that mechanism, so potentially as part of context switch you need to switch SEH handlers chain as well? – Jinny 15/2, 2016 at 9:21

Let us continue this discussion in chat. – Hedveh 15/2, 2016 at 10:58

If a stack area can contain a pointer to something inside the stack area, how will you relocate the stack area if it grows? If you aren't relocating it, then you'll have to allocate the max needed when you start. – Kinsler 15/2, 2016 at 14:45

The way that glibc allocates stacks for normal C programs is to mmap a region with the following mmap flag designed just for this purpose:

   MAP_GROWSDOWN
          Used for stacks.  Indicates to the kernel virtual memory  system
          that the mapping should extend downward in memory.

For compatibility, you should probably use MAP_STACK too. Then you don't have to write the SIGSEGV handler yourself, and the stack grows automatically. The bounds can be set as described in the question "What does 'ulimit -s unlimited' do?".

If you want a bounded stack size, which is normally what people do for signal handlers if they want to call sigaltstack(2), just issue an ordinary mmap call.

The Linux kernel always maps the physical pages that back virtual pages lazily: it catches the page fault raised when a page is first accessed and only then assigns physical memory (perhaps not in real-time kernels but certainly in all other configurations). You can use the /proc/<pid>/pagemap interface (or this tool I wrote https://github.com/dwks/pagemap) to verify this if you are interested.

Fagaceous answered 26/3, 2016 at 23:32 Comment(1)

I've heard that MAP_GROWSDOWN can cause some sorts of problems. Is this still true today? – Cartierbresson 26/6, 2019 at 23:29

Why mmap? When you allocate with new (or malloc) the memory is untouched and definitely not mapped.

#include <cstddef>
const std::size_t STACK_SIZE = 10 * 1024 * 1024; // size_t so STACK_SIZE * numThreads cannot overflow int
char *p = new char[STACK_SIZE * numThreads];

p now has enough memory for the threads you want. When you need the memory, start accessing p + STACK_SIZE * i

Highly answered 15/2, 2016 at 13:42 Comment(2)

That's definitely not guaranteed to be unmapped, or initialized to any value in particular. Using the GNU libc malloc, that large of an allocation will ultimately call mmap() anyway. – Marlborough 17/2, 2016 at 23:12

You will want to place a guard page at the end of the stack, and you can't do that with malloc() or new. You need mmap() – Toledo 23/11, 2023 at 5:36

Others have mentioned MAP_GROWSDOWN. MAP_GROWSDOWN can conflict with other mapped memory regions (see this correspondence between a Red Hat employee with lots of Linux kernel familiarity and some prominent Linux kernel maintainers). It is also hard to know how far your mapping will be allowed to grow. For example, if mmap() chooses to place the first page of your stack just three 4 KiB pages above the next mapping, your stack can only grow by three memory pages. Additionally, if you need to munmap() the stack, you will have to somehow determine how large the stack has grown in order to unmap it.

You can instead rely on the fact that any OS worth its salt (including all major OSs) will not actually map physical pages when you call mmap(), unless you tell mmap() to pre-fault the pages (e.g. by using the MAP_LOCKED flag). The OS won't map physical memory until a mapped page is touched, meaning a load or store is made to an address in that page. At that point, the CPU will trigger a page fault and call into the OS. The OS will see that you mapped the page with mmap() and then create the mapping to physical memory. Thus, you can mmap() an 8MB stack for a green thread and if the green thread only ever uses 500 bytes of the stack, only one page of memory will be used.
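
If you want to see this for yourself, here is a rough sketch that uses mincore() to count resident pages before and after touching the top page of an 8 MiB mapping. The names resident_pages, LazyResult and demo_lazy_mapping are made up for the example. On a typical Linux kernel I would expect zero resident pages before the touch and a small number afterwards (possibly more than one if transparent huge pages are enabled), but the exact counts depend on the kernel and its configuration.

```cpp
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <vector>

// Count how many pages of [addr, addr + len) are currently resident.
// Returns -1 if mincore() fails.
static long resident_pages(void* addr, std::size_t len, std::size_t page) {
    std::vector<unsigned char> vec((len + page - 1) / page);
    if (mincore(addr, len, vec.data()) != 0) return -1;
    long count = 0;
    for (unsigned char v : vec) count += v & 1;  // bit 0 = resident
    return count;
}

struct LazyResult {
    long before;  // resident pages right after mmap()
    long after;   // resident pages after touching one byte
    long total;   // pages in the whole mapping
};

LazyResult demo_lazy_mapping() {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t size = 8u * 1024 * 1024;
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return LazyResult{-1, -1, -1};

    long before = resident_pages(p, size, page);
    // Stacks grow downwards, so the first page used is the top one.
    static_cast<volatile char*>(p)[size - 1] = 1;
    long after = resident_pages(p, size, page);

    munmap(p, size);
    return LazyResult{before, after, static_cast<long>(size / page)};
}
```

Comparing before, after and total shows how much of the mapping is actually backed by physical memory at each point.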

One more thing: you probably want a guard page at the end of your stack to prevent a program from overgrowing the stack into another mapped region of memory (instead, it should segfault because it overflowed the stack). The guard page won't have any physical memory associated with it, so it won't actually take up any physical memory. You can achieve this using a combination of mmap() and mprotect() like so:

#include <stdlib.h>   // abort()
#include <unistd.h>
#include <sys/mman.h>

#define STACK_SIZE (1024 * 1024 * 8) // 8 MiB
#define PAGE_SIZE 4096 // 4 KiB; use sysconf(_SC_PAGESIZE) to be portable

// PROT_NONE tells the OS that the process isn't allowed to read or write
// to these pages. We'll make them read/writeable in a sec.
//
// MAP_STACK does nothing on Linux, but some BSDs will kill your program
// if the stack pointer points to a region that hasn't been mapped
// using this flag. macOS does not have this flag, so you may have
// to ifndef define it to 0 for cross-compatibility with macOS.
void *stack = mmap(0, STACK_SIZE + PAGE_SIZE, PROT_NONE,
                   MAP_PRIVATE | MAP_ANON | MAP_STACK, -1, 0);

// mmap() signals failure by returning MAP_FAILED, not -1 or NULL.
if (stack == MAP_FAILED) {
    abort();
}

// Make the mapping readable/writeable, except a guard page at the bottom
// of the mapping (remember stacks grow downwards; you'll want to set the
// thread's stack pointer to the TOP of this mapping). If a thread tries
// to use more than STACK_SIZE of the stack, the program will segfault.
// The cast is needed because arithmetic on void * is not standard C or C++.
if (mprotect((char *)stack + PAGE_SIZE, STACK_SIZE,
             PROT_READ | PROT_WRITE) != 0) {
    abort();
}
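
To show how such a stack could be handed to a green thread, here is a minimal sketch that wraps the allocation in a helper and runs one function on it via makecontext()/swapcontext(). GreenStack, stack_alloc, stack_free, green_entry and demo_run_on_guarded_stack are hypothetical names for this example, and it assumes the usable size is a multiple of the page size. A real library would likely use boost::context or hand-written switching instead of ucontext.

```cpp
#include <sys/mman.h>
#include <ucontext.h>
#include <unistd.h>
#include <cstddef>

// Hypothetical helper type: one mapping = guard page + usable stack.
struct GreenStack {
    void* base;          // lowest address of the mapping (the guard page)
    std::size_t usable;  // bytes of read/write stack above the guard
    std::size_t guard;   // size of the guard page
};

// usable is assumed to be a multiple of the page size.
bool stack_alloc(GreenStack& s, std::size_t usable) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    void* p = mmap(nullptr, usable + page, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (p == MAP_FAILED) return false;
    // Everything above the lowest page becomes usable; that page stays
    // PROT_NONE and acts as the guard.
    if (mprotect(static_cast<char*>(p) + page, usable,
                 PROT_READ | PROT_WRITE) != 0) {
        munmap(p, usable + page);
        return false;
    }
    s.base = p;
    s.usable = usable;
    s.guard = page;
    return true;
}

void stack_free(GreenStack& s) {
    if (s.base) munmap(s.base, s.usable + s.guard);
    s.base = nullptr;
    s.usable = 0;
    s.guard = 0;
}

static ucontext_t g_main_ctx, g_green_ctx;
static int g_result = 0;

// Runs on the green stack; returning resumes uc_link (the main context).
static void green_entry() {
    volatile char scratch[64 * 1024];  // use a chunk of the green stack
    scratch[0] = 1;
    g_result = 41 + scratch[0];
}

int demo_run_on_guarded_stack() {
    GreenStack s{};
    if (!stack_alloc(s, 8u * 1024 * 1024)) return -1;

    getcontext(&g_green_ctx);
    // ss_sp is the LOWEST usable address; makecontext() derives the top.
    g_green_ctx.uc_stack.ss_sp = static_cast<char*>(s.base) + s.guard;
    g_green_ctx.uc_stack.ss_size = s.usable;
    g_green_ctx.uc_link = &g_main_ctx;
    makecontext(&g_green_ctx, green_entry, 0);
    swapcontext(&g_main_ctx, &g_green_ctx);

    stack_free(s);
    return g_result;
}
```

Keeping the guard size inside the struct means stack_free() can unmap the whole region without having to rediscover how large it is.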

Depending on the situation, you may want to use mlock2() with MLOCK_ONFAULT (or mlockall() with MCL_ONFAULT) to tell the OS to not swap the stack's pages and instead keep them in physical memory, but be careful with this as you may start getting mmap() failures if the cumulative size of all the thread stacks exceeds the size of physical memory.

As a bonus, here is that same thing, but for the Windows API using VirtualAlloc() and VirtualProtect():

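#include <stdlib.h>  // abort()
#include <windows.h> // Declares VirtualAlloc() and VirtualProtect()

#define STACK_SIZE (1024 * 1024 * 8) // 8 MiB
#define PAGE_SIZE 4096 // 4 KiB

// PAGE_NOACCESS is the counterpart of PROT_NONE. PAGE_GUARD is only a
// modifier and cannot be passed on its own.
void *stack = VirtualAlloc(0, STACK_SIZE + PAGE_SIZE,
                           MEM_RESERVE | MEM_COMMIT,
                           PAGE_NOACCESS);

if (!stack) {
    abort();
}

// VirtualProtect() needs a valid pointer for the old protection value,
// even if you ignore the result afterwards.
DWORD old_protect;
if (!VirtualProtect((char *)stack + PAGE_SIZE, STACK_SIZE,
                    PAGE_READWRITE, &old_protect)) {
    abort();
}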

To briefly answer your question about performance overhead, I wouldn't worry about address space fragmentation on a 64-bit CPU (unless you are mapping hundreds of terabytes of memory). Thousands of mmap() calls are nothing. The virtual-to-physical memory mapping can be arbitrary; your OS will take care of physical memory fragmentation (it can even move pages of physical memory around without you knowing it).
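
If you'd rather check this on your own machine than take my word for it, a rough sketch like the one below maps a batch of stacks and reports how many mappings succeeded. demo_many_stacks is a made-up name and the count and size are arbitrary. The thing to watch on Linux is the per-process mapping limit (vm.max_map_count) rather than address space, since each stack with a guard page typically costs two mappings.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <vector>

// Map `count` stacks of `size` bytes each, then unmap them all.
// Returns how many mmap() calls succeeded.
int demo_many_stacks(int count, std::size_t size) {
    std::vector<void*> stacks;
    stacks.reserve(static_cast<std::size_t>(count));
    for (int i = 0; i < count; ++i) {
        void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
        if (p == MAP_FAILED) break;  // stop at the first failure
        stacks.push_back(p);
    }
    const int mapped = static_cast<int>(stacks.size());
    // No page was touched, so no physical memory was ever committed.
    for (void* p : stacks) munmap(p, size);
    return mapped;
}
```

For example, demo_many_stacks(1000, 256 * 1024) reserves about 250 MiB of address space without touching any of it.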

Toledo answered 23/11, 2023 at 5:34 Comment(0)
