Analyzing memory mapping of a process with pmap. [stack]

I'm trying to understand how the stack works in Linux. I read the AMD64 ABI sections about the stack and process initialization, but it is not clear how the stack should be mapped. Here is the relevant quote (3.4.1):

Stack State

This section describes the machine state that exec (BA_OS) creates for new processes.

and

It is unspecified whether the data and stack segments are initially mapped with execute permissions or not. Applications which need to execute code on the stack or data segments should take proper precautions, e.g., by calling mprotect().

So I can deduce from the quotes above that the stack is mapped (it is unspecified if PROT_EXEC is used to create the mapping). Also the mapping is created by exec.

The question is whether the "main thread"'s stack uses MAP_GROWSDOWN | MAP_STACK mapping or maybe even via sbrk?

Looking at pmap -x <pid> the stack is marked with [stack] as

00007ffc04c78000     132      12      12 rw---   [ stack ]

Creating a mapping as

mmap(NULL, 4096,
     PROT_READ | PROT_WRITE,
     MAP_ANONYMOUS | MAP_PRIVATE | MAP_STACK,
     -1, 0);

simply creates an anonymous mapping, which is shown in pmap -x <pid> as

00007fb6e42fa000       4       0       0 rw---   [ anon ]
Farewell answered 4/7, 2019 at 19:21 Comment(1)
The function sbrk() is for changing the data segment size. It has nothing to do with the stack. - Enunciate

I can deduce from the quotes above that the stack is mapped

That literally just means that memory is allocated, i.e. that there is a logical mapping from those virtual addresses to physical pages. We know this because you can use a push or call instruction in _start without making a system call from user-space to allocate a stack.

In fact the x86-64 System V ABI specifies that argc, argv, and envp are on the stack at process startup.
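As a rough illustration of "the stack is already mapped at startup", here is a minimal sketch that looks up the [stack] line in /proc/self/maps and checks whether a given address falls inside it. The helper names main_stack_range and in_main_stack are made up for this example, and it assumes a Linux /proc filesystem is mounted.

```c
#include <stdio.h>
#include <string.h>

/* Find the line tagged "[stack]" in /proc/self/maps and report its range.
 * Returns 1 on success, -1 if the file or the line is unavailable. */
int main_stack_range(unsigned long *start, unsigned long *end)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;

    char line[512];
    int found = -1;
    while (fgets(line, sizeof line, f)) {
        /* Each line starts with "start-end" in hexadecimal. */
        if (strstr(line, "[stack]") &&
            sscanf(line, "%lx-%lx", start, end) == 2) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}

/* 1 if p lies inside the [stack] mapping, 0 if not, -1 if unknown. */
int in_main_stack(const void *p)
{
    unsigned long start, end;
    if (main_stack_range(&start, &end) != 1)
        return -1;
    unsigned long a = (unsigned long)p;
    return a >= start && a < end;
}
```

Calling in_main_stack(&some_local) from the main thread, or in_main_stack(argv[0]) from main, should report 1 on a typical Linux system, without the program ever having asked for a stack itself.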

The question is whether the "main thread"'s stack uses MAP_GROWSDOWN | MAP_STACK mapping or maybe even via sbrk?

The ELF binary loader sets the VM_GROWSDOWN flag for the main thread's stack, but not the MAP_STACK flag. This is code inside the kernel, and it does not go through the regular mmap system call interface.

(Nothing in user-space uses mmap(MAP_GROWSDOWN), so normally the main thread's stack is the only mapping that has the VM_GROWSDOWN flag inside the kernel.)

The internal flag used for the virtual memory area (VMA) of the stack is called VM_GROWSDOWN. In case you're interested, here are all the flags that are used for the main thread's stack: VM_GROWSDOWN, VM_READ, VM_WRITE, VM_MAYREAD, VM_MAYWRITE, and VM_MAYEXEC. In addition, if the ELF binary is specified to have an executable stack (e.g., by compiling with gcc -z execstack), the VM_EXEC flag is also used. Note that on architectures that support stacks that grow upwards, VM_GROWSUP is used instead of VM_GROWSDOWN if the kernel was compiled with CONFIG_STACK_GROWSUP defined. The line of code where these flags are specified in the Linux kernel can be found here.

/proc/.../maps and pmap don't use the VM_GROWSDOWN flag; they rely on address comparison instead. Therefore they may not be able to determine the exact range of the virtual address space that the main thread's stack occupies (see an example). On the other hand, /proc/.../smaps looks for the VM_GROWSDOWN flag and marks each memory region that has this flag as gd. (Although it seems to ignore VM_GROWSUP.)
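To check the gd marker from inside a process, something like the following sketch could be used. It scans /proc/self/smaps for the region containing an address and looks for gd on that region's VmFlags line. The function names vmflags_has and vma_grows_down are hypothetical, and the code assumes a kernel whose smaps output includes a VmFlags line.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if a "VmFlags: ..." line contains `flag` as a whole token. */
int vmflags_has(const char *vmflags_line, const char *flag)
{
    const char *p = strchr(vmflags_line, ':');
    if (!p)
        return 0;
    p++;
    size_t n = strlen(flag);
    while (*p) {
        while (*p == ' ' || *p == '\t' || *p == '\n')
            p++;                      /* skip separators */
        const char *tok = p;
        while (*p && *p != ' ' && *p != '\t' && *p != '\n')
            p++;                      /* walk to the end of the token */
        if ((size_t)(p - tok) == n && strncmp(tok, flag, n) == 0)
            return 1;
    }
    return 0;
}

/* 1 if the region containing addr is marked gd, 0 if it is not,
 * -1 if smaps or the VmFlags line is unavailable. */
int vma_grows_down(const void *addr)
{
    FILE *f = fopen("/proc/self/smaps", "r");
    if (!f)
        return -1;

    unsigned long a = (unsigned long)addr, start, end;
    char line[512];
    int inside = 0, result = -1;
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
            /* Header line of a region: "start-end perms ...". */
            inside = (a >= start && a < end);
        } else if (inside && strncmp(line, "VmFlags:", 8) == 0) {
            result = vmflags_has(line, "gd");
            break;
        }
    }
    fclose(f);
    return result;
}
```

Passing the address of a local variable from the main thread would be expected to give 1, while an address inside an ordinary anonymous mmap region would be expected to give 0.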

All of these tools/files ignore the MAP_STACK flag. In fact, the whole Linux kernel ignores this flag (which is probably why the program loader doesn't set it). User-space only passes it for future-proofing in case the kernel does want to start treating thread-stack allocations specially.


sbrk makes no sense here; the stack isn't contiguous with the "break", and the brk heap grows upward toward the stack anyway. Linux puts the stack very near the top of virtual address space. So of course the primary stack couldn't be allocated with (the in-kernel equivalent of) sbrk.


And no, nothing uses MAP_GROWSDOWN, not even secondary thread stacks, because it can't in general be used safely.

The mmap(2) man page which says MAP_GROWSDOWN is "used for stacks" is laughably out of date and misleading. See How to mmap the stack for the clone() system call on linux?. Ulrich Drepper explained in 2008 that code using MAP_GROWSDOWN is typically broken, and proposed removing the flag from Linux mmap and from glibc headers. (This obviously didn't happen, but pthreads hasn't used it since well before then, if ever.)


MAP_GROWSDOWN sets the VM_GROWSDOWN flag for the mapping inside the kernel. The main thread also uses that flag to enable the growth mechanism, so a thread stack may be able to grow the same way the main stack does: arbitrarily far (up to ulimit -s?) if the stack pointer is below the page fault location. (Linux does not require "stack probes" to touch every page of a large multi-page stack array or alloca.)
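The growth of the main stack can be observed with a small experiment. The sketch below reserves a 256 KiB local array, writes only to its lowest-addressed byte, and compares the size of the [stack] mapping before and after. The helper names and the 256 KiB figure are arbitrary choices for this example. It assumes it runs on the main thread with a stack limit comfortably above 256 KiB. Compilers that enable stack-clash protection may insert their own probes, which does not change the result.

```c
#include <stdio.h>
#include <string.h>

/* Size in KiB of the mapping tagged [stack] in /proc/self/maps,
 * or -1 if it cannot be determined. */
long stack_mapping_kib(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;

    char line[512];
    long kib = -1;
    unsigned long start, end;
    while (fgets(line, sizeof line, f)) {
        if (strstr(line, "[stack]") &&
            sscanf(line, "%lx-%lx", &start, &end) == 2) {
            kib = (long)((end - start) / 1024);
            break;
        }
    }
    fclose(f);
    return kib;
}

/* Reserve a large local array and touch only its lowest address,
 * which is the byte farthest from the caller's frame. */
int touch_deep(void)
{
    volatile char buf[256 * 1024];
    buf[0] = 42;
    return buf[0];
}
```

Printing stack_mapping_kib() before and after a call to touch_deep() from main should show the mapping getting larger on a system where the initial stack mapping was smaller than 256 KiB, such as the 132 KiB one in the question.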

(Thread stacks are fully allocated up front; only normal lazy allocation of physical pages to back that virtual allocation avoids wasting space for thread stacks.)

A MAP_GROWSDOWN mapping can also grow the way the mmap man page describes: access to the "guard page" below the lowest mapped page will also trigger growth, even if that's below the bottom of the red zone.

But the main thread's stack has magic you don't get with mmap(MAP_GROWSDOWN). It reserves the growth space up to ulimit -s to prevent random choice of mmap address from creating a roadblock to stack growth. That magic is only available to the in-kernel program-loader which maps the main thread's stack during execve(), making it safe from an mmap(NULL, ...) randomly blocking future stack growth.
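The ulimit -s value mentioned here corresponds to the soft RLIMIT_STACK resource limit, which a process can read with getrlimit. A minimal sketch follows; the function name stack_soft_limit_bytes and its convention of returning 0 for "unlimited" are choices made for this example only.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Soft stack limit in bytes.
 * Returns 0 if the limit is unlimited and -1 if getrlimit fails. */
long long stack_soft_limit_bytes(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return 0;
    return (long long)rl.rlim_cur;
}
```

The shell's ulimit -s prints the same limit in units of 1024 bytes, so a result of 8388608 here would correspond to ulimit -s printing 8192.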

mmap(MAP_FIXED) could still create a roadblock for the main stack, but if you use MAP_FIXED you're 100% responsible for not breaking anything. (Unlimited stack cannot grow beyond the initial 132KiB if MAP_FIXED involved?) MAP_FIXED will replace existing mappings and reservations, but anything else will treat the main thread's stack-growth space as reserved. (I think that's true; it's worth trying with MAP_FIXED_NOREPLACE or just a non-NULL hint address.)


pthread_create doesn't use MAP_GROWSDOWN for thread stacks, and neither should anyone else. Generally, do not use it. Linux pthreads by default allocates the full size for a thread stack. This costs virtual address space but (until it's actually touched) not physical pages.

The inconsistent results in comments on Why is MAP_GROWSDOWN mapping does not grow? (some people finding it works, some finding it still segfaults when touching the return value and the page below) sound like https://bugs.centos.org/view.php?id=4767 - MAP_GROWSDOWN may even be buggy outside of the way the standard main-stack VM_GROWSDOWN mapping is used.

Subtile answered 7/7, 2019 at 9:9 Comment(11)
If we examine the source code of pmap, we can see that it relies on information from the /proc/pid/stat file to determine whether a given address is in a stack or not. An object of type proc_t with this information is initialized here. One of the fields is start_stack, which holds the base address of the program stack (the stack of the main thread). pmap also uses... - Pandanus
.../proc/pid/smaps to get the size of the stack. Then at this line of code, it compares a given address to the base stack address, and, if it falls in the stack region, it is marked as [ stack ]. Therefore, it doesn't matter whether a memory region is allocated using the MAP_GROWSDOWN flag or not. The tool just doesn't check this flag. In addition, when passing the MAP_GROWSDOWN flag to mmap, it gets translated to an internal flag called VM_GROWSDOWN... - Pandanus
It is this flag that is checked by the page fault handler to determine whether a page fault has occurred in a stack and that it should grow the stack (if possible). All threads' stacks are marked with VM_GROWSDOWN. I think that had pmap used MAP_GROWSDOWN instead of comparing with the program stack address, it would have printed [ stack ] for any region allocated with this flag, even if the region is not used as a thread's stack. - Pandanus
So the mmap(2) man page is accurate when it says that MAP_GROWSDOWN is "used for stacks," because it is. - Pandanus
@HadiBrais: Are you sure pthread_create actually does use MAP_GROWSDOWN for allocating this region for thread stacks, though? From what I've read it doesn't do that, but I don't have time to double-check right now. - Subtile
pthread_create doesn't use that flag (see the mmap call to allocate the stack). Only the stack of the main thread is created with that flag. But then I'm not sure how the page fault handler can tell whether an address belongs to a stack region in that case. BTW, all memory regions allocated with that flag are marked with a flag called gd (stands for grows down) in cat /proc/pid/smaps. You can easily tell using smaps. - Pandanus
@HadiBrais: thread stacks don't grow, they're fully allocated with ulimit -s space. Only normal mmap lazy-allocation mechanisms avoid wasting physical pages and page-table space on them. - Subtile
I made some edits, if you don't mind. You may also want to remove or edit the sentence that says "laughably out of date." - Pandanus
@HadiBrais: Why wouldn't I say "laughably out of date"? glibc maintainers (specifically Drepper) were explaining how broken MAP_GROWSDOWN was back in 2008 (lwn.net/Articles/294001), and proposing deprecating then removing it from the Linux kernel and/or glibc. If you're thinking about the connection with in-kernel VM_GROWSDOWN for the main stack, remember that I'm talking specifically about the flag for the user-space mmap system call there, which isn't involved at all in setting up that mapping. - Subtile
Fair point. Is there any further discussion on that from other kernel developers? I found only one response to Drepper here, which disagrees with removing MAP_GROWSDOWN. - Pandanus
@HadiBrais: Just finished an update to my answer. Thanks for the edit, that discussion of in-kernel VM flags was useful. I put back in the fact(?) that nothing in user-space normally uses MAP_GROWSDOWN because I think that's an important point. I didn't look for any further discussion, but yes clearly MAP_GROWSDOWN didn't actually get removed from glibc headers. - Subtile
