C/C++ maximum stack size of program on mainstream OSes
Asked Answered
K

7

154

I want to do DFS on a 100 X 100 array. (Say elements of array represents graph nodes) So assuming worst case, depth of recursive function calls can go upto 10000 with each call taking upto say 20 bytes. So is it feasible means is there a possibility of stackoverflow?

What is the maximum size of stack in C/C++?

Please specify for gcc for both
1) cygwin on Windows
2) Unix

What are the general limits?

Kacey answered 1/12, 2009 at 12:42 Comment(3)
You do know that you can implement depth-first search without recursion, right?Massasoit
No i dont know, please explain.Kacey
I've made a small example of DFS without recursion in my answerLemmy
L
138

In Visual Studio the default stack size is 1 MB i think, so with a recursion depth of 10,000 each stack frame can be at most ~100 bytes which should be sufficient for a DFS algorithm.

Most compilers including Visual Studio let you specify the stack size. On some (all?) linux flavours the stack size isn't part of the executable but an environment variable in the OS. You can then check the stack size with ulimit -s and set it to a new value with for example ulimit -s 16384.

Here's a link with default stack sizes for gcc.

DFS without recursion:

std::stack<Node> dfs;
dfs.push(start);
do {
    Node top = dfs.top();
    if (top is what we are looking for) {
       break;
    }
    dfs.pop();
    for (outgoing nodes from top) {
        dfs.push(outgoing node);
    }
} while (!dfs.empty())
Lemmy answered 1/12, 2009 at 12:47 Comment(5)
And just for reference, a BFS is the same except that you use a FIFO instead of a stack.Vinny
Yes, or in STL-lingo use a std::deque with pop_front/push_backLemmy
your DFS with stack results will be different from recursion version. In some cases it doesn't matter, but in others(for ex. in topological sort) you will get wrong resultsMagnien
Yes, the default limit for VS is indeed 1MB. More info and the way to set a different value can be found in Microsoft documentation: msdn.microsoft.com/en-us/library/tdkhxaks(v=vs.140).aspxWormhole
I prefer to use an explicit stack data structure for such algorithms, rather than recursion, so that 1. don't depend on size of system stack, 2. can change the algorithm to use a different data structure e.g. queue or priority queue without throwing out all the code.Rachael
C
76

Stacks for threads are often smaller. You can change the default at link time, or change at run time also. For reference, some defaults are:

  • glibc i386, x86_64: 7.4 MB
  • Tru64 5.1: 5.2 MB
  • Cygwin: 1.8 MB
  • Solaris 7..10: 1 MB
  • MacOS X 10.5: 460 KB
  • AIX 5: 98 KB
  • OpenBSD 4.0: 64 KB
  • HP-UX 11: 16 KB
Castaneda answered 1/12, 2009 at 13:1 Comment(4)
Determined empirically by Bruno Haible lists.gnu.org/archive/html/bug-coreutils/2009-10/msg00262.htmlCastaneda
I've pasted Bruno Haible's code and quotes in my new answer here, and shown how to manually test your own system using his code: https://mcmap.net/q/18201/-c-c-maximum-stack-size-of-program-on-mainstream-oses.Gouty
Linux's default ulimit -s is 8 MiB; if you measure less than that, it means some of the stack was already used up holding env vars and command-line args at process startup. More than half a meg seems excessive, though; perhaps measurement error if the compiler used more space than expected for an alloca(128). (@GabrielStaples). You can check /proc/<PID>/smaps at the point when it segfaults to see the 8MiB region.Llovera
for me ulimit gives output: 8176 However when I measure it within program it gives me approximately 8167 which is correct on macOS Monterey 12.6.1Sequential
S
20

Platform-dependent, toolchain-dependent, ulimit-dependent, parameter-dependent.... It is not at all specified, and there are many static and dynamic properties that can influence it.

Steeve answered 1/12, 2009 at 12:44 Comment(3)
There are no "general limits". On Windows, with default VC++ linker options and default CreateThread behaviour, typically something around 1 MiB per thread. On Linux, with an unlimited user, I believe that there is typically no limit (the stack can just grow downwards to occupy almost the entire address space). Basically, if you have to ask, you shouldn't be using the stack.Steeve
On embedded systems, you might have 4k or less. In which case you do have to ask even when it's reasonable to be using the stack. The answer is usually a Gallic shrug.Vinny
Ah true, also often the case in kernel mode.Steeve
G
12

(Added 26 Sept. 2020)

On 24 Oct. 2009, as @pixelbeat first pointed out here, Bruno Haible empirically discovered the following default thread stack sizes for several systems. He said that in a multithreaded program, "the default thread stack size is" as follows. I added in the "Actual" size column because @Peter.Cordes indicates in his comments below my answer, however, that the odd tested numbers shown below do not include all of the thread stack, since some of it was used in initialization. If I run ulimit -s to see "the maximum stack size" that my Linux computer is configured for, it outputs 8192 kB, which is exactly 8 MB, not the odd 7.4 MB listed in the table below for my x86-64 computer with the gcc compiler and glibc. So, you can probably add a little to the numbers in the table below to get the actual full stack size for a given thread.

Note also that the below "Tested" column units are all in MB and KB (base 1000 numbers), NOT MiB and KiB (base 1024 numbers). I've proven this to myself by verifying the 7.4 MB case.

Thread stack sizes

System and std library  Tested  Actual
----------------------  ------  ------
- glibc i386, x86_64    7.4 MB  8 MiB (8192 KiB, as shown by `ulimit -s`)
- Tru64 5.1             5.2 MB  ?
- Cygwin                1.8 MB  ?
- Solaris 7..10           1 MB  ?
- MacOS X 10.5          460 KB  ?
- AIX 5                  98 KB  ?
- OpenBSD 4.0            64 KB  ?
- HP-UX 11               16 KB  ?

Bruno Haible also stated that:

32 KB is more than you can safely allocate on the stack in a multithreaded program

And he said:

And the default stack size for sigaltstack, SIGSTKSZ, is

  • only 16 KB on some platforms: IRIX, OSF/1, Haiku.
  • only 8 KB on some platforms: glibc, NetBSD, OpenBSD, HP-UX, Solaris.
  • only 4 KB on some platforms: AIX.

Bruno

He wrote the following simple Linux C program to empirically determine the above values. You can run it on your system today to quickly see what your maximum thread stack size is, or you can run it online on GDBOnline here: https://onlinegdb.com/rkO9JnaHD.

Explanation: It simply creates a single new thread, so as to check the thread stack size and NOT the program stack size, in case they differ, then it has that thread repeatedly allocate 128 bytes of memory on the stack (NOT the heap), using the Linux alloca() call, after which it writes a 0 to the first byte of this new memory block, and then it prints out how many total bytes it has allocated. It repeats this process, allocating 128 more bytes on the stack each time, until the program crashes with a Segmentation fault (core dumped) error. The last value printed is the estimated maximum thread stack size allowed for your system.

Important note: alloca() allocates on the stack: even though this looks like dynamic memory allocation onto the heap, similar to a malloc() call, alloca() does NOT dynamically allocate onto the heap. Rather, alloca() is a specialized Linux function to "pseudo-dynamically" (I'm not sure what I'd call this, so that's the term I chose) allocate directly onto the stack as though it was statically-allocated memory. Stack memory used and returned by alloca() is scoped at the function-level, and is therefore "automatically freed when the function that called alloca() returns to its caller." That's why its static scope isn't exited and memory allocated by alloca() is NOT freed each time a for loop iteration is completed and the end of the for loop scope is reached. See man 3 alloca for details. Here's the pertinent quote (emphasis added):

DESCRIPTION
The alloca() function allocates size bytes of space in the stack frame of the caller. This temporary space is automatically freed when the function that called alloca() returns to its caller.

RETURN VALUE
The alloca() function returns a pointer to the beginning of the allocated space. If the allocation causes stack overflow, program behavior is undefined.

Here is Bruno Haible's program from 24 Oct. 2009, copied directly from the GNU mailing list here:

Again, you can run it live online here.

// By Bruno Haible
// 24 Oct. 2009
// Source: https://lists.gnu.org/archive/html/bug-coreutils/2009-10/msg00262.html

// =============== Program for determining the default thread stack size =========

#include <alloca.h>
#include <pthread.h>
#include <stdio.h>
void* threadfunc (void*p) {
  int n = 0;
  for (;;) {
    printf("Allocated %d bytes\n", n);
    fflush(stdout);
    n += 128;
    *((volatile char *) alloca(128)) = 0;
  }
}

int main()
{
  pthread_t thread;
  pthread_create(&thread, NULL, threadfunc, NULL);
  for (;;) {}
}

When I run it on GDBOnline using the link above, I get the exact same results each time I run it, as both a C and a C++17 program. It takes about 10 seconds or so to run. Here are the last several lines of the output:

Allocated 7449856 bytes
Allocated 7449984 bytes
Allocated 7450112 bytes
Allocated 7450240 bytes
Allocated 7450368 bytes
Allocated 7450496 bytes
Allocated 7450624 bytes
Allocated 7450752 bytes
Allocated 7450880 bytes
Segmentation fault (core dumped)

So, the thread stack size is ~7.45 MB for this system, as Bruno mentioned above (7.4 MB).

I've made a few changes to the program, mostly just for clarity, but also for efficiency, and a bit for learning.

Summary of my changes:

  1. [learning] I passed in BYTES_TO_ALLOCATE_EACH_LOOP as an argument to the threadfunc() just for practice passing in and using generic void* arguments in C.

    1. Note: This is also the required function prototype, as required by the pthread_create() function, for the callback function (threadfunc() in my case) passed to pthread_create(). See: https://www.man7.org/linux/man-pages/man3/pthread_create.3.html.
  2. [efficiency] I made the main thread sleep instead of wastefully spinning.

  3. [clarity] I added more-verbose variable names, such as BYTES_TO_ALLOCATE_EACH_LOOP and bytes_allocated.

  4. [clarity] I changed this:

     *((volatile char *) alloca(128)) = 0;
    

    to this:

     volatile uint8_t * byte_buff = 
             (volatile uint8_t *)alloca(BYTES_TO_ALLOCATE_EACH_LOOP);
     byte_buff[0] = 0;
    

Here is my modified test program, which does exactly the same thing as Bruno's, and even has the same results:

You can run it online here, or download it from my repo here. If you choose to run it locally from my repo, here's the build and run commands I used for testing:

  1. Build and run it as a C program:

     mkdir -p bin && \
     gcc -Wall -Werror -g3 -O3 -std=c11 -pthread -o bin/tmp \
     onlinegdb--empirically_determine_max_thread_stack_size_GS_version.c && \
     time bin/tmp
    
  2. Build and run it as a C++ program:

     mkdir -p bin && \
     g++ -Wall -Werror -g3 -O3 -std=c++17 -pthread -o bin/tmp \
     onlinegdb--empirically_determine_max_thread_stack_size_GS_version.c && \
     time bin/tmp
    

It takes < 0.5 seconds to run locally on a fast computer with a thread stack size of ~7.4 MB.

Here's the program:

// =============== Program for determining the default thread stack size =========

// Modified by Gabriel Staples, 26 Sept. 2020

// Originally by Bruno Haible
// 24 Oct. 2009
// Source: https://lists.gnu.org/archive/html/bug-coreutils/2009-10/msg00262.html

#include <alloca.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h> // sleep

/// Thread function to repeatedly allocate memory within a thread, printing
/// the total memory allocated each time, until the program crashes. The last
/// value printed before the crash indicates how big a thread's stack size is.
///
/// Note: passing in a `uint32_t` as a `void *` type here is for practice,
/// to learn how to pass in ANY type to a func by using a `void *` parameter.
/// This is also the required function prototype, as required by the
/// `pthread_create()` function, for the callback function (this function)
/// passed to `pthread_create()`. See:
/// https://www.man7.org/linux/man-pages/man3/pthread_create.3.html
void* threadfunc(void* bytes_to_allocate_each_loop)
{
    const uint32_t BYTES_TO_ALLOCATE_EACH_LOOP =
            *(uint32_t*)bytes_to_allocate_each_loop;

    uint32_t bytes_allocated = 0;
    while (true)
    {
        printf("bytes_allocated = %u\n", bytes_allocated);
        fflush(stdout);
        // NB: it appears that you don't necessarily need `volatile` here,
        // but you DO definitely need to actually use (ex: write to) the
        // memory allocated by `alloca()`, as we do below, or else the
        // `alloca()` call does seem to get optimized out on some systems,
        // making this whole program just run infinitely forever without
        // ever hitting the expected segmentation fault.
        volatile uint8_t * byte_buff =
                (volatile uint8_t *)alloca(BYTES_TO_ALLOCATE_EACH_LOOP);
        byte_buff[0] = 0;
        bytes_allocated += BYTES_TO_ALLOCATE_EACH_LOOP;
    }
}

int main()
{
    const uint32_t BYTES_TO_ALLOCATE_EACH_LOOP = 128;

    pthread_t thread;
    pthread_create(&thread, NULL, threadfunc,
                   (void*)(&BYTES_TO_ALLOCATE_EACH_LOOP));
    while (true)
    {
        const unsigned int SLEEP_SEC = 10000;
        sleep(SLEEP_SEC);
    }

    return 0;
}

Sample output (same results as Bruno Haible's original program):

bytes_allocated = 7450240                                                                                                                                                        
bytes_allocated = 7450368                                                                                                                                                        
bytes_allocated = 7450496                                                                                                                                                        
bytes_allocated = 7450624                                                                                                                                                        
bytes_allocated = 7450752                                                                                                                                                        
bytes_allocated = 7450880                                                                                                                                                        
Segmentation fault (core dumped) 
Gouty answered 27/9, 2020 at 6:26 Comment(12)
Thanks for contributing this answer. I ran Bruno's code as well on Windows and was a bit confused as to what exactly the output represented (Windows did not give me a seg fault error, just closed the console). Windows with MinGW required #include <malloc.h> instead of #include <alloca.h> so that might be worth mentioning. Also, can we not just catch the seg fault and spit that number out?Mythos
@Skewjo, thanks for the info. to help Window users. How do you catch a seg fault in C? (I'm not saying one can't--I just don't know how). Also, what do you mean by that number when you say and spit that number out? Wouldn't that number just be the last printed value + 128? If so, what extra value does this add (meaning: why should we do this?) to catch the seg fault and then spit out the last printed number + 128 instead of just looking at the last printed number, as is already done?Gouty
program stack size actually means the main thread stack size, right? I do notice a difference of stack size between main thread, and newly created work thread (11169280 vs 9315968), any idea why there is such a difference?Cello
@baye, program stack size actually means the main thread stack size, right? That is my assumption too, so I think so, but I'm not 100% sure. As for the difference in stack size between the main thread and the created (worker) thread, I don't know why there is a difference, but it doesn't surprise me that a difference exists. Consider posting your code and asking about this in a new question, if a question for this doesn't already exist. If you do so, post a link to your question here so we can see it.Gouty
Linux's default ulimit -s is 8 MiB; that sets the main thread's stack size growth limit. Env vars and cmdline args take a bit of space at the top of that. Other threads started via pthreads don't dynamically grow their stacks, it's a fixed-size allocation (using the same default 8MiB). You can check /proc/<PID>/smaps at the point when it segfaults to see the 8MiB region. Note that it segfaults inside the printf / write call, and that stdio code uses a significant amount of stack space which you aren't measuring.Llovera
But more significantly, the compiled asm for x86-64 does sub rsp, 0x90 inside the loop (for gcc -O3) and then does ptr = (rsp+15) & -16, strangely not taking advantage of the fact that the alloca size is a mulitple of 16 and doesn't need extra work to align. godbolt.org/z/3Wrhs1Yav (see line 17 of the asm output, sub rsp, 144). So your measurement is off by a factor of 9/8, with a constant offset from stack depth of printf.Llovera
When I tested in GDB so I could look at smaps after the segfault, the thread stack was an 8204 kiB allocation. The calculated stack size inside the program was 7451008 bytes, and 7451008 / (128/144) / 1024 is ~8186 kiB, and printf stack depth probably explains the rest.Llovera
@PeterCordes, if you add your own answer to talk about this, I'll upvote it. If you add your own Q&A, drop a link here and I'll take a look. I kind of want to add some of your info to my readme here too (eRCaGuy_hello_world/c/onlinegdb--empirically_determine_max_thread_stack_size.md), but I'd like to see you detail what you did and how you did it in a tutorial if possible, as it's hard to follow what you're talking about without your breadth of knowledge.Gouty
I compiled this program with gcc -O3 -pthread, ran it inside GDB, and looked at /proc/$(pidof a.out)/smaps for it once it crashed. The only mapping of about the right size (7+ MiB) had Size: 8204 kB. From looking at the asm I could see that it allocated 0x90 bytes of stack space for every alloca(0x80) loop iteration. For more about stack growth for the main-thread's stack, see How is Stack memory allocated when using 'push' or 'sub' x86 instructions? (where I showed an example of /proc/<PID>/smaps output.)Llovera
Also How does the linux kernel avoid the stack overwriting the text (instructions)? is an overview of a few related stack-growth things. And BTW, I could have used GDB to find out the current stack pointer in the thread that crashed, print /x $rsp or layout reg, and look for that address in /proc/<PID>/smaps, but with only one mapping being the right size I didn't bother.Llovera
For thread stacks, you can also look at strace ./a.out |& less output, and search for MAP_STACK: mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = ... and a following mprotect with smaller size. So glibc allocated 8196 kiB of space for the new thread, and made exactly 8192 kiB of it read+write, leaving a guard page unreadable / unwriteable.Llovera
BTW, another way to verify that GCC's alloca is using more space than you asked for would be to bump up the size to 4096 for example. Or to 4096-8 it turns out gets GCC to only allocate 4096, not 4096+16. (godbolt.org/z/8T4xfbEdq). With 16 or 8 bytes wasted per allocation, the total fraction uncounted will be much smaller.Llovera
C
9

I just ran out of stack at work, it was a database and it was running some threads, basically the previous developer had thrown a big array on the stack, and the stack was low anyway. The software was compiled using Microsoft Visual Studio 2015.

Even though the thread had run out of stack, it silently failed and continued on, it only stack overflowed when it came to access the contents of the data on the stack.

The best advice I can give is to not declare arrays on the stack - especially in complex applications and particularly in threads, instead use heap. That's what it's there for ;)

Also just keep in mind it may not fail immediately when declaring the stack, but only on access. My guess is that the compiler declares stack under windows "optimistically", i.e. it will assume that the stack has been declared and is sufficiently sized until it comes to use it and then finds out that the stack isn't there.

Different operating systems may have different stack declaration policies.

Cowherd answered 18/8, 2017 at 18:13 Comment(0)
S
8

Yes, there is a possibility of stack overflow. The C and C++ standard do not dictate things like stack depth, those are generally an environmental issue.

Most decent development environments and/or operating systems will let you tailor the stack size of a process, either at link or load time.

You should specify which OS and development environment you're using for more targeted assistance.

For example, under Ubuntu Karmic Koala, the default for gcc is 2M reserved and 4K committed but this can be changed when you link the program. Use the --stack option of ld to do that.

Stipitate answered 1/12, 2009 at 12:44 Comment(5)
@lex: there are no general limits. It depends on a lot of parameters.Sitin
@paxdiablo: WHat is meaning of reserved and commited?Kacey
Reserved is how much address space to allocate, committed is how much to attach backing storage to. In other words, reserving address space does not mean the memory will be there when you need it. If you never use more than 4K stack, you're not wasting real memory for the other 1.6M. If you want to guarantee there'll be enough stack, reserved and committed should be identical.Stipitate
@Stipitate 2M - 4k isn't 1.6M. Just saying. (got me confused the first 3 times I read your comment)Chuu
@griffin, kudos for the first person to catch that in 3+ years. I of course meant "rest of it" - I'll avoid an actual figure so as not to make another possible mistake :-)Stipitate
F
5

I am not sure what you mean by doing a depth first search on a rectangular array, but I assume you know what you are doing.

If the stack limit is a problem you should be able to convert your recursive solution into an iterative solution that pushes intermediate values onto a stack which is allocated from the heap.

Formative answered 1/12, 2009 at 13:9 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.