How to determine maximum stack usage in embedded system with gcc?

I'm writing the startup code for an embedded system -- the code that loads the initial stack pointer before jumping to the main() function -- and I need to tell it how many bytes of stack my application will use (or some larger, conservative estimate).

I've been told the gcc compiler now has a -fstack-usage option and a -fcallgraph-info option that can somehow be used to statically calculate the exact "Maximum Stack Usage" for me ("Compile-time stack requirements analysis with GCC" by Botcazou, Comar, and Hainque).

Nigel Jones says that recursion is a really bad idea in embedded systems ("Computing your stack size" 2009), so I've been careful not to make any mutually recursive functions in this code.

Also, I make sure that none of my interrupt handlers ever re-enable interrupts until their final return-from-interrupt instruction, so I don't need to worry about re-entrant interrupt handlers.

Without recursion or re-entrant interrupt handlers, it should be possible to statically determine the maximum stack usage. (And so most of the answers to How to determine maximum stack usage? do not apply.) My understanding is that I (or preferably, some bit of code on my PC that is automatically run every time I rebuild the executable) first find the maximum stack depth for each interrupt handler when it's not interrupted by a higher-priority interrupt, and the maximum stack depth of the main() function when it is not interrupted. Then I add them all up to find the total (worst-case) maximum stack depth. That occurs (in my embedded system) when the main() background task is at its maximum depth and is interrupted by the lowest-priority interrupt, that interrupt is at its maximum depth when it is interrupted by the next-lowest-priority interrupt, and so on.
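
For example, with purely hypothetical numbers just to illustrate the summation (these are not measured values):

    main() worst case (uninterrupted):        220 bytes
    + lowest-priority ISR worst case:          96 bytes
    + next-lowest-priority ISR worst case:     64 bytes
    + highest-priority ISR worst case:         48 bytes
    -------------------------------------------------
    total worst-case stack depth:             428 bytes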

I'm using YAGARTO with gcc 4.6.0 to compile code for the LM3S1968 ARM Cortex-M3.

So how do I use the -fstack-usage option and -fcallgraph-info option with gcc to calculate the maximum stack depth? Or is there some better approach to determine maximum stack usage?

(See How to determine maximum stack usage in embedded system? for almost the same question targeted to the Keil compiler.)

Shaky answered 17/6, 2011 at 14:51 Comment(6)
Also note that any use of function pointers will only be captured by dynamic analysis.Trafalgar
To get caller and callee info, you can use -fdump-ipa-cgraph. The callgraph option you refer to does not exist, afaik.Forebear
Re-enabling interrupts just before your return from an ISR won't prevent re-entrancy on a system that allows nested interrupts. The only way to accomplish that is to disable interrupts within an ISR and re-enable them from main code.Meave
@iheanyi: Huh? I'm very careful not to re-enable interrupts before the return-from-interrupt instruction (RETI), so I don't understand your comment. #52887092 ; infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0460d/… ; etc. imply that there are several other ways to prevent re-entrancy that don't involve re-enabling interrupts in the main code. A particular interrupt handler will never be re-entered (nested) if that handler never re-enables interrupts until the final RETI, right?Shaky
David, rereading your question, I see I'm wrong. Assuming you disable interrupts upon entering the ISR, reenabling before the final RETI would ensure you can't corrupt any data touched in the ISR. Whether or not you reenter the ISR at that point doesn't matter.Meave
@Trafalgar It may be possible to deal with function pointers. It depends on the use. For instance, if it is simply a device driver with 'device A/B' or a UI with language 'A/B', you can account for the worst-case pointer target. You would need a very convoluted function pointer hierarchy (possible with some pseudo-functional programming) not to be able to do this with analysis.Resh

GCC docs:

-fstack-usage

Makes the compiler output stack usage information for the program, on a per-function basis. The filename for the dump is made by appending .su to the auxname. auxname is generated from the name of the output file, if explicitly specified and it is not an executable, otherwise it is the basename of the source file. An entry is made up of three fields:

  • The name of the function.
  • A number of bytes.
  • One or more qualifiers: static, dynamic, bounded.

The qualifier static means that the function manipulates the stack statically: a fixed number of bytes are allocated for the frame on function entry and released on function exit; no stack adjustments are otherwise made in the function. The second field is this fixed number of bytes.

The qualifier dynamic means that the function manipulates the stack dynamically: in addition to the static allocation described above, stack adjustments are made in the body of the function, for example to push/pop arguments around function calls. If the qualifier bounded is also present, the amount of these adjustments is bounded at compile-time and the second field is an upper bound of the total amount of stack used by the function. If it is not present, the amount of these adjustments is not bounded at compile-time and the second field only represents the bounded part.
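
For illustration only, an entry in the generated .su file looks roughly like this (the function names and byte counts here are made up, and the exact layout varies a bit between GCC versions):

    main.c:42:5:main          24     static
    main.c:87:6:uart_isr      16     static
    main.c:103:6:log_event   104     dynamic,bounded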

I can't find any references to -fcallgraph-info.

You could potentially create the information you need from -fstack-usage and -fdump-tree-optimized.

For each leaf in the -fdump-tree-optimized call graph, walk up through its parents and sum their stack-size numbers from -fstack-usage (keeping in mind that this number lies for any function marked "dynamic" but not "bounded"). The largest of these sums should be your maximum stack usage.
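
As a very rough sketch of that summing step in C (assuming the call graph and per-function frame sizes have already been parsed out of the dump files into memory; the parsing itself is omitted, and all names and numbers below are invented):

#include <stdio.h>

#define MAX_CALLEES 8

/* One node per function: its own frame size (from the .su file)
   and the functions it can call (from the call-graph dump). */
struct func {
    const char  *name;
    unsigned     frame_bytes;            /* from -fstack-usage  */
    struct func *callees[MAX_CALLEES];   /* from the call graph */
    unsigned     n_callees;
};

/* Worst-case stack depth starting at f: its own frame plus the
   deepest of its callees. Assumes no recursion (no cycles). */
static unsigned max_depth(const struct func *f)
{
    unsigned deepest = 0;
    for (unsigned i = 0; i < f->n_callees; i++) {
        unsigned d = max_depth(f->callees[i]);
        if (d > deepest)
            deepest = d;
    }
    return f->frame_bytes + deepest;
}

int main(void)
{
    /* Tiny hand-built example graph: app_main -> {spi_write, log_msg} */
    struct func spi_write = { "spi_write", 24, { 0 }, 0 };
    struct func log_msg   = { "log_msg",   96, { 0 }, 0 };
    struct func app_main  = { "app_main",  32, { &spi_write, &log_msg }, 2 };

    printf("worst-case depth from app_main: %u bytes\n", max_depth(&app_main));
    return 0;
}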

Glassblowing answered 17/6, 2011 at 19:44 Comment(2)
W.r.t. I can't find any references to -fcallgraph-info, Compile-time stack requirements analysis with GCC describes the option. The AdaCore ARM ELF GNAT Community gcc compiler does support -fcallgraph-info.Troytroyer
The Developer Options page in gcc documentation now describes both -fstack-usage and -fcallgraph-info.Inclination

Just in case no one comes up with a better answer, I'll post what I had in the comment to your other question, even though I have no experience using these options and tools:

GCC 4.6 adds the -fstack-usage option which gives the stack usage statistics on a function-by-function basis.

If you combine this information with a call graph produced by cflow or a similar tool you can get the kind of stack depth analysis you're looking for (a script could probably be written pretty easily to do this). Have the script read the stack-usage info and load up a map of function names with the stack used by the function. Then have the script walk the cflow graph (which can be an easy-to-parse text tree), adding up the stack usage associated with each line for each branch in the call graph.
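
For reference, a cflow invocation and its output might look roughly like this (the function names are invented, and the exact output format depends on the cflow version and options, but the indented tree is easy to walk with a script):

    $ cflow main.c uart.c
    main() <int main (void) at main.c:40>:
        uart_init() <void uart_init (void) at uart.c:12>
        process_sample() <void process_sample (int) at main.c:20>:
            filter() <int filter (int) at main.c:5>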

So, it looks like this can be done with GCC, but you might have to cobble together the right set of tools.

Selfridge answered 17/6, 2011 at 21:44 Comment(1)
If anyone reading this has enough rep for a 1-character edit, there's an https version of the http URL in this answer.Inclination

Quite late, but for anyone looking at this: the answers given, which involve combining the output of -fstack-usage with call-graph tools like cflow, can end up being wildly incorrect for any dynamic allocation, even a bounded one, because there is no information about where in the function that dynamic stack allocation occurs. It's therefore not possible to know to which calls the value should be applied. As a contrived example, if the (simplified) -fstack-usage output is:

main        1024     dynamic,bounded
functionA    512     static
functionB     16     static

and a very simple call tree is:

main
    functionA
    functionB

The naive approach of combining these may result in main -> functionA being chosen as the path of maximum stack usage, at 1536 bytes. But if the largest dynamic stack allocation in main() is to push a large argument, such as a record, directly on the stack when calling functionB() in a conditional block (I already said this was contrived), then really main -> functionB is the path of maximum stack usage, at 1040 bytes. Depending on the existing software design, and especially on more restricted targets that pass everything on the stack, cumulative errors may quickly lead you toward entirely wrong paths claiming significantly overstated maximum stack sizes.

Also, depending on your classification of "reentrant" when talking about interrupts, it's possible to miss some stack allocations entirely. For instance, many Coldfire processors' level 7 interrupt is edge-sensitive and therefore ignores the interrupt disable mask, so if a semaphore is used to leave the ISR early, you may not consider it reentrant, but the initial stack allocation will still happen before the semaphore is checked.

In short, you have to be extremely careful about using this approach.

Shonda answered 27/9, 2014 at 21:10 Comment(8)
BTW in embedded work I would generally avoid passing significantly large data on the stack.Kannada
@Kannada Sure, and my original wording isn't great in this respect - any dynamic allocation on the stack introduces ambiguity, whether it's passed by value or not. My (now dated) experience with embedded systems has been with statically allocated heaps and stack usage, but if you're wanting to perform any sort of auditing of an existing codebase, it's probably worth noting.Shonda
This is extremely contrived. For this to snowball into extreme error would require a design that has deep nesting of alternating function calls where one condition allocates very little memory and calls a function with a "large" stack usage and an opposing condition that allocates a lot of memory and calls a function with "small" stack usage. Not only that, but the design ensures somehow that this isn't random but that the path from the "small" usage function never leads to a function that uses a "large" amount of stack memory.Meave
The likelihood that someone would accidentally design this in their code, but not be knowledgeable enough to account for it when computing stack usage, is pretty tiny. Pretty much the larger your application becomes, the smaller the delta between the naive approach and the "exact" computation. Also, the more intractable it becomes to compute the "exact" value. I can't see how one could go through the sort of exacting design that would lead to the naive approach being wildly wrong. . .and somehow not be capable of accounting for how the design would impact stack usage.Meave
@Meave You missed the point, and took two comments to do it. That being, you need to understand the limitations of any approach before being able to assess its usefulness in your circumstances. The "contrived" example (my own words, five years ago) is clearly illustrative, not exhaustive. Whether the issues are relevant or significant given the code base, libraries, and use case (eg. academic vs. regulatory) is best decided by those engineers. That said, I certainly wouldn't want to use static analysis tools written by engineers who handwave away gaps in coverage in the ways you describe.Shonda
Your example gives the wrong impression. Engineers should always strive to understand the limitations of their tools. Contrived examples like these give the impression that the naive approach is bad when it's really just suboptimal and gives good results for most cases. I'm not sure what point you're making about static analysis. The goal of stack usage analysis is quite different than that of static analysis - which itself is also intractable for exact solutions. If you're claiming that one shouldn't use static analysis because nobody can provide an exact solution - that's shortsighted too.Meave
@Meave Impressions are subjective. Engineers are able to determine if information is relevant to them. This example is nothing but a concrete illustration of caveats with this approach, without overly restricting itself to context in the OP. It's not my place to assume whether any reader's dev practices preclude this from being an issue. If you took issue with this being an answer rather than comment, I would probably agree in hindsight, but I haven't found the specific nature of your comments constructive or relevant given the actual intent. There really isn't anything further to discuss.Shonda
From my point of view reading the answer, you do make assumptions about the reader's practices and prescribe a course of action while strongly implying doing otherwise would be disastrous. My comment explains why I disagree.Meave

I ended up writing a Python script to implement τεκ's answer. It's too much code to post here, but it can be found on GitHub.

Narcolepsy answered 20/5, 2016 at 17:7 Comment(0)

I am not familiar with the -fstack-usage and -fcallgraph-info options. However, it is always possible to figure out actual stack usage as follows (a minimal sketch of steps 1 and 3 follows the list):

  1. Allocate adequate stack space (for this experiment), and initialize it to something easily identifiable. I like 0xee.
  2. Run the application and test all its internal paths (by all combinations of input and parameters). Let it run for more than "long enough".
  3. Examine the stack area and see how much of the stack was used.
  4. Make that the stack size, plus 10% or 20% to tolerate software updates and rare conditions.
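
Here is that sketch, for a downward-growing stack; the symbols __stack_start__ and __stack_end__ are placeholders for whatever your linker script actually provides:

#include <stddef.h>
#include <stdint.h>

/* Placeholder symbols marking the stack region; the real names depend on
   your linker script. The stack is assumed to grow downward from
   __stack_end__ (the initial stack pointer) toward __stack_start__. */
extern uint8_t __stack_start__[];   /* lowest address of the stack region */
extern uint8_t __stack_end__[];     /* highest address (initial SP value) */

#define STACK_FILL_BYTE 0xEE

/* Step 1: paint the unused stack early in startup, before main(). */
void stack_paint(void)
{
    /* Only paint below the current stack pointer, so we don't clobber
       the frames already in use by the startup code itself. */
    uint8_t *sp = (uint8_t *)__builtin_frame_address(0);
    for (uint8_t *p = __stack_start__; p < sp; p++)
        *p = STACK_FILL_BYTE;
}

/* Step 3: scan up from the bottom for the first byte that no longer holds
   the fill pattern; everything above that point has been used at some time. */
size_t stack_high_water(void)
{
    const uint8_t *p = __stack_start__;
    while (p < __stack_end__ && *p == STACK_FILL_BYTE)
        p++;
    return (size_t)(__stack_end__ - p);   /* bytes of stack actually used */
}
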
Karlotte answered 17/6, 2011 at 18:38 Comment(12)
The OP is specifically looking for a method that calculates a worst-case based on static analysis, not by runtime experimentation.Selfridge
@Michael: Well, the OP wrote Or is there some better approach to determine maximum stack usage? Certainly the age-old method will yield a useful answer. It might be easier than analyzing code paths.Karlotte
The problem with runtime analysis is that it only works if there is a very small number of possible paths. Most nontrivial programs have billions if not infinitely many.Collinear
@Gilles Billions in a program written for an embedded system with 64K of RAM?Swats
@Swats Billions of states can be encoded in just 32 bits of RAM.Collinear
@Swats StackOverflow is intended to be a general resource, and it's preferable for answers to be widely useful. Embedded systems with more than 64K certainly exist; I'm currently working on one with an RTOS and over a dozen threads.Kannada
@Gilles you might be able to encode that many states in 32-bits but how does that translate to testing all the internal paths of your application? One can encode the largest discovered prime number in a relatively small number of bits by comparison to the computing time necessary to calculate it. I'll give you credit for snark - but it makes no sense at all.Meave
This may be the "age-old" method but I would not trust it that much. Typically you'd take the results from this and apply some fudge factor, like doubling the amount of memory. The fact is, on anything beyond a trivial application, the amount of run-time you'd need for this method to give you a good estimate of stack usage increases exponentially with the number of potential function calls.Meave
@Meave I have no idea what you're trying to say in your first comment. There was no snark in mine. Even with a tiny, tiny amount of space to encode states, there can be too many possible states to analyze them all. In pretty much any real-world program containing loops, any worst-case analysis has to go through drastic measures to equate equivalent states. I.e. what you say in your second comment.Collinear
This is kind of the best/only method since gcc doesn't have a clue about interrupts. You can't track interrupts with static analysis, period.Marthena
@Marthena You must assume that the interrupt occurs at the worst point. How are you going to ensure that your testing has the interrupt occur at the worst point? So you have two call trees. One from main() and then other trees for interrupt sources. Each call tree will have a worst case stack usage. Add the mainline+interrupts. This post's method is not functionally safe nor the best/only one.Resh
@Swats Well, technically even billions or to be precise: log10((2^8)^16), which gives something over 38. I worked on ~32KB boot-loaders and ~128KB applications. Practically the only way to make it feasible in projects with noteworthy economic constraints is to have a very strict coding guideline. The main challenges come from: * Recursive functions, especially if their nesting level is not static. * Array allocations on stack with "variable" size or calculated by macros. * Seldom also sequential if's and switch's, where the selected branch depends on the one before.Jerkwater

There are generally two approaches - static and runtime.

Static: compile your project with -fdump-rtl-expand -fstack-usage. From the *.expand dump files get the call tree, and from the .su files the stack usage of each function. Then iterate over all leaves in the call tree, calculate the stack usage along the path to each leaf, and take the highest value. Then compare that value with the available memory on the target. This works statically and doesn't require running the program. It does not work with recursive functions or with variable-length arrays. If sbrk() operates on a linker section rather than on a statically preallocated buffer, it also doesn't take dynamic (heap) allocation into account, and the heap may grow toward the stack from the other side. I have a script in my tree, stacklyze.sh, that I explored this option with.
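
For example, for a hypothetical source file foo.c (the exact name of the RTL dump file varies between GCC versions):

    arm-none-eabi-gcc -c -O2 -fstack-usage -fdump-rtl-expand foo.c
        -> foo.su            (per-function stack usage)
        -> foo.c.*r.expand   (RTL dump containing the call information)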

Runtime: check the current stack usage before and after each function call. Compile the code with -finstrument-functions. Then define two hooks in your code that roughly get the current stack usage and act on it:

static unsigned long long max_stack_usage = 0;

__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *this_fn, void *call_site) {
      // get current stack usage using some hardware facility or intrinsic function,
      // like __get_MSP() on ARM with CMSIS; assuming a descending stack, where
      // "beginning of stack" is the initial (highest) stack address
      unsigned cur_stack_usage = __GET_BEGINNING_OF_STACK() - __GET_CURRENT_STACK_POINTER();
      // use the debugger to output the current stack usage,
      // for example via semihosting on ARM
      __DEBUGGER_TRANSFER_STACK_POINTER(cur_stack_usage);
      // or you could store the max somewhere
      // then just run the program
      if (max_stack_usage < cur_stack_usage) {
            max_stack_usage = cur_stack_usage;
      }
      // you could also manually inspect with the debugger
      unsigned long long somelimit = 0x2000;
      if (cur_stack_usage > somelimit) {
           __BREAKPOINT();
      }
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *this_fn, void *call_site) {
      // well, nothing
}

Before and after each function call you can check the current stack usage. Because the hook runs before the stack is used within the function, this method does not include the current function's own stack usage - that is only one function's frame and doesn't change much - and it can be mitigated by identifying which function is being entered, getting its stack usage from -fstack-usage, and adding that to the result.

Nombril answered 5/5, 2021 at 9:8 Comment(0)

In general you need to combine call-graph information with the .su files generated by -fstack-usage to find the deepest stack usage starting from a specific function. Starting at main() or a thread entry-point will then give you the worst-case usage for that thread.

Helpfully, the work to create such a tool has been done for you, as discussed here, using a Perl script from here.

Retrusion answered 5/5, 2021 at 22:31 Comment(1)
This question was cited as a duplicate for a recent question #67398237, but it is old, and the accepted answer is arguably an incomplete or theoretical solution. I have added this information as a more complete and practical solution (one that did not exist when the question was originally asked).Retrusion
