How does a stackless language work?

I've heard of stackless languages. However I don't have any idea how such a language would be implemented. Can someone explain?

Kreis answered 19/6, 2009 at 3:22 Comment(1)
Registers - there are plenty of them on newer 64-bit platforms. First set aside a few for the architecture's calling-convention needs, and probably a few for references to external data. Any registers you have left could then be used in combination with static buffers to form a virtual stack - or you could simply limit functions to X bytes of local storage.Coraleecoralie

The modern operating systems we have (Windows, Linux) operate with what I call the "big stack model". And that model is wrong, sometimes, and motivates the need for "stackless" languages.

The "big stack model" assumes that a compiled program will allocate "stack frames" for function calls in a contiguous region of memory, using machine instructions to adjust registers containing the stack pointer (and optional stack frame pointer) very rapidly. This leads to fast function call/return, at the price of having a large, contiguous region for the stack. Because 99.99% of all programs run under these modern OSes work well with the big stack model, the compilers, loaders, and even the OS "know" about this stack area.

One common problem all such applications have is, "how big should my stack be?". With memory being dirt cheap, mostly what happens is that a large chunk is set aside for the stack (MS defaults to 1Mb), and typical application call structure never gets anywhere near to using it up. But if an application does use it all up, it dies with an illegal memory reference ("I'm sorry Dave, I can't do that"), by virtue of reaching off the end of its stack.

Most so-called "stackless" languages aren't really stackless. They just don't use the contiguous stack provided by these systems. What they do instead is allocate a stack frame from the heap on each function call. The cost per function call goes up somewhat; if functions are typically complex, or the language is interpretive, this additional cost is insignificant. (One can also determine call DAGs in the program call graph and allocate a heap segment to cover an entire DAG; that way you get both heap allocation and the speed of classic big-stack function calls for all calls inside the DAG.)

There are several reasons for using heap allocation for stack frames:

  1. If the program does deep recursion dependent on the specific problem it is solving, it is very hard to preallocate a "big stack" area in advance because the needed size isn't known. One can awkwardly arrange function calls to check whether there's enough stack left and, if not, reallocate a bigger chunk, copy the old stack, and readjust all the pointers into the stack; that's so awkward that I don't know of any implementations. Allocating stack frames means the application never has to say it's sorry until there's literally no allocatable memory left.

  2. The program forks subtasks. Each subtask requires its own stack, and therefore can't use the one "big stack" provided. So, one needs to allocate stacks for each subtask. If you have thousands of possible subtasks, you might now need thousands of "big stacks", and the memory demand suddenly gets ridiculous. Allocating stack frames solves this problem. Often the subtask "stacks" refer back to the parent tasks to implement lexical scoping; as subtasks fork, a tree of "substacks" is created called a "cactus stack".

  3. Your language has continuations. These require that the data in lexical scope visible to the current function somehow be preserved for later reuse. This can be implemented by copying parent stack frames, climbing up the cactus stack, and proceeding.

The PARLANSE programming language I implemented does 1) and 2). I'm working on 3). It is amusing to note that PARLANSE allocates stack frames from a very fast-access heap-per-thread; it costs typically 4 machine instructions. The current implementation is x86 based, and the allocated frame is placed in the x86 EBP/ESP registers much like other conventional x86 based language implementations. So it does use the hardware "contiguous stack" (including pushing and popping), just in chunks. It also generates "frame local" subroutine calls that don't switch stacks, for lots of generated utility code where the stack demand is known in advance.

Downbow answered 27/6, 2009 at 16:38 Comment(19)
All the thread implementations I've seen for Windoze and Linux have the same "big stack" assumption (mostly because the "process" is just a distinguished thread with an associated address space). So all the same issues arise. For PARLANSE, I multiplex Window's threads onto zillions of "grains", each of which uses its own allocated stack frames.Downbow
Perhaps to clarify: if you are happy with forking a number of subtasks limited by the number of threads your OS offers you (typically a few hundred), perhaps you can live with the big stack model offered by threads. If your parallel/concurrent computations have lots of interactions, you may need thousands of computational elements, and then the OS thread model fails you.Downbow
Haskell seriously doesn't use a call stack at all, not even one made up of linked lists through heap space. Think of it as a very advanced macro replacement language :)Hundredpercenter
What does DAG stand for?Noonday
DAG == Directed Acyclic Graph, en.wikipedia.org/wiki/Directed_acyclic_graphDownbow
Stacks actually don't use 1 MB of memory on Windows. They reserve 1 MB of address space. There's a second setting, commit, which specifies actual RAM used. It defaults to only 4kB (1 page). This is possible because the Virtual Memory manager understands the stack. In particular, pushing additional call frames on the stack causes a CPU fault, which is caught and used to commit more RAM. If the default 1MB of address space is reserved, this will succeed at least 256 times, but there's no guarantee that it will fail on the 257th call. The reservation is a soft limit.Loner
@MSalters: Yes, I think I understood this. Soft limit or no, there isn't any guarantee they can extend the stack beyond the soft limit. Personal experience with C programs doing deep nesting suggests they don't try very hard. I suspect the notion is that if you recurse past 1Mb, your program is probably out of control, and extending the stack to a full 64 Gb of space (if available) will not make things better. The main point is, if you want strong assurance that you can recurse deeply, a fixed stack is a bad choice. Zillions of computational grains aggravate the problem severely.Downbow
@IraBaxter: It's also an OS versus programming language thing. Windows wouldn't care at all if you use the OS stack to hold pointers to the call frames, with the actual call frames on the heap. (Even Structured Exception Handling can deal with this). For Win32, that's 2 pointers @ 4 bytes, and would give you a maximum stack depth of 128K frames, per thread. You'd probably run out of heap space first. The main downside is that you lose the convenient push EAX, and have to use EBP as the call frame register. (But that's common anyway)Loner
@MSalters: Yes. Most of this answer was about using heap-allocated records, not exactly as you suggested but in a similar way.Downbow
@IraBaxter why is there a link to your company's product at the end of the answer? It doesn't add anything to the answer and isn't used in context.Newsprint
@xaxxon: It provides explicit proof that one can implement the concepts provided in the answer.Downbow
That link doesn't provide any information that backs up the claims made in the answer. You are just advertising.Newsprint
You can say what you like; readers appear to like this answer based on votes. I designed PARLANSE specifically to solve hard parallelism programs, which required a stackless solution with a cactus stack (the non-parallel answers here don't require that). The link shows that one can implement this as a production quality tool. The fact that it is parallel and has unbounded recursion/forking is implicit proof, even if that is not obvious to you.Downbow
"One can awkwardly arrange function calls ... so awkward I don't know of any implementations that do this". I'm pretty sure Go is such an implementation.Berylberyle
@weberc: Can you elaborate a bit? For instance, do they really keep track of all pointers in the stack and adjust them? How does this work in the face of multiple threads executing in the same scope?Downbow
@IraBaxter Unfortunately I'm not sure, but I know Go has growable stacks that start out quite small (which allows us to run millions of goroutines at once). I briefly tried searching for "golang growable stacks", but I didn't see much in the way of the implementation details.Berylberyle
@weber2c: Well, "growable" doesn't tell you if the stack is contiguous. You can get that effect by allocating individual frames and linking them together. To do the contiguous growable stack you have to relocate pointer values, or somehow insist that all pointers are relative to a known, changeable base. And that's hard.Downbow
A bit off-topic, but it is interesting that the C11 standard does not even mention the word "stack". So, out of curiosity: is there any stackless computer?Coney
I can't think of any modern stackless computers, certainly not those used for workstations or general computation. Many computers in the 60s and 70s didn't have stacks, including the DEC PDP-8s; even IBM mainframes didn't have one. But just because the instruction set doesn't implement a stack doesn't mean the software can't implement one, indeed, rather trivially. All you need is a memory area to be used as a stack, and a pointer into that area. Push requires writing a value at the pointer and then incrementing it; pop requires decrementing it. So it's only a few machine instructions.Downbow

Stackless Python still has a Python stack (though it may have tail call optimization and other call frame merging tricks), but it is completely divorced from the C stack of the interpreter.

Haskell (as commonly implemented) does not have a call stack; evaluation is based on graph reduction.

Hundredpercenter answered 19/6, 2009 at 3:29 Comment(1)
Note: Haskell does have a call stack: #1016718Briefcase

There is a nice article about the language framework Parrot. Parrot does not use the stack for calling and this article explains the technique a bit.

Cathexis answered 27/6, 2009 at 20:24 Comment(1)
The link is dead. Here's the version archived by the Wayback Machine: web.archive.org/web/20100706035639/http://www.linux-mag.com/…Suspend

In the stackless environments I'm more or less familiar with (Turing machine, assembly, and Brainfuck), it's common to implement your own stack. There is nothing fundamental about having a stack built into the language.

In the most practical of these, assembly, you just choose a region of memory available to you, set the stack register to point to the bottom, then increment or decrement to implement your pushes and pops.

EDIT: I know some architectures have dedicated stacks, but they aren't necessary.

Bluish answered 19/6, 2009 at 3:24 Comment(3)
some assembly languages have push/pop and call/return built in, and the stack pointer is a dedicated cpu register. That's what I noticed when I programmed on the z80 anyway.Murielmurielle
You're right though, I suppose you could implement those using other operations if necessary.Murielmurielle
In fact, there's nothing fundamental about building most features into most languages. Wolfram's minimal Turing machine wolframscience.com/prizes/tm23/background.html is adequate to implement anything. The point of language features is to make complex computations easier to express. "Stacks" are not mentioned as features in most languages, but recursion is allowed because you can solve lots of useful problems with it. And if you have recursion, you don't want to program "stack-like" behavior by hand.Downbow

Call me ancient, but I can remember when the FORTRAN and COBOL standards did not support recursive calls, and therefore didn't require a stack. Indeed, I recall implementations for the CDC 6000 series machines where there wasn't a stack, and FORTRAN would do strange things if you tried to call a subroutine recursively.

For the record, instead of a call stack, the CDC 6000 series instruction set used the RJ instruction to call a subroutine. This saved the current PC value at the call target location and then branched to the location following it. At the end, a subroutine would perform an indirect jump through the call target location, reloading the saved PC and effectively returning to the caller.

Obviously, that does not work with recursive calls. And my recollection is that the CDC FORTRAN IV compiler would generate broken code if you did attempt recursion.

But there was also a Pascal implementation for the CDC (from ETH, I suspect) where recursion >did< work. So presumably the Pascal runtime >did< have a proper stack and >didn't< use RJ. My recollection is that it was a compiler rather than an interpreter, but I was a mere undergraduate user and it was a long time ago.

Bother answered 9/9, 2009 at 4:20 Comment(4)
Right. As long as you limit the size of the call tree, you can statically allocate all the space needed for activation records (in theory; most applications do have limited call trees, but it is almost impossible for the compiler to figure out such a layout if there is any call from A to A indirectly). But now all modern versions of FORTRAN and COBOL allow recursion, and stack-like behavior has to occur somewhere to implement it.Downbow
@IraBaxter - true ... but that's not how they did it in the old days. See my update.Bother
What they did in the "old days" was simply allocate any storage needed by the function as a static global. This gave them a place to put the return address, and any arguments, and gave them a place to put temporary values needed when evaluating complex expressions. This works as long as no subroutine is called twice in a call chain. (Yes, some really ancient call instructions put the return address at the effective address and set the PC to address plus 1. Those instructions are long gone from modern instruction sets, as it produces so-called "self modifying code".)Downbow
Real self-modifying code was the FORTRAN "computed goto" statement. The CDC RJ was just an implementation artifact of FORTRAN. It didn't have the nasty (code spaghetti!) aspects of self-modification provided that you didn't break the recursion restriction of the language. Now this would not work if the code segment was read-only, but the hardware didn't support that. (The system ran jobs one at a time, and the core / privileged parts of the OS ran on a separate processor called a PPU.)Bother

There is an easy to understand description of continuations on this article: http://www.defmacro.org/ramblings/fp.html

Continuations are something you can pass into a function in a stack-based language, but which can also be used by a language's own semantics to make it "stackless". Of course the stack is still there, but as Ira Baxter described, it's not one big contiguous segment.

Karate answered 27/6, 2009 at 20:35 Comment(0)

Say you wanted to implement stackless C. The first thing to realize is that this doesn't need a stack:

a == b

But, does this?

isequal(a, b) { return a == b; }

No. Because a smart compiler will inline calls to isequal, turning them into a == b. So, why not just inline everything? Sure, you will generate more code but if getting rid of the stack is worth it to you then this is easy with a small tradeoff.

What about recursion? No problem. A linearly recursive function like:

bang(x) { return x == 1 ? 1 : x * bang(x-1); }

Can still be inlined, because really it's just a for loop in disguise:

bang(x) {
    for(int i = x - 1; i >= 1; i--) x *= i;
    return x;
}

In theory a really smart compiler could figure that out for you. But a less-smart one could still flatten it as a goto:

ax = x;
NOTDONE:
if(ax > 1) {
    x = x*(--ax);
    goto NOTDONE;
}

There is one case where you have to make a small trade off. This can't be inlined:

fib(n) { return n <= 2 ? n : fib(n-1) + fib(n-2); }

Stackless C simply cannot do this. Are you giving up a lot? Not really. This is something normal C can't do very well either. If you don't believe me, just call fib(1000) and see what happens to your precious computer.

Covell answered 29/1, 2014 at 11:44 Comment(1)
"Stackless" PARLANSE can do this (fib) just fine (see my answer). The complaint about fib(1000) is true but irrelevant; there are plenty of recursive functions that one can implement on a decent "stackless" implementation (just as one can do this on a "stackful" implementation). [We often do recursions over a million deep, just not fib].Downbow

Please feel free to correct me if I'm wrong, but I would think that allocating memory on the heap for each function call frame would cause extreme memory thrashing. The operating system does, after all, have to manage this memory. I would think that the way to avoid this thrashing would be a cache for call frames. So if you need a cache anyway, we might as well make it contiguous in memory and call it a stack.

Inappreciable answered 12/9, 2009 at 4:41 Comment(2)
If you make it contiguous, you have to place a bound on its size. And the bound will prevent you from processing complex recursive applications of large scale. If you want unbounded recursion, either you need an unbounded contiguous stack, or someplace you have to break it into pieces.Downbow
... and yes, one should use some kind of activation record pool to help ensure locality. With that, it doesn't thrash.Downbow

© 2022 - 2024 — McMap. All rights reserved.