Why can't kernel code use a red zone?

When creating a 64-bit kernel (for the x86_64 platform), it is highly recommended to instruct the compiler not to use the 128-byte red zone that the user-space ABI provides. (For GCC, the compiler flag is -mno-red-zone.)

The kernel would not be interrupt-safe if the red zone were enabled.

But why is that?

Illgotten answered 20/8, 2014 at 13:36 Comment(1)

Related: #38042688 and #37942279 have answers explaining what the red zone is all about for code that can use it. – Bailar 26/6, 2016 at 22:24

Quoting from the AMD64 ABI:

The 128-byte area beyond the location pointed to by %rsp is considered to be reserved and shall not be modified by signal or interrupt handlers. Therefore, functions may use this area for temporary data that is not needed across function calls. In particular, leaf functions may use this area for their entire stack frame, rather than adjusting the stack pointer in the prologue and epilogue. This area is known as the red zone.

Essentially, it's an optimization - the userland compiler knows exactly how much of the Red Zone is used at any given time (in the simplest implementation, the entire size of local variables) and can adjust the %rsp accordingly before calling a sub-function.
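To see the optimization in compiler output, a small leaf function is enough. The sketch below is only illustrative (`sum_of_squares4` is a made-up name). Compiling it with `gcc -O1 -S`, once with and once without `-mno-red-zone`, and comparing the prologues should show whether a `sub ..., %rsp` appears. Whether a particular compiler version really places `scratch` in the red zone is an assumption, and options such as stack protection can change the result.

```c
#include <stdint.h>

/* Leaf function: it calls nothing, so under the System V AMD64 ABI the
 * compiler is allowed to keep `scratch` in the red zone (below %rsp)
 * instead of lowering %rsp in the prologue. */
uint64_t sum_of_squares4(const uint32_t *v)
{
    /* volatile asks the compiler for real stack slots, not registers */
    volatile uint64_t scratch[4];

    for (int i = 0; i < 4; i++)
        scratch[i] = (uint64_t)v[i] * v[i];

    return scratch[0] + scratch[1] + scratch[2] + scratch[3];
}
```

In user space nothing asynchronous may write below %rsp, so the missing stack adjustment is harmless there.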

Especially in leaf functions, this can yield some performance benefits of not having to adjust %rsp as we can be certain no unfamiliar code would run while in the function. (POSIX Signal Handlers might be seen as a form of a co-routine, but you can instruct the compiler to adjust the registers before using stack variables in a signal handler).

In the kernel space, once you start thinking about interrupts, if those interrupts make any assumptions about %rsp, they will likely be incorrect - there is no certainty with regards to the utilization of the Red Zone. So, you either assume all of it is dirty, and needlessly waste stack space (effectively running with a 128-byte guaranteed local variable in every function), or, you guarantee that the interrupts make no assumptions about %rsp - which is tricky.

In user space, context switches + 128-byte overallocation of stack handle it for you.

Isisiskenderun answered 20/8, 2014 at 15:9 Comment(5)

It's not just space-saving. It's actually impossible to implement the normal 128-byte red-zone safely, because interrupts always clobber the 16 bytes below %rsp before any code from the interrupt handler even runs. – Bailar 26/6, 2016 at 21:51

@qdot, could you please explain what you mean by 128-byte overallocation? It means that if amd64 ABI did not have "red zone" concept, the lowest address stack could grow upon would be 128 bytes higher? – Miller 19/9, 2016 at 5:9

POSIX signals are delivered to handlers by the kernel, not the hardware. The kernel simply respects the ABI's red-zone when delivering signals that don't use sigaltstack. The relevant code in the kernel isn't compiler-generated. That's why normal functions can be registered as signal handlers; they don't need any special __attribute__ to compile specially. – Bailar 1/7, 2021 at 6:12

And on Linux at least, libc doesn't have to silently substitute a wrapper function for the real address in sigaction(2). It only tells the kernel what return address to pass to that user-space function call, getting it to return to a special libc function that uses sigreturn(2). (That man page describes the Linux mechanism where the kernel puts the thread's register state onto the user-space stack.) – Bailar 1/7, 2021 at 6:15

Also, on x86-64, local variables go below the return address, so reserving 128 bytes of space for a dummy local variable wouldn't help. Having a return address above that would clobber a red-zone. (Unlike on ISAs with a link register where a normal function call gets its return address in a register, not stack memory. Although interrupts on most ISAs still implicitly use a stack.) – Bailar 1/7, 2021 at 6:16

It is possible to use a red zone in kernel-type contexts. An IDT entry can specify a stack index (IST) of 0..7, where 0 is a bit special. The TSS contains a table of these stacks. For indices 1..7, the corresponding stack is loaded and used for the initial registers saved by the exception/interrupt; these stacks do not nest. If you partition the various exception entries by priority (e.g. NMI is the highest and can happen at any time) and treat these stacks as trampolines, you can safely handle red zones in kernel-type contexts. That is, you can subtract 128 from the saved stack pointer to get a usable kernel stack before enabling interrupts or running code which can cause exceptions.

The zero-index case behaves in a more conventional manner, pushing the stack pointer, flags, PC and error code onto the existing stack when there is no privilege transition.
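For orientation, here is a minimal sketch of the two data structures involved, assuming GCC or Clang (for `__attribute__((packed))`). The struct and function names (`tss64`, `idt_gate`, `make_interrupt_gate`) are hypothetical; only the field layout follows the architecture manuals.

```c
#include <stdint.h>
#include <string.h>

/* 64-bit TSS (104 bytes). ist[0] holds IST1, ..., ist[6] holds IST7. */
struct tss64 {
    uint32_t reserved0;
    uint64_t rsp[3];        /* RSP0..RSP2, used on privilege changes */
    uint64_t reserved1;
    uint64_t ist[7];        /* IST1..IST7 stack pointers */
    uint64_t reserved2;
    uint16_t reserved3;
    uint16_t iomap_base;
} __attribute__((packed));

/* 64-bit IDT gate descriptor (16 bytes). */
struct idt_gate {
    uint16_t offset_low;    /* handler address bits 0..15 */
    uint16_t selector;      /* code segment selector */
    uint8_t  ist;           /* bits 0..2: IST index, 0 = no stack switch via IST */
    uint8_t  type_attr;     /* present bit, DPL, gate type */
    uint16_t offset_mid;    /* handler address bits 16..31 */
    uint32_t offset_high;   /* handler address bits 32..63 */
    uint32_t reserved;
} __attribute__((packed));

/* Build an interrupt gate that makes the CPU switch to stack IST<ist_index>. */
struct idt_gate make_interrupt_gate(uint64_t handler, uint16_t selector,
                                    uint8_t ist_index)
{
    struct idt_gate g;

    memset(&g, 0, sizeof g);
    g.offset_low  = (uint16_t)(handler & 0xFFFF);
    g.selector    = selector;
    g.ist         = (uint8_t)(ist_index & 0x7);
    g.type_attr   = 0x8E;   /* present, DPL 0, 64-bit interrupt gate */
    g.offset_mid  = (uint16_t)((handler >> 16) & 0xFFFF);
    g.offset_high = (uint32_t)(handler >> 32);
    return g;
}
```

With a nonzero `ist_index`, the hardware frame lands on the dedicated stack, so the interrupted code's red zone is not touched by the CPU itself.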

The code in the trampoline has to be careful (duh, it is a kernel) not to generate other exceptions while it sanitizes the machine state, but provides a nice, safe spot to detect pathological kernel nesting, stack corruption, etc... [ sorry to respond so late, noticed this while searching for something else].
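The "subtract 128 from the saved stack pointer" step can be sketched as a small helper. The name `stack_below_red_zone` and the extra 16-byte alignment are assumptions made for illustration; a real trampoline would do this in assembly on the RSP value saved in the hardware frame.

```c
#include <stdint.h>

#define RED_ZONE_BYTES 128u

/* Given the RSP of the interrupted code (as saved in the hardware frame on
 * the IST stack), return a stack pointer that lies below that code's red
 * zone and is 16-byte aligned, so the handler can continue on the
 * interrupted stack without touching the red zone. */
uint64_t stack_below_red_zone(uint64_t saved_rsp)
{
    return (saved_rsp - RED_ZONE_BYTES) & ~(uint64_t)0xF;
}
```

For example, a saved RSP of 0x1008 yields 0xF80: 128 bytes are skipped and the result is rounded down to a 16-byte boundary.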

Butler answered 3/3, 2017 at 17:54 Comment(1)

Please upvote this person some more. This is why the red zone was introduced in the ABI - it's universally usable if you actually use the 64-bit TSS and IST mechanism, specifically created to make this work. – Gailey 12/8, 2019 at 9:15

In kernel-space, you're using the same stack that interrupts use. When an interrupt happens, the CPU pushes a return address and RFLAGS (in 64-bit mode the full hardware frame is SS, RSP, RFLAGS, CS and RIP, plus an error code for some exceptions). This clobbers at least 16 bytes below rsp before any handler code runs. Even if you wanted to write an interrupt-handler that assumed the full 128 bytes of the red-zone were valuable, it would be impossible.
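The clobbering can be modelled in plain C. This is a toy simulation, not kernel code: the stack is an array of 64-bit words, `rsp` is an index into it, and the pushed values are placeholders. It assumes the 64-bit hardware frame of five words and ignores the 16-byte alignment the CPU applies first. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAGIC 0x1122334455667788ULL

/* Simulate the CPU pushing SS, RSP, RFLAGS, CS, RIP onto the current stack.
 * The stack grows towards lower indices. Returns the new rsp index. */
size_t push_interrupt_frame(uint64_t *stack, size_t rsp)
{
    uint64_t old_rsp = (uint64_t)rsp * 8;

    stack[--rsp] = 0x10;        /* SS (placeholder) */
    stack[--rsp] = old_rsp;     /* RSP of interrupted code */
    stack[--rsp] = 0x202;       /* RFLAGS (placeholder) */
    stack[--rsp] = 0x08;        /* CS (placeholder) */
    stack[--rsp] = 0xdeadbeef;  /* RIP (placeholder) */
    return rsp;
}

/* Returns 1 if a leaf function's local survives an interrupt, 0 if not. */
int local_survives_interrupt(int use_red_zone)
{
    uint64_t stack[32] = {0};
    size_t rsp = 16;
    size_t slot;

    if (use_red_zone) {
        slot = rsp - 1;         /* local at -8(%rsp), rsp left untouched */
    } else {
        rsp -= 1;               /* like `sub $8, %rsp` in the prologue */
        slot = rsp;             /* local at 0(%rsp) */
    }
    stack[slot] = MAGIC;

    push_interrupt_frame(stack, rsp);   /* interrupt arrives here */
    return stack[slot] == MAGIC;
}
```

In this model `local_survives_interrupt(1)` returns 0 and `local_survives_interrupt(0)` returns 1, which is the whole case for `-mno-red-zone` when the handler shares the stack.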

You could maybe have a kernel-internal ABI that had a small red-zone from rsp-16 to rsp-48 or something. (Small because kernel stack is valuable, and most functions don't need very much red-zone anyway.)

Interrupt handlers would have to sub rsp, 32 before pushing any registers (and restore it before iret).

This idea won't work if an interrupt handler can itself be interrupted before it runs sub rsp, 32, or after it restores rsp before an iret. There would be a window of vulnerability where valuable data is at rsp .. rsp-16.
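The arithmetic behind that window can be sketched as follows, assuming no IST, no privilege change, and that each nested event is delivered before any handler instruction runs (for instance an NMI or a fault on the handler's first instruction; whether that can happen in a given kernel is exactly the open question). The function name and the 40-byte frame constant are illustrative; frames with an error code are 8 bytes larger.

```c
#include <stdint.h>

#define HW_FRAME_BYTES 40u  /* SS, RSP, RFLAGS, CS, RIP without error code */

/* Where rsp ends up after `depth` hardware frames are pushed back to back
 * on the same stack. In 64-bit mode the CPU aligns rsp down to 16 bytes
 * before pushing each frame. */
uint64_t rsp_after_nested_frames(uint64_t rsp, unsigned depth)
{
    for (unsigned i = 0; i < depth; i++) {
        rsp &= ~(uint64_t)0xF;  /* hardware alignment */
        rsp -= HW_FRAME_BYTES;  /* hardware-pushed frame */
    }
    return rsp;
}
```

Starting from 0x1000, one frame ends at 0xFD8, two at 0xFA8, three at 0xF78. The consumed space grows with every level, so no fixed gap below rsp stays safe for every nesting depth.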

Another practical problem with this scheme is that AFAIK gcc doesn't have configurable red-zone parameters. It's either on or off. So you'd have to add support for a kernel flavour of red-zone to gcc / clang if you wanted to take advantage of it.

Even if it was safe from nested interrupts, the benefits are pretty small. The difficulty of proving it's safe in a kernel might make it not worth it. (And as I said, I'm not at all sure it can be implemented safely, because I think nested interrupts are possible.)

(BTW, see the x86 tag wiki for links to the ABI documenting the red-zone, and other stuff.)

Bailar answered 26/6, 2016 at 21:48 Comment(2)

A bit unsure about why it won't work cf. the sentence "This idea won't work if an interrupt handler can itself be interrupted before it runs sub rsp, 32, or after it restores rsp before an iret. There would be a window of vulnerability where valuable data is at rsp .. rsp-16.". Wouldn't the "second interrupt" handler do the sub rsp,32'ing too, thus protecting the assumed red zone of the original interrupted code? Is it because there will be multiple nested return-addresses+RFLAGS pushed (by the CPU itself) which could eventually overwrite the red zone or? – Seraph 23/6, 2020 at 11:13

@Morty: not if a 2nd or 3rd nested interrupt is handled by hardware before software can run sub rsp,32. An exception / interrupt frame is more than 16 bytes: at least RIP, CS, RFLAGS, and for synchronous exceptions an exception-type code, IIRC. And if nested can happen, double-nested can theoretically happen, so even sub rsp, 2*max_single_frame is in theory not enough, and neither is any arbitrary size. – Bailar 23/6, 2020 at 15:39

Here is a concrete example of the problem described in this quote from Wikipedia:

The red zone is well-known to cause problems for x86-64 kernel developers, as the CPU itself doesn't respect the red zone when calling interrupt handlers. This leads to a subtle kernel breakage as the ABI contradicts the CPU behavior.

In my kernel, I use the Linux memcpy() C function:

void *memcpy(void *dest, const void *src,
                size_t count)
{
    char *tmp = dest;
    const char *s = src;

    while (count--)
        *tmp++ = *s++;
    return dest;
}

And the disassembly is:

0000000000000000 <memcpy>:
   0:   f3 0f 1e fa             endbr64 
   4:   55                      push   %rbp
   5:   48 89 e5                mov    %rsp,%rbp
   8:   48 8d 05 f9 ff ff ff    lea    -0x7(%rip),%rax        # 8 <memcpy+0x8>
   f:   49 bb 00 00 00 00 00    movabs $0x0,%r11
  16:   00 00 00 
  19:   4c 01 d8                add    %r11,%rax
  1c:   48 89 7d e8             mov    %rdi,-0x18(%rbp)
  20:   48 89 75 e0             mov    %rsi,-0x20(%rbp)
  24:   48 89 55 d8             mov    %rdx,-0x28(%rbp)
  28:   48 8b 45 e8             mov    -0x18(%rbp),%rax
  2c:   48 89 45 f8             mov    %rax,-0x8(%rbp)
  30:   48 8b 45 e0             mov    -0x20(%rbp),%rax
  34:   48 89 45 f0             mov    %rax,-0x10(%rbp)
  38:   eb 1d                   jmp    57 <memcpy+0x57>
  3a:   48 8b 55 f0             mov    -0x10(%rbp),%rdx
  3e:   48 8d 42 01             lea    0x1(%rdx),%rax
  42:   48 89 45 f0             mov    %rax,-0x10(%rbp)
  46:   48 8b 45 f8             mov    -0x8(%rbp),%rax
  4a:   48 8d 48 01             lea    0x1(%rax),%rcx
  4e:   48 89 4d f8             mov    %rcx,-0x8(%rbp)
  52:   0f b6 12                movzbl (%rdx),%edx
  55:   88 10                   mov    %dl,(%rax)
  57:   48 8b 45 d8             mov    -0x28(%rbp),%rax
  5b:   48 8d 50 ff             lea    -0x1(%rax),%rdx
  5f:   48 89 55 d8             mov    %rdx,-0x28(%rbp)
  63:   48 85 c0                test   %rax,%rax
  66:   75 d2                   jne    3a <memcpy+0x3a>
  68:   48 8b 45 e8             mov    -0x18(%rbp),%rax
  6c:   5d                      pop    %rbp
  6d:   c3                      retq

Note the instructions at 1c to 24: the three arguments are stored on the stack with "mov" rather than "push". Likewise, the stores at 2c and 34 initialize the two local variables. All of these slots are addressed at negative offsets from %rbp, and %rsp is never lowered to cover them.

Now for the problem. I compiled my x86_64 kernel on Ubuntu with GCC's default x86-64 ABI (System V AMD64, which uses the red zone implicitly). This function is called from exec and is certain to trigger copy-on-write, which means a page-fault exception is raised first. At that point the variable addresses and %RSP look like this: [screenshot of debug session 1]

You can see that %RSP sits directly ABOVE the stored arguments and local variables. So guess what happens when an exception is raised on an x86_64 machine: the CPU automatically saves at least 5 registers on the stack, and they overwrite the arguments and local variables.

Then I compiled it with the -mno-red-zone option. The beginning of the disassembly:

0000000000000000 <memchr>:
   0:   f3 0f 1e fa             endbr64 
   4:   55                      push   %rbp
   5:   48 89 e5                mov    %rsp,%rbp
   8:   48 83 ec 28             sub    $0x28,%rsp
   c:   48 8d 05 f9 ff ff ff    lea    -0x7(%rip),%rax        # c <memchr+0xc>

Notice the difference from the former? It reserves stack space for the arguments and local variables with

8:   48 83 ec 28             sub    $0x28,%rsp

And the result when running: [screenshot of debug session 2] Now %RSP is BELOW the arguments and local variables.

So the core reason is this: in a leaf function there is normally no need to move %RSP down to the top of the stack, so with the red-zone mechanism %RSP is not adjusted. But kernel code and exception/interrupt code share the kernel stack (unless you prepare a separate stack for exceptions/interrupts, which on x86_64 CPUs is the IST), so when a leaf function is interrupted, its arguments and local variables are overwritten.

Piranha answered 26/7, 2022 at 2:39 Comment(0)