Why do x86-64 Linux system calls modify RCX, and what does the value mean?

I'm trying to allocate some memory on Linux with the sys_brk system call. Here is what I tried:

BYTES_TO_ALLOCATE equ 0x08

section .text
    global _start

_start:
    mov rax, 12                 ; system call number 12 = sys_brk
    mov rdi, BYTES_TO_ALLOCATE  ; first argument
    syscall

    mov rax, 60                 ; system call number 60 = sys_exit
    syscall

The thing is, as per the Linux system-call calling convention, I expected the return value (a pointer to the allocated memory) to be in the rax register. I ran this in gdb, and after making the sys_brk system call I noticed the following register contents:

Before syscall

rax            0xc      12
rbx            0x0      0
rcx            0x0      0
rdx            0x0      0
rsi            0x0      0
rdi            0x8      8

After syscall

rax            0x401000 4198400
rbx            0x0      0
rcx            0x40008c 4194444 ; <---- What does this value mean?
rdx            0x0      0
rsi            0x0      0
rdi            0x8      8

I do not quite understand the value in the rcx register in this case. Which register should I use as the pointer to the beginning of the 8 bytes I allocated with sys_brk?

Pregnable answered 26/12, 2017 at 20:23 Comment(5)
RCX and R11 are clobbered by the SYSCALL instruction itself. From the instruction set reference: "after saving the address of the instruction following SYSCALL into RCX". RFLAGS gets stored into R11. - Riplex
@MichaelPetch Very interesting. It means that in order to use, say, the cl register afterwards, I need to clear it first, right? I mean, for example, xor cl, cl and then mov cl, 7. - Pregnable
You can't rely on the value of RCX or R11 after the SYSCALL. So you'll have to either use one of the other registers instead of RCX and R11 (and RAX), or save the value (on the stack, for example) and restore it afterwards. RCX and R11 don't get set by you; you just can't use them and expect them to be the same before and after the SYSCALL. - Riplex
@MichaelPetch But what's wrong with just clearing? - Pregnable
Clearing it beforehand doesn't help: the SYSCALL will just overwrite whatever was in it. You can set it after the SYSCALL if you wish, but if you do another SYSCALL the value will be clobbered again. - Riplex

The system call return value is in rax, as always. See What are the calling conventions for UNIX & Linux system calls on i386 and x86-64.

Note that sys_brk has a slightly different interface than the brk / sbrk POSIX functions; see the C library/kernel differences section of the Linux brk(2) man page. Specifically, Linux sys_brk sets the program break; the arg and return value are both pointers. See Assembly x86 brk() call use. That answer needs upvotes because it's the only good one on that question.
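To make the difference concrete, here is a minimal C sketch of the raw interface, assuming x86-64 Linux with glibc. The helper names raw_brk and grow_break are hypothetical and exist only for this illustration; syscall() and SYS_brk are the usual libc wrapper and constant. It relies on the kernel returning the unchanged break when a request fails, so passing 0 reads the current break.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Raw sys_brk: the argument is the requested break address, and the
 * return value is the resulting break (unchanged if the request failed). */
static uintptr_t raw_brk(uintptr_t new_break)
{
    return (uintptr_t)syscall(SYS_brk, new_break);
}

/* Hypothetical helper: extend the break by `size` bytes and return a
 * pointer to the start of the new region, or NULL if the kernel refused. */
static void *grow_break(size_t size)
{
    uintptr_t old_break = raw_brk(0);   /* invalid request: reports the current break */
    uintptr_t new_break = raw_brk(old_break + size);

    if (new_break != old_break + size)
        return NULL;
    return (void *)old_break;           /* the new bytes start at the old break */
}
```

With this in place, the 8 bytes would start at the pointer returned by grow_break(8), that is, at the old break. In the question's code, rdi = 8 asks the kernel to move the break to address 8, which it refuses, so the 0x401000 in rax is presumably just the current break.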


The other interesting part of your question is:

"I do not quite understand the value in the rcx register in this case"

You're seeing the mechanics of how the syscall / sysret instructions are designed to allow the kernel to resume user-space execution but still be fast.

syscall doesn't do any loads or stores; it only modifies registers. Instead of using special registers to save a return address, it simply uses regular integer registers.

It's not a coincidence that RCX=RIP and R11=RFLAGS after the kernel returns to your user-space code. The only way for this not to be the case is if a ptrace system call modified the process's saved rcx or r11 value while it was inside the kernel. (ptrace is the system call gdb uses.) In that case, Linux would use iret instead of sysret to return to user space, because the slower general-case iret takes RIP and RFLAGS from the stack instead of from RCX and R11, leaving those registers free to hold any value. (See What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? for a walk-through of Linux's system-call entry points, mostly the entry points from 32-bit processes rather than from syscall in a 64-bit process.)


Instead of pushing a return address onto the kernel stack (like int 0x80 does), syscall:

  • sets RCX=RIP, R11=RFLAGS (so it's impossible for the kernel to even see the original values of those regs before you executed syscall).

  • masks RFLAGS with a pre-configured mask from a config register (the IA32_FMASK MSR). This lets the kernel keep interrupts disabled (IF cleared) until it has run swapgs and set rsp to point to the kernel stack. Even with cli as the first instruction at the entry point, there'd be a window of vulnerability. You also get cld for free by masking off DF, so rep movs / stos go upward even if user-space had used std.

    Fun fact: AMD's first proposed syscall / swapgs design didn't mask RFLAGS, but they changed it after feedback from kernel developers on the amd64 mailing list (in ~2000, a couple years before the first silicon).

  • jumps to the configured syscall entry point (setting RIP = IA32_LSTAR, with the CS and SS selectors derived from the IA32_STAR MSR). The old CS value isn't saved anywhere, I think.

  • doesn't do anything else: the kernel has to use swapgs to get access to an info block where it saved the kernel stack pointer, because rsp still has its value from user-space.

So the design of syscall requires a system-call ABI that clobbers registers, and that's why the values are what they are.
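The clobbering can be observed from C with GCC-style inline assembly. The following is a rough sketch, assuming x86-64 Linux and a GCC or Clang compiler using AT&T syntax; struct syscall_regs and raw_getpid are hypothetical names made up for this demo. It issues getpid through a bare syscall instruction and captures RAX, RCX and R11 as outputs, plus the address of the instruction right after syscall for comparison.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/syscall.h>

/* Hypothetical struct for this demo: register values seen after syscall. */
struct syscall_regs {
    long     ret;   /* RAX: the system call's return value */
    uint64_t rcx;   /* RCX as left by the syscall / return path */
    uint64_t r11;   /* R11 as left by the syscall / return path */
    uint64_t next;  /* address of the instruction following syscall */
};

/* Invoke getpid through a bare syscall instruction (x86-64 Linux only). */
static struct syscall_regs raw_getpid(void)
{
    struct syscall_regs r;
    register uint64_t r11_out __asm__("r11");   /* pin this output to R11 */

    __asm__ volatile(
        "syscall\n\t"
        "1:\n\t"
        "lea 1b(%%rip), %[next]"    /* address of label 1, the return point */
        : "=a"(r.ret), "=c"(r.rcx), "=r"(r11_out), [next] "=r"(r.next)
        : "0"((long)SYS_getpid)     /* system call number goes in RAX */
        : "memory");

    r.r11 = r11_out;
    return r;
}
```

On a normal return through sysret one would expect r.rcx to equal r.next and r.r11 to hold the RFLAGS value from the moment of the call. Under a debugger or an emulating sandbox that may not hold, which is one more reason to treat both registers as overwritten rather than assuming anything about their contents.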

Seldom answered 27/12, 2017 at 19:6 Comment(7)
What is the use case of the sysret instruction? The link you provided mentions that it is the companion instruction for syscall, but I have never seen sysret being used after a syscall instruction! - Biramous
@SouravKannanthaB: syscall calls into the kernel, sysret (in the kernel) returns to user-space. So the reason is the same as why you don't use ret after call printf, unless that happens to be the end of your function. See What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? for some details about how the kernel int 0x80 and syscall entry points work. - Seldom
If saved RCX != RIP or R11 != RFLAGS, the Linux kernel uses iret instead of sysret. Why not just restore %rcx/%r11 with the saved RIP/RFLAGS and use sysret? (I think this would be faster.) - Lashkar
@FangZhen: Because of CPU / ISA design bugs. e.g. if RIP is non-canonical, the CPU will #GP. But Intel CPUs will handle that exception without updating RSP (so it's the user stack), while the CPU is still in kernel mode. User-space could easily exploit that by having another thread modify that memory which is getting used as a kernel stack, after using ptrace to create a non-canonical RIP. So some checking is needed, and since ptrace or signals changing registers of other tasks are rare, it's fastest to just use a simple check. - Seldom
@FangZhen: See github.com/torvalds/linux/blob/… for example. - Seldom
@PeterCordes As for the linked code, what happens if we replace lines 255-258 with the following, which avoids the RCX==RIP check? ` movq RIP(%rsp), %rcx \n movq %rcx, %r11 ` - Lashkar
@FangZhen: That would lead to wrong behaviour for cases where ptrace wanted to set RCX in another task. (Or maybe a signal handler modifying registers.) e.g. in GDB, set $rcx = 1 is something that's supposed to work, but your change would break it. (Maybe just for tasks that were stuck in a system call, or maybe for any task even if it was stopped at a breakpoint, if that return path gets used there.) - Seldom
