Both r10
and r11
are call-clobbered registers, aka volatile, you can overwrite them without saving/restoring in any leaf or non-leaf function. That's what C compilers do, and expect from functions their code calls, because that's what the ABI doc says: What registers are preserved through a linux x86-64 function call
Your caller will expect them to hold garbage after return. Just like arg-passing registers such as RDI or RCX. (And RDX if it's not part of a wide RDX:RAX return value.)
The x86-64 System V ABI doesn't name its calling convention "cdecl". It's just the x86-64 SysV calling convention. The string "cdecl" doesn't appear in the ABI doc.
r11
is a temporary, aka call-clobbered register. r11
is never used for passing or returning anything, so it's safe even for wrapper / trampoline / hook functions to clobber it even if they want to forward all args and return all return values, unlike any other register. For example, lazy dynamic linker code.
r10
is also call-clobbered. The ABI says "used for passing a function’s static chain pointer". In languages that use nested functions, this is an extra incoming arg to such functions so they can find the local vars of the outer scope. A pointer-to-nested-function needs a code pointer and a static chain pointer for the caller to pass when dereferencing.
It's a "chain" because there can be multiple levels of nesting with their stack frames forming a linked list, and "static" because it's based on lexical scope, not the call stack. (Thanks @Raymond Chen for explaining the terminology.)
But like all arg-passing registers to function calls (not system calls) in x86-64 System V, it's call-clobbered (like in most calling conventions generally). You only need to worry about this usage of r10
if you're hooking or wrapping nested functions, ones defined inside another function. If you're just writing a function that's called normally, it's a pure temporary.
GCC does use r10
as part of its trampoline for function pointers to GNU C nested functions, for a pointer to the stack frame of the outer scope. The trampoline of machine code on the stack is a hack, but this is indeed a static chain pointer; languages with proper support for nested functions (unlike C and C++) would probably have the caller aware of it (like a lambda / closure) and passing a value in r10
when using using pointer to a nested function.
In a normal function, RBX, RBP, and RSP are call-preserved, along with R12..R15. All others can be clobbered without saving/restoring. (That includes xmm/ymm0..15 and zmm0..31 / k0..7, and the mmx/x87 stack, and the condition codes in RFLAGS).
Note that r8..15
need a REX prefix, even with 32-bit operand-size (like xor r10d, r10d
). If you have some 64-bit non-pointer integers, then sure keep them in r8..r11 because you always need a REX prefix for 64-bit operand-size any time you use those values anyway.
Smaller code-size is usually not worse, and sometimes helps with decode and uop-cache density, and L1i cache density. RAX, RCX,RDX, RSI,RDI should be your first choices for scratch regs. (And use 32-bit operand-size unless you need 64-bit. e.g. xor eax,eax
is the correct way to zero RAX. Silvermont doesn't recognize xor r10,r10
as a zeroing idiom, so use xor r10d,r10d
even though it doesn't save code size.)
If you do run out of low registers, ideally use r8..r11
for things that will normally be used with 64-bit operand-size (or VEX prefixes) anyway. e.g. pointers to 64-bit data or pointers to pointers. mov eax, [r10]
needs a REX prefix while mov eax, [rdi]
doesn't. But mov rax, [rdi]
and mov r8, [r10]
are the same size.
It's hard to gain much because you often need to use different values together in different combinations, like eventually using cmp eax, r10d
or whatever, but if you want to go all-out on optimizing, then think about code-size.
Maybe also think about where the instruction boundaries are and how it will fit into the uop cache. (And the JCC erratum for Skylake.)
See the x86 tag wiki, and especially http://agner.org/optimize/ for tips on writing efficient code.