As Trillian's answer points out, AMD K8 and K10 have a problem with branch prediction when ret is a branch target, or follows a conditional branch (as the fall-through target). That's because ret is only 1 byte long.
"repz ret: why all the hassle?" has some extra details about the specific micro-architectural reasons why that gives K8 and Barcelona a hard time.
Avoiding 1-byte ret as a possible branch target:
AMD's optimization guide for K10 (Barcelona) recommends 3-byte ret 0
in those cases, which pops zero bytes from the stack as well as returning. That version is significantly worse than rep ret
on Intel. Ironically, it's also worse than rep ret
on later AMD processors (Bulldozer and onwards.) So it's a good thing nobody changed to using ret 0
based on AMD's Family 10 optimization guide update.
The processor manuals warn that future processors could interpret a prefix differently when it's combined with an instruction it doesn't modify. That's true in theory, but nobody's going to make a CPU that can't run a lot of existing binaries.
gcc still uses rep ret
by default (without -mtune=intel
, or -march=haswell
or something). So most Linux binaries have a repz ret
in them somewhere.
gcc will probably stop using rep ret
in a few years, once K10 is thoroughly obsolete. After another 5 or 10 years, almost all binaries will be built with a gcc newer than that. Another 15 years after that, a CPU manufacturer might think about repurposing the f3 c3
byte sequence as (part of) a different instruction.
There will still be legacy closed-source binaries using rep ret
that don't have more recent builds available, and that someone needs to keep running, though. So whatever new feature f3 c3 != rep ret
is part of would need to be disable-able (e.g. with a BIOS setting), and have that setting actually change the instruction-decoder behaviour to recognize f3 c3
as rep ret
. If that backwards-compatibility for legacy binaries isn't possible (because it can't be done efficiently in terms of power and transistors), IDK what kind of time-frame you'd be looking at. Much longer than 15 years, unless this was a CPU for only part of the market.
So it's safe to use rep ret
, because everyone else is already doing it. Using ret 0
is a bad idea. In new code, it may still be a good idea to use rep ret
for another couple years. There probably aren't too many AMD PhenomII CPUs still around, but they're slow enough without extra return-address mispredicts or w/e the problem is.
The cost is pretty small. It doesn't end up taking any extra space in most cases, because it's usually followed by nop
padding anyway. However, in the cases where it does result in extra padding, it'll be the worst-case where 15B of padding is needed to reach the next 16B boundary. gcc may only align by 8B in that case. (with .p2align 4,,10;
to align to 16B if it will take 10 or fewer nop bytes, then a .p2align 3
to always align to 8B. Use gcc -S -o-
to produce asm output to stdout to see when it does this.)
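To make those alignment directives easier to spot, here is a minimal sketch of a filter for gcc's asm output. The function name show_align_and_ret and the file name foo.c are made up for illustration, and the pattern assumes gcc's usual one-instruction-per-line formatting:

```sh
# Keep only alignment directives and return instructions from asm text on stdin.
# (Hypothetical helper; the regex assumes gcc-style asm, one instruction per line.)
show_align_and_ret() {
    grep -E '^[[:space:]]*(\.p2align|rep[a-z]*|ret)([[:space:];]|$)'
}

# Usage (foo.c is a placeholder):
#   gcc -O2 -S -o- foo.c | show_align_and_ret
```

A .p2align 4,,10 followed by .p2align 3 right after a return is the case described above.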
So if we guesstimate that one in 16 rep ret
end up creating extra padding where a ret
would have just hit the desired alignment, and that the extra padding goes to an 8B boundary, this means each rep
has an average cost of 8 * 1/16 = half a byte.
rep ret
isn't used often enough to add up to much of anything. For example, firefox with all the libraries it has mapped only has ~9k instances of rep ret
. So that's about 4k bytes, across many files. (And less RAM than that, since many of those functions in dynamic libraries are never called.)
# disassemble every shared object mapped by a process.
ffproc=/proc/$(pgrep firefox)/
objdump -d "$ffproc/exe" $(sudo ls -l "$ffproc"/map_files/ |
awk '/\.so/ {print $NF}' | sort -u) |
grep 'repz ret' -c
objdump: '(deleted)': No such file # I forgot to restart firefox after the libexpat security update
9649
That counts rep ret
in all the functions in all the libraries firefox has mapped, not just the functions it ever calls. This is somewhat relevant, because lower code density across functions means your calls are spread out over more memory pages. ITLB and L2-TLB only have a limited number of entries. Local density matters for L1I$ (and Intel's uop-cache). Anyway, rep ret
has a very tiny impact.
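The back-of-the-envelope numbers can be redone in a couple of lines. This is only a sketch of the guesstimate above (one return in 16 spilling over, 8 bytes of padding each time), not a measurement; avg_cost and total_cost are made-up helper names:

```sh
# Expected extra bytes per rep ret: PAD bytes of padding, paid by NUM out of DEN returns.
avg_cost() {     # usage: avg_cost PAD NUM DEN
    awk -v pad="$1" -v n="$2" -v d="$3" 'BEGIN { printf "%.2f\n", pad * n / d }'
}

# The same estimate scaled up to COUNT instances of rep ret.
total_cost() {   # usage: total_cost COUNT PAD NUM DEN
    awk -v c="$1" -v pad="$2" -v n="$3" -v d="$4" 'BEGIN { printf "%.1f\n", c * pad * n / d }'
}

avg_cost 8 1 16          # prints 0.50
total_cost 9649 8 1 16   # prints 4824.5
```

Changing the 1-in-16 guess or the 8B padding shows how sensitive the total is to those assumptions.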
It took me a minute to think of a reason that /proc/<pid>/map_files/
isn't accessible to the owner of the process, but /proc/<pid>/maps
is. If a UID=root process (e.g. from a suid-root binary) mmap(2)
s a 0666 file that's in a 0700 directory, then does setuid(nobody)
, anyone running that binary could bypass the access restriction imposed by the lack of x for other
permission on the directory.
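Since /proc/<pid>/maps is readable by the owner, the list of mapped libraries could also be pulled from there without sudo. A rough sketch, assuming paths contain no spaces; mapped_sos is a made-up helper name, and it skips mappings whose file has been deleted:

```sh
# Print the unique shared-object paths listed in a /proc/<pid>/maps-format file.
# Field 6 is the pathname; entries ending in "(deleted)" are skipped.
mapped_sos() {   # usage: mapped_sos /proc/PID/maps
    awk '$NF != "(deleted)" && $6 ~ /\.so/ { print $6 }' "$1" | sort -u
}

# Usage, assuming pgrep finds exactly one firefox process:
#   objdump -d $(mapped_sos "/proc/$(pgrep firefox)/maps") | grep -c 'repz ret'
```

Unlike the command above, this doesn't include the main executable, so the count will differ slightly.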