Is a sign or zero extension required when adding a 32bit offset to a pointer for the x86-64 ABI?

Asked 19/4, 2016 at 1:2 Answered 21/4, 2016 at 5:38

Solved assembly x86-64 compiler-optimization abi sign-extension

Summary: I was looking at assembly code to guide my optimizations and see lots of sign or zero extensions when adding int32 to a pointer.

void Test(int *out, int offset)
{
    out[offset] = 1;
}
-------------------------------------
movslq  %esi, %rsi
movl    $1, (%rdi,%rsi,4)
ret

At first, I thought my compiler was challenged at adding 32bit to 64bit integers, but I've confirmed this behavior with Intel ICC 11, ICC 14, and GCC 5.3.

This thread confirms my findings, but it's not clear if the sign or zero extension is necessary. This sign/zero extension would only be necessary if the upper 32bits aren't already set. But wouldn't the x86-64 ABI be smart enough to require that?

I'm kind of reluctant to change all my pointer offsets to ssize_t because register spills will increase the cache footprint of the code.

Adenocarcinoma answered 19/4, 2016 at 1:2 Comment(7)

int is a signed type. Use unsigned, or better size_t if you don't want the sign-extension. – Tighe 19/4, 2016 at 1:6

I tried that, but the compiler just replaces the sign extend with a zero extend, which I don't want either – Adenocarcinoma 19/4, 2016 at 1:9

I've looked through the SYS V x86-64 ABI and don't see many references to sign/zero extension. But I found the same problem happens when mixing 32bit pointers with 16bit offsets. – Adenocarcinoma 19/4, 2016 at 1:30

Well, the SysV ABI doesn't mandate that the upper 32 bits of a 64-bit register be zeroed when passing a 32-bit type. Compile this and look at the generated assembly: void foo(uint32_t); void bar(uint64_t x){foo(x);} – Tighe 19/4, 2016 at 1:31

@Tighe excellent point. So this proves the upper 32bits can be undefined quite often. I thought in 64-bit mode, every instruction computes a 64bit result making conversion in the callee unnecessary, with maybe the only exception being legacy 32bit assembly code (e.g. mov ah, 3), which is discouraged since partial register writes are slow. I think this is the best answer since it explains how common the upper 32bits are undefined since C converts int64 to int32 by truncation. If you write it up, I'll accept it. – Adenocarcinoma 19/4, 2016 at 19:0

Consider how to compile int64_t t = something; foo((int)t, arr[t]). You need to compute t in a 64bit register because the array indexing uses all 64 bits. If you computed it in %rdi, it's already in the right place for a call to foo, but has high garbage. BTW, @EOF: the ABI seems to have some unwritten rules about extending 8b and 16b to 32b. I was surprised, but see my answer. – Bronez 21/4, 2016 at 5:46

"I thought in 64-bit mode, every instruction computes a 64bit" No, the default operand size in x86_64 is 32 bits. Every 64-bit op or access to the high registers needs a REX prefix, so it'll be longer and not used unless necessary. – Twentyfour 21/4, 2016 at 5:53

Yes, you have to assume that the high 32 bits of an arg or return-value register contains garbage. On the flip side, you are allowed to leave garbage in the high 32 when calling or returning yourself. i.e. the burden is on the receiving side to ignore the high bits, not on the passing side to clean the high bits.
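To make the "receiving side ignores the high bits" rule concrete, here is a minimal C sketch that models an argument register as a plain uint64_t. The function names and the `reg` parameter are hypothetical, made up for illustration; the int32_t cast assumes the two's-complement conversion that gcc and clang implement.

```c
#include <stdint.h>

/* Hypothetical model: `reg` stands for the full 64-bit contents of an
   argument register such as %rsi. Only the low 32 bits carry the argument;
   bits 32..63 are whatever the caller happened to leave there. */

/* Roughly what movslq %esi, %rsi computes: drop the high half, sign-extend. */
int64_t index_from_int_arg(uint64_t reg)
{
    uint32_t low = (uint32_t)reg;      /* ignore bits 32..63 */
    return (int64_t)(int32_t)low;      /* reinterpret as signed, then widen */
}

/* Roughly what movl %esi, %esi computes: drop the high half, zero-extend. */
uint64_t index_from_unsigned_arg(uint64_t reg)
{
    return (uint64_t)(uint32_t)reg;
}
```

With garbage such as 0xDEADBEEF in the high half, both helpers still recover the intended index, which is exactly the job the extra instruction in the callee does.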

You need to sign or zero extend to 64 bits to use the value in a 64-bit effective address. In the x32 ABI, gcc frequently uses 32-bit effective addresses instead of using 64-bit operand-size for every instruction modifying a potentially-negative integer used as an array index.

The standard:

The x86-64 SysV ABI only says anything about which parts of a register are zeroed for _Bool (aka bool). Page 20:

When a value of type _Bool is returned or passed in a register or on the stack, bit 0 contains the truth value and bits 1 to 7 shall be zero (footnote 14: Other bits are left unspecified, hence the consumer side of those values can rely on it being 0 or 1 when truncated to 8 bit)

The ABI also specifies that for varargs functions, %al (not the whole %rax) holds an upper bound on the number of vector registers used for args.

There's an open github issue about this exact question on the github page for the x32 and x86-64 ABI documents.

The ABI doesn't place any further requirements or guarantees on the contents of the high parts of integer or vector registers holding args or return values, so there aren't any. I have confirmation of this fact via email from Michael Matz (one of the ABI maintainers): "Generally, if the ABI doesn't say something is specified, you cannot rely on it."

He also confirmed that e.g. clang >= 3.6's use of an addps that could slow down or raise extra FP exceptions with garbage in high elements is a bug (which reminds me I should report that). He adds that this was an issue once with an AMD implementation of a glibc math function. Normal C code can leave garbage in high elements of vector regs when passing scalar double or float args.

Actual behaviour which is not (yet) documented in the standard:

Narrow function arguments, even _Bool/bool, are sign or zero-extended to 32 bits. clang even makes code that depends on this behaviour (since 2007, apparently). ICC17 doesn't do it, so ICC and clang are not ABI-compatible, even for C. Don't call clang-compiled functions from ICC-compiled code for the x86-64 SysV ABI, if any of the first 6 integer args are narrower than 32-bit.

This doesn't apply to return values, only args: gcc and clang both assume that return-values they receive only have valid data up to the width of the type. gcc will make functions returning char that leave garbage in the high 24 bits of %eax, for example.

A recent thread on the ABI discussion group was a proposal to clarify the rules for extending 8 and 16-bit args to 32 bits, and maybe actually modify the ABI to require this. The major compilers (except ICC) already do it, but it would be a change to the contract between callers and callees.

Here's an example (check it out with other compilers or tweak the code on the Godbolt Compiler Explorer, where I've included many simple examples that each demonstrate only one piece of the puzzle, as well as this one, which demonstrates several at once):

extern short fshort(short a);
extern unsigned fuint(unsigned int a);

extern unsigned short array_us[];
unsigned short lookupu(unsigned short a) {
  unsigned int a_int = a + 1234;
  a_int += fshort(a);                 // NOTE: not the same calls as the signed lookup
  return array_us[a + fuint(a_int)];
}

# clang-3.8 -O3  for x86-64.    arg in %rdi.  (Actually in %di, zero-extended to %edi by our caller)
lookupu(unsigned short):
    pushq   %rbx                      # save a call-preserved reg for our own use.  (Also aligns the stack for another call)
    movl    %edi, %ebx                # If we didn't assume our arg was already zero-extended, this would be a movzwl (aka movzx)
    movswl  %bx, %edi                 # sign-extend to call a function that takes signed short instead of unsigned short.
    callq   fshort(short)
    cwtl                              # Don't trust the upper bits of the return value.  (This is cwde in Intel syntax.  eax = sign_extend(ax))
    leal    1234(%rbx,%rax), %edi     # this is the point where we'd get a wrong answer if our arg wasn't zero-extended.  gcc doesn't assume this, but clang does.
    callq   fuint(unsigned int)
    addl    %ebx, %eax                # zero-extends eax to 64bits
    movzwl  array_us(%rax,%rax), %eax # This zero-extension (instead of just writing ax) is *not* for correctness, just for performance: avoid partial-register slowdowns if the caller reads eax
    popq    %rbx
    retq

Note: movzwl array_us(,%rax,2) would be equivalent, but no smaller. If we could depend on the high bits of %rax being zeroed in fuint()'s return value, the compiler could have used array_us(%rbx, %rax, 2) instead of using the add insn.
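The leal step above is where the two compilers' assumptions diverge. A rough C model of that one computation may help; the helper names and the `edi` / `fshort_ret` parameters are hypothetical and only stand in for the register contents, they are not part of the original example.

```c
#include <stdint.h>

/* `edi` models the raw 32-bit register that carries the unsigned short arg;
   `fshort_ret` models fshort()'s return value after the cwtl sign-extension. */

/* clang-style: trust that the caller zero-extended the arg to 32 bits. */
uint32_t a_int_trusting_caller(uint32_t edi, int16_t fshort_ret)
{
    return edi + 1234u + (uint32_t)(int32_t)fshort_ret;
}

/* gcc-style: re-extend the 16-bit arg before using it (a movzwl). */
uint32_t a_int_reextending(uint32_t edi, int16_t fshort_ret)
{
    return (uint32_t)(uint16_t)edi + 1234u + (uint32_t)(int32_t)fshort_ret;
}
```

Both helpers agree when bits 16..31 of `edi` are zero. If a caller leaves garbage there (as ICC reportedly can), only the re-extending version still matches what the C source of lookupu asks for.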

Performance implications

Leaving the high32 undefined is intentional, and I think it's a good design decision.

Ignoring the high 32 is free when doing 32-bit ops. A 32-bit operation zero-extends its result to 64-bit for free, so you only need an extra mov edx, edi or something if you could have used the reg directly in a 64-bit addressing mode or 64-bit operation.

Some functions won't save any insns from having their args already extended to 64-bit, so it's a potential waste for callers to always have to do it. Some functions use their args in a way that requires the opposite extension from the signedness of the arg, so leaving it up to the callee to decide what to do works well.

Zero-extending to 64-bit regardless of signedness would be free for most callers, though, and might have been a good ABI design choice. Since arg regs are clobbered anyway, the caller already needs to do something extra if it wants to keep a full 64-bit value across a call where it only passes the low 32. Thus it usually only costs extra when you need a 64-bit result for something before the call, and then pass a truncated version to a function. In x86-64 SysV, you can generate your result in RDI and use it, and then call foo which will only look at EDI.
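As a sketch of that last case, here is the kind of C source where a caller needs the full 64-bit value first and then passes a truncated copy. Both function names are hypothetical, and whether the truncation really costs zero instructions depends on the compiler's register allocation.

```c
#include <stdint.h>

/* Hypothetical callee: only the low 32 bits of its arg register are meaningful. */
uint32_t consume32(uint32_t x)
{
    return x + 1u;
}

/* Hypothetical caller: uses all 64 bits of t for indexing, then passes
   the low 32 bits on. */
uint32_t index_then_call(const uint32_t *arr, uint64_t t)
{
    uint32_t elem = arr[t];               /* indexing uses the full 64-bit t */
    return elem + consume32((uint32_t)t); /* truncation: high bits may simply be left in the register */
}
```

Under a rule that required zero-extended args, the caller would need an extra 32-bit mov before the call whenever the high half of t could be non-zero.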

16-bit and 8-bit operand-sizes often lead to false dependencies (AMD, P4, or Silvermont, and later SnB-family), or partial-register stalls (pre SnB) or minor slowdowns (Sandybridge), so the undocumented behaviour of requiring 8 and 16b types to be extended to 32b for arg-passing makes some sense. See Why doesn't GCC use partial registers? for more details on those microarchitectures.

This is probably not a big deal for code-size in real code, since tiny functions are / should be static inline, and arg-handling insns are a small part of bigger functions. Inter-procedural optimization can remove overhead between calls when the compiler can see both definitions, even without inlining. (IDK how well compilers do at this in practice.)

I'm not sure whether changing function signatures to use uintptr_t will help or hurt overall performance with 64-bit pointers. I wouldn't worry about stack space for scalars. In most functions, the compiler pushes/pops enough call-preserved registers (like %rbx and %rbp) to keep its own variables live in registers. A tiny bit extra space for 8B spills instead of 4B is negligible.

As far as code-size, working with 64-bit values requires a REX prefix on some insns that wouldn't have otherwise needed one. Zero-extending to 64-bit happens for free if any operations are required on a 32-bit value before it gets used as an array index. Sign-extension always takes an extra instruction if it's required. But compilers can sign-extend and work with it as a 64-bit signed value from the start to save instructions, at the cost of needing more REX prefixes. (Signed overflow is UB, not defined to wrap around, so compilers can often avoid redoing sign-extension inside a loop with an int i that uses arr[i].)
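The signed vs. unsigned difference in that last point can be seen by compiling a pair of loops like the following and comparing the asm. This is only a sketch with made-up function names; the comments describe what the language rules permit, not what any particular compiler version emits.

```c
#include <stdint.h>

/* Signed index: start + k overflowing would be undefined behaviour, so the
   compiler may widen the index to 64 bits once, outside the loop. */
int64_t sum_signed_index(const int *arr, int start, int count)
{
    int64_t total = 0;
    for (int k = 0; k < count; k++)
        total += arr[start + k];
    return total;
}

/* Unsigned index: start + k is defined to wrap modulo 2^32, so the generated
   code has to preserve that wraparound unless the compiler can rule it out. */
int64_t sum_unsigned_index(const int *arr, unsigned start, unsigned count)
{
    int64_t total = 0;
    for (unsigned k = 0; k < count; k++)
        total += arr[start + k];
    return total;
}
```

For in-range inputs both functions return the same sum; the difference is only in how much freedom the compiler has when forming the 64-bit address.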

Modern CPUs usually care more about insn count than insn size, within reason. Hot code will often be running from the uop cache in CPUs that have them. Still, smaller code can improve density in the uop cache. If you can save code size without using more or slower insns, then it's a win, but not usually worth sacrificing anything else for unless it's a lot of code size.

Like maybe one extra LEA instruction to allow [reg + disp8] addressing for a dozen later instructions, instead of disp32. Or xor eax,eax before multiple mov [rdi+n], 0 instructions to replace the imm32=0 with a register source. (Especially if that allows micro-fusion where it wouldn't be possible with a RIP-relative + immediate, because what really matters is front-end uop count, not instruction count.)

Bronez answered 21/4, 2016 at 5:38 Comment(7)

That's a lot of arcane, but still a treasure trove of information you've dug up. Thanks. The main question now is what best practices should be used for choosing array index types. Currently, I've been using ssize_t for practically all numbers that participate in address calculations. This seems to work well in general, but might not be necessary or even not optimal based on your findings. So, I think I'll change my strategy to, using ssize_t in all top level functions (so that there's never sign or zero extension). Then for leaf code or hot loops, take advantage of int32 when possible. – Adenocarcinoma 21/4, 2016 at 19:24

The cases where int32 is equally as fast or faster that you mention are: 1. free zero extension for 32bit operations. I'm reluctant about this because then you'll have to use unsigned types which is more error prone because of overflow. 2. 32bit multiplies are faster than 64bit before Nehalem and probably on other architectures as well. 3. smaller code size from not using a REX prefix. --------------- It's a shame x86-64 has all these quirks. ARM64 doesn't have this problem - it's able to use the lower half of a 64bit register in address calculations directly. – Adenocarcinoma 21/4, 2016 at 19:38

@YaleZhang: If you see any measurable speed difference, please let me know. I've wondered the same thing, and have occasionally looked at the code with signed int vs. unsigned int, and it can be clunky either way depending on the context. The biggest downside to unsigned is that the compiler must emit code that behaves correctly when it wraps, unlike for int (signed overflow is undefined behaviour). This can allow more optimization – Bronez 23/4, 2016 at 0:31

For what it's worth, icc seems to break the de facto (undocumented) standard you mentioned above: it does not sign/zero extend smaller-than-32-bit arguments to 32 bits. Here's an example. Note that it just calls consumer(char a) with arbitrary garbage in bits 8-31 of edi. – Grasshopper 17/3, 2017 at 23:39

@BeeOnRope: well spotted. That's nasty, so apparently you can't safely call clang-compiled functions from ICC-compiled code, if there are any narrow integer args. – Bronez 24/3, 2017 at 15:8

Your second sentence doesn't read well to me. Do you mean "You are not ...", or do you mean "You are also ..." (and in the latter case the final word is most confusing). – Gerladina 30/4, 2019 at 17:37

@GregA.Woods: thanks for the feedback. Edited to clarify. I said "though" because if you look at it from the other perspective it gives you more freedom / it's a possible advantage. But I agree that meaning might not come through clearly. – Bronez 30/4, 2019 at 17:43

As EOF's comment indicates, the compiler can't assume that the upper 32 bits of a 64-bit register used to pass a 32-bit argument have any particular value. That makes the sign or zero extension necessary.

The only way to prevent this would be to use a 64-bit type for the argument, but this moves the requirement to extend the value to the caller, which may not be an improvement. I wouldn't worry too much about the size of register spills though, since the way you're doing it now it's probably more likely that after the extension the original value will be dead and it's the 64-bit extended value that will be spilled. Even if it's not dead the compiler may still prefer to spill the 64-bit value.
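A small sketch of that trade-off, using a variant of the question's Test function. The names store_at and store_from_int are hypothetical, and ptrdiff_t is just one possible pointer-width index type.

```c
#include <stddef.h>

/* Callee takes a pointer-width index, so it needs no extension of its own. */
void store_at(int *out, ptrdiff_t offset)
{
    out[offset] = 1;
}

/* Hypothetical caller that only has a 32-bit int: the int -> ptrdiff_t
   conversion (a sign-extension) now has to happen here, before the call. */
void store_from_int(int *out, int offset)
{
    store_at(out, offset);
}
```

If every call site looks like store_from_int, the extension instruction has only moved, and it may now be duplicated at each call site instead of appearing once in the callee.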

If you're really concerned about your memory footprint and you don't need the larger 64-bit address space you might look at the x32 ABI which uses the ILP32 types but supports the full 64-bit instruction set.

Outflank answered 19/4, 2016 at 3:53 Comment(6)

x32 is a data-size win for pointer-heavy data structures, and I think for code-size, too. However, it often has to use address-size prefixes when the compiler can't prove that a 64bit addressing mode won't go outside the low 4G. (e.g. [eax + disp] will wrap, but [rax + disp] won't, so an address-size prefix is needed unless the compiler can somehow prove something about the address, and/or the index if indexing with another register). – Bronez 19/4, 2016 at 4:56

@PeterCordes I'm surprised that the compiler worries about that since I can't see how the address can wrap without invoking undefined behaviour. On the other hand I can see it using the override to avoid having to zero extend the register in a situation similar to the original poster's problem. – Outflank 19/4, 2016 at 5:9

I was surprised, too, for the same reason: how is that useful? AFAICT it does use it even after zeroing the upper 32 by doing a 32bit operation or something. At this point it's probably more of a safe but not perfectly optimal implementation – Bronez 19/4, 2016 at 6:2

Good point about paying the zero/sign extension cost in the caller instead of callee, which could make the code size bigger without reducing the # instructions executed. I'm not that worried about register spills since 32bit reads/writes have the same latency and throughput as 64bit ones if they hit the cache and the chance of causing additional cache misses is low. – Adenocarcinoma 19/4, 2016 at 19:7

I added my own answer with way more details. Most interesting: clang depends on callers sign or zero extending narrow args to 32bit. – Bronez 21/4, 2016 at 5:48

Even more interesting: per my comment above, icc does not sign/zero extend even to 32-bits. So clang and icc are mutually incompatible. gcc is compatible with icc since even though it does extend to 32-bits, it doesn't seem to rely on it (yet) when implementing functions. – Grasshopper 17/3, 2017 at 23:48
