What <4GB workloads would have worse performance in the Linux x32 ABI than x64?

Asked 15/10, 2012 at 19:56 Answered 15/10, 2012 at 21:8

linux performance x86 x86-64 linux-x32-abi

There is a relatively new Linux ABI referred to as x32, where the x86-64 processor runs in 32-bit mode, so pointers are still only 32 bits wide, but the 64-bit architecture-specific registers are still used. So you're still limited to 4GB max memory use as in normal 32-bit, but your pointers use up less cache space than they do in 64-bit, you can do 64-bit arithmetic efficiently, and you get access to more registers (16) than you would in vanilla 32-bit (8).

Assuming you have a workload that fits nicely within 4GB, is there any way the performance of x32 could be worse than on x86-64?

It seems to me that if you don't need the extra memory space nothing is lost -- you should always get the same perf (when you already fit in cache) or better (when the pointer space savings lets you fit more in cache). But it wouldn't surprise me if there are paging/TLB/etc. details that I don't know about.
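The cache-footprint argument can be checked directly. The following is a minimal sketch, assuming a hypothetical `struct node` (not from any real codebase) and a 64-byte cache line; it only reports what the compiler lays out for whichever ABI it is built for.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical pointer-heavy node, used only for illustration. */
struct node {
    struct node *next;
    struct node *prev;
    int value;
};

/* Size of one node in bytes under the ABI this file is compiled for. */
size_t node_size(void) { return sizeof(struct node); }

/* Number of whole nodes that fit in a cache line of line_bytes bytes. */
size_t nodes_per_line(size_t line_bytes) {
    return line_bytes / sizeof(struct node);
}

/* Print the layout so builds for different ABIs can be compared. */
void print_layout(void) {
    printf("sizeof(void *)      = %zu\n", sizeof(void *));
    printf("sizeof(struct node) = %zu\n", node_size());
    printf("nodes per 64B line  = %zu\n", nodes_per_line(64));
}
```

Calling `print_layout()` from a `main` built once with `gcc -m64` and once with `gcc -mx32` should show the node shrinking from 24 bytes to 12, so roughly twice as many nodes fit per cache line. Running the x32 build assumes a kernel and C library with x32 support.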

Salaidh answered 15/10, 2012 at 19:56 Comment(6)

The devil is in the details, so I wouldn't be very surprised if, on rare occasions and under your conditions, x32 turned out a little worse than x86-64. But I don't believe that is common. (You could imagine that alignment constraints are weaker on x32, and that might occasionally hurt cache performance.) – Preprandial 15/10, 2012 at 20:7

Keep in mind that pointer size is not the only difference between the two ABIs - x86-64 also has more registers, which can reduce the number of load/store instructions, and quite a few other differences. As a result, there's not really a simple answer to this question, and benchmarking/testing would almost always be the best route to determine which is "better" by whatever definition of "better" is important to that particular project. – Star 15/10, 2012 at 20:49

@twalberg: I think you may have misread the question -- x32 and x86-64 have the same number of registers. I'm not talking about normal 32-bit. – Salaidh 15/10, 2012 at 21:4

possible duplicate of Are 64 bit programs bigger and faster than 32 bit versions? – Downes 15/10, 2012 at 21:31

@JosephGarvin Ah... nevermind... I was thinking of x86-64 running in the legacy 32-bit mode, not running in long mode with self-imposed restrictions... – Star 15/10, 2012 at 21:41

@BenVoigt: That's not the same question, x32 != 32-bit. Someone actually asks about this in the first comment on the first answer there, so I don't think this is covered. – Salaidh 15/10, 2012 at 23:50

Certainly if you have a multithreaded program, the fact that data structures are smaller on x32 might cause cache line fighting between threads -- different objects might get allocated on the same cache line in x32 mode and different cache lines in x86_64 mode. If two threads modify those objects independently the cache ping-ponging could severely slow down the x32 code. Of course, this kind of cache effect could happen regardless of pointer size, but if the code has been tuned assuming 64-bit pointers, going to 32-bit pointers could de-tune things.

Anacreon answered 15/10, 2012 at 20:23 Comment(2)

+1 for being possible, but in practice you should be aligning any data that might be touched by two threads to a cache line. – Salaidh 15/10, 2012 at 21:6

@JosephGarvin: Yes, but the alignment may have been done assuming a particular pointer size. If someone pads stuff so it fills a cache line with 64-bit pointers, changing to 32-bit pointers without updating the padding may be a problem. This is mostly just an issue if you're taking existing, tuned source code and recompiling in x32 mode with no changes. – Anacreon 15/10, 2012 at 22:21
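The padding problem described here can be sketched as follows. The struct names and the 64-byte line size are assumptions made for illustration; the point is only the contrast between padding computed by hand for 8-byte pointers and alignment requested from the compiler.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes */

/* Fragile: the pad length was computed assuming 8-byte pointers. */
struct padded_by_hand {
    void *head;
    void *tail;
    char pad[CACHE_LINE - 2 * 8];
};

/* More robust: ask the compiler for cache-line alignment, so the
 * struct size is rounded up to a full line whatever the pointer size. */
struct padded_by_align {
    alignas(CACHE_LINE) void *head;
    void *tail;
};

/* Holds under any ABI; the hand-padded struct has no such guarantee. */
static_assert(sizeof(struct padded_by_align) % CACHE_LINE == 0,
              "aligned struct must occupy whole cache lines");

size_t by_hand_size(void) { return sizeof(struct padded_by_hand); }
size_t by_align_size(void) { return sizeof(struct padded_by_align); }
```

With 4-byte pointers the hand-padded struct comes out at 56 bytes instead of 64, so adjacent elements of an array of them share cache lines and two threads updating neighbouring elements can ping-pong a line. The `alignas` variant stays at 64 bytes in both modes.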

In X32 the processor is actually executing in "long mode", the same mode as for x86_64. That is, addresses as seen by the processor when doing addressing are still 64 bits; however, the X32 ABI makes sure that all addresses are small enough to fit into 32 bits. As a result, in some cases there is a slight overhead when pointers have to be zero-extended from 32 bits to 64.

Also, needing x86/x86-64/x32 libraries in RAM, which I suppose is what one will end up with in practice (unless you're talking about some embedded or other tightly controlled system rather than a general purpose computer), may eat up some of the benefit of X32.

Manful answered 15/10, 2012 at 21:8 Comment(8)

Aren't the pointers actually sign-extended? And I don't believe there's any performance penalty to a 32-bit load or store instruction in long mode, both sign extension and zero extension are extremely cheap operations handled in hardware during the same cycle (no delay added). – Downes 15/10, 2012 at 21:39

I think embedded and tightly controlled systems is the intended target, so I doubt the library RAM usage issue would crop up. – Salaidh 15/10, 2012 at 23:53

@BenVoigt: They might indeed be sign extended rather than zero, I forget which. And no, there is no penalty for 32-bit load/store in long mode, rather the opposite as rXX register encodings take more space than the 32-bit reg encodings. And yes, sign/zero extensions are very cheap, though they do take up a tiny bit of decoder BW and bloat the code. – Manful 16/10, 2012 at 8:21

@JosephGarvin: So do I, though a lot of the excitement around X32 on the interwebs seems to come from people who are excited by some desktop benchmark potentially going a few percent faster. :-/ – Manful 16/10, 2012 at 8:23

nowadays some macOS is 64-bit only and many Linux distros don't have 32-bit support by default, so it's possible to have x32 only or x32/x86-64 only to save library memory usage – Oira 28/10, 2022 at 3:50

@Oira In principle. In practice X32 is pretty much dead. – Manful 28/10, 2022 at 5:48

GCC targeting x32 often uses a 67h address-size prefix on every instruction with a memory operand, so it doesn't need to sign-extend int / long / intptr_t from 32-bit to 64-bit for use with addressing modes like [rdi + rsi*4]. Instead it uses [edi + esi*4], wrapping after binary addition (so it works for signed negative) before hardware zero-extends to 64-bit. GCC's still somewhat naive about knowing that it can use [rdi + 4] (since the offset is known non-negative) instead of [edi+4], but at least it avoids 67h for RSP stack addresses. – Omentum 28/10, 2022 at 14:0

See gcc.gnu.org/bugzilla/show_bug.cgi?id=82267 - back in 2018, -mx32 would even do movl (%edi), %eax ; movq (%eax), %rax for two derefs of a long long **p. The (%edi) is maybe needed if the calling convention allows high garbage, the 2nd is definitely not. Neither -maddress-mode=long nor -maddress-mode=short are optimal for code-size. – Omentum 28/10, 2022 at 14:3
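To look at the code generation discussed in these comments yourself, something like the following could be compiled to assembly. This is only a sketch: the function names are made up for illustration, and the exact instructions emitted will depend on the GCC version and flags.

```c
#include <stddef.h>

/* Indexed load with a signed index. On x32 the pointer and the index are
 * both 32 bits wide, so the compiler must produce a valid 64-bit address
 * from them, including when idx is negative. */
int load_indexed(const int *base, int idx) {
    return base[idx];
}

/* Two dereferences of a long long **, the case mentioned in the
 * GCC bug report linked above. */
long long deref_twice(long long **p) {
    return **p;
}
```

Comparing the output of `gcc -O2 -mx32 -S`, with and without `-maddress-mode=long`, against `gcc -O2 -m64 -S` shows where address-size prefixes or explicit extensions appear for these two functions.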
