Why do modern compilers prefer SSE over FPU for single floating-point operations
I recently read the assembly of my compiled code and found that many floating-point operations are performed using XMM registers and SSE instructions. For example, the following code:

float square(float a) {
    float b = a + (a * a);
    return b;
} 

will be compiled into

push    rbp
mov     rbp, rsp
movss   DWORD PTR [rbp-20], xmm0
movss   xmm0, DWORD PTR [rbp-20]
mulss   xmm0, xmm0
movss   xmm1, DWORD PTR [rbp-20]
addss   xmm0, xmm1
movss   DWORD PTR [rbp-4], xmm0
movss   xmm0, DWORD PTR [rbp-4]
pop     rbp
ret

and the result is similar for other compilers. https://godbolt.org/z/G988PGo6j

And with -O3 flag

movaps  xmm1, xmm0
mulss   xmm0, xmm0
addss   xmm0, xmm1
ret
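For comparison, the optimized output above maps directly onto scalar SSE intrinsics: `mulss` is `_mm_mul_ss` and `addss` is `_mm_add_ss`. A minimal sketch of the same computation (the name `square_sse` is mine, not from the original post; requires an x86 target):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Same computation as square(), spelled with the scalar SSE
 * intrinsics corresponding to the mulss/addss instructions above. */
float square_sse(float a) {
    __m128 v = _mm_set_ss(a);                    /* place a in the low lane */
    __m128 r = _mm_add_ss(v, _mm_mul_ss(v, v));  /* mulss, then addss */
    return _mm_cvtss_f32(r);                     /* extract the low lane */
}
```

Note that only the low 32-bit lane of the 128-bit register is used: scalar SSE math simply ignores the upper lanes, which is why the compiler can use the SIMD register file for ordinary `float` arithmetic.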

Does this mean operations using SIMD registers and instructions are usually faster than using normal registers and the FPU?

Also, I'm curious about specific cases where the compiler's decision to use SSE might fail.

Deposit answered 11/9, 2024 at 11:52 Comment(2)
Not going to change your question, but you should probably enable optimisations (aka -O3) when looking at compiler-generated code, as compilers can sometimes do weird things when optimisations are not enabled.Kor
With the exception of divide, all the binary mathematical operations essentially have single-cycle throughput (with varying latency) on SSE/AVX/AVX2 architectures. I'm not sure how many cycles they take on x87 these days, but x87 code tends to be glacially slow by comparison with SSE or higher SIMD instructions (even for a scalar). This is especially true of sin, cos, exp, tan, pow, sqrt, where SSE code is way faster even though only sqrt has hardware support. Unless you really need 80-bit FP, x87 code is best avoided.Rasla

SSE was developed as a replacement for the x87 FPU because the x87's stack-based design is idiosyncratic and hard to generate code for. The main issues are:

  • code generation for stack-based processors like the x87 FPU is not as well understood as for register-based processors, making the code generated by many compilers full of inefficient extra fxch, fld, and fst(p) instructions. This is much easier to get right with a register-based architecture like SSE.
  • SSE supports SIMD operation, greatly speeding up floating-point operations on arrays when used; x87 supports no such thing.
  • on the x87 FPU, the data type used is Intel's 80-bit floating-point format. The precision of floating-point operations is configured through a control register and is expensive to change. Therefore, compilers generating code for the x87 unit run all computations at full precision, even when the programmer only called for single precision. This both changes the result slightly and reduces performance, as higher-precision operations may take longer to complete. Additionally, each load and store from/to the x87 unit involves an implicit data-type conversion. The SSE unit, on the other hand, encodes the precision in the instruction used, allowing the compiler to use exactly the precision the programmer called for.
  • recently, CPU manufacturers have reduced investment in improving the x87 FPU (or even rolled back existing improvements, such as fxch being handled by register renaming), leading to a widening performance gap between x87 and SSE.

I recommend using the x87 FPU only if code size is an issue or if you require the 80-bit floating-point format. Otherwise, stick with SSE or (on recent processors) AVX.

Lalita answered 11/9, 2024 at 12:19 Comment(7)
Also, x86-64 has 16 SSE registers but still only 8 x87 registers, so there is less room to hold constants and variables, even in cases where a compiler could figure out how to usefully use that many registers without lots of fxch.Outbrave
The last bullet point that x87 performance is worsening on recent CPUs is a consequence of the OP's observation, not a reason: making x87 worse doesn't hurt modern programs (unless they use 80-bit long double). This decision by CPU architects came years after compilers had already started to favour SSE for scalar math even in 32-bit code. (And many more programs are built as 64-bit even for Windows; x86-64 compilers have always defaulted to SSE for scalar math.)Outbrave
@PeterCordes It is a consequence of the general trend, but also a reason to follow it.Lalita
Yeah, definitely worth mentioning, and good point about it being relevant for anyone writing asm by hand or creating their own code-gen tools like compilers or JITs now.Outbrave
@PeterCordes and in AVX-512/AVX-10 you have 32 SSE/AVX registers, far more than in x87Checkbook
@phuclv: True, so that's another factor like fuz was talking about that further tilts the balance in favour of XMM for scalar. 16 XMM regs are baseline for x86-64, being another significant factor in choice of the standard way to do scalar FP math when that ISA was new, baked into calling conventions and early compiler defaults.Outbrave
For the reasons stated, x87 numerical computations are unreliable if you want bit-exact reproducibility. Different compilers (and versions) might utilize the register stack differently and thus insert conversions to 64-bit floats for temporary memory storage at different places, changing intermediary values, etc. #15147674Scranton
