FLD instruction x64 bit

Asked 3/4, 2013 at 11:42 Answered 4/9, 2018 at 4:20

I have a little problem with the FLD instruction in x64. I want to load a Double value onto the FPU stack, into the st0 register, but it seems to be impossible. In 32-bit Delphi, I can use this code:

function DoSomething(X:Double):Double;
asm

  FLD    X
   // Do Something ..
  FST Result

end;

Unfortunately, in x64, the same code does not work.

Dalpe answered 3/4, 2013 at 11:42 Comment(3)

Define "does not work". Does it crash? Does it not compile? Does it not return the expected result? – Lamelliform 3/4, 2013 at 11:53

Did you read about Win64 compatibility in the Delphi help? It says that there is no 10-byte Extended type in Win64, which shows that Delphi Win64 does not use the x87 FPU; it uses SSE instead. Thus using FPU instructions is problematic. Also be careful when using BASM x64: there are bugs that destroy data or even invert program control flow. – Rockribbed 3/4, 2013 at 12:55

in x86_64 the FPU shouldn't be used unless you need extended precision. SSE is faster and more consistent in its results – Unamuno 4/9, 2018 at 6:11

In x64 mode floating point parameters are passed in xmm-registers. So when Delphi tries to compile FLD X, it becomes FLD xmm0 but there is no such instruction. You first need to move it to memory.

The same goes with the result, it should be passed back in xmm0.

Try this (not tested):

function DoSomething(X:Double):Double;
var
  Temp : double;
asm
  MOVQ qword ptr Temp,X
  FLD Temp
  //do something
  FSTP Temp   // store and pop, so the x87 stack is left empty
  MOVQ xmm0,qword ptr Temp
end;

Disband answered 3/4, 2013 at 12:20 Comment(3)

> So when Delphi tries to compile FLD X, it becomes FLD XMM0 ... What about FLD Result? Why does the compiler accept loading Result? Is this a bug? – Dalpe 3/4, 2013 at 13:41

@Dalpe: It turns out that when you do "FST Result", BASM allocates temporary storage on the stack for the result and then adds an extra instruction at the end to load xmm0 with this value. I did not know that. See for yourself in the disassembly view in the debugger. – Disband 3/4, 2013 at 16:8

Is this a bug? No. On x64 use SSE and not x87. But you should stop doing asm and let the compiler do the work. – Mezoff 3/4, 2013 at 17:13

Delphi follows the Microsoft x64 calling convention, so float/double arguments of a function or procedure are passed in the XMM0L, XMM1L, XMM2L, and XMM3L registers.

But as a workaround you can declare the parameter as var, so that it is passed by reference:

function DoSomething(var X:Double):Double;
asm
  FLD  qword ptr [X]
  // Do Something ..
  FSTP Result  // store and pop, so the x87 stack is left empty
end;

Macario answered 3/4, 2013 at 14:4 Comment(2)

Nice workaround. Limitation though that you cannot pass constant literals such as DoSomething(1.0) or variables declared as Single. – Disband 3/4, 2013 at 16:6

@Ville Krumlinde: Indeed, if you need to call the function with a constant parameter, then first declare the constant in a const section. :) – Macario 3/4, 2013 at 16:27

You don't need to use legacy x87 stack registers in x86-64 code, because SSE2 is baseline, a required part of the x86-64 ISA. You can and should do your scalar FP math using addsd, mulsd, sqrtsd and so on, on XMM registers. (Or addss for float)

The Windows x64 calling convention passes float/double FP args in XMM0..3, if they're one of the first four args to the function. (i.e. the 3rd total arg goes in xmm2 if it's FP, rather than the 3rd FP arg going in xmm2.) It returns FP values in XMM0.
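As a rough illustration of the slot rule above, here is a minimal sketch for Delphi's Win64 compiler. The function name MulSecondThird and the USE_WIN64_ASM define are made up for this example, and the plain Pascal fallback is only there so the snippet also builds on other targets. Because the integer A occupies the first slot (RCX), B arrives in XMM1 and C in XMM2, not in XMM0 and XMM1.

```pascal
program SlotDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical switch: use the asm body only with Delphi's Win64 compiler.
{$IFDEF WIN64}
  {$IFNDEF FPC}
    {$DEFINE USE_WIN64_ASM}
  {$ENDIF}
{$ENDIF}

// A takes the 1st slot (RCX), so B is in XMM1 and C is in XMM2.
function MulSecondThird(A: Integer; B, C: Double): Double;
{$IFDEF USE_WIN64_ASM}
asm
  .NOFRAME
  movapd xmm0, xmm1      // copy B into the return register
  mulsd  xmm0, xmm2      // low 8 bytes of XMM0 := B * C
end;
{$ELSE}
begin
  Result := B * C;       // portable fallback, same result
end;
{$ENDIF}

begin
  WriteLn(MulSecondThird(7, 1.5, 4.0):0:1);  // prints 6.0
end.
```

The value is returned in XMM0, so no store to memory and no x87 instruction is involved.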

Only use x87 if you actually need 80-bit precision inside your function. (Instructions like fsin and fyl2x are not fast, and can usually be done just as well by normal math libraries using SSE/SSE2 instructions.)

function times2(X:Double):Double;
asm
    addsd  xmm0, xmm0       // upper 8 bytes of XMM0 are ignored
    ret
end;

Storing to memory and reloading into an x87 register costs you about 10 cycles of latency for no benefit. SSE/SSE2 scalar instructions are just as fast as, or faster than, their x87 equivalents, and they are easier to program for and optimize because you never need fxch: it's a flat register design instead of a stack-based one (see https://agner.org/optimize/). Also, you have 16 XMM registers (xmm0..xmm15).

Of course, you usually don't need inline asm at all. It could be useful for manually-vectorizing if the compiler doesn't do that for you.
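As a minimal sketch of the no-asm route (the function name Hypot2 is made up for this example): write the math in plain Pascal and let the compiler choose the instructions. On Win64 the Delphi compiler is expected to use SSE2 scalar instructions for Double arithmetic like this; the disassembly view in the debugger shows what it actually generates.

```pascal
program NoAsmDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Plain Pascal: no FLD/FST, no manual register handling.
function Hypot2(X, Y: Double): Double;
begin
  Result := Sqrt(X * X + Y * Y);
end;

begin
  WriteLn(Hypot2(3.0, 4.0):0:1);  // prints 5.0
end.
```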

Unterwalden answered 4/9, 2018 at 4:20 Comment(4)

The precision of the "legacy" one is 80 bits, while the precision of the "modern" one is 32 or 64 bits depending on the instruction. You will lose precision during calculations, which is OK for games but not for certain applications. – Speaking 7/9, 2021 at 20:56

@rxantos: Lose precision compared to what, though? x86-64 compilers already use SSE2 for math on float/double, so that's the standard to compare against. Also, some 32-bit compilers (notably MSVC) set the x87 unit to 64-bit precision (53-bit mantissa) to be closer to C FLT_EVAL_METHOD=1 semantics, so that extra precision wasn't there anyway if you were using that implementation. – Unterwalden 7/9, 2021 at 21:5

@rxantos: double is enough precision for most scientific computing, and with careful numerical design for some problems you can use 32-bit float to get 2x the work done per SIMD instruction. If you really care about FP rounding errors, you can do things like Kahan summation to compensate for error while summing an array, or pairwise summation. (Unrolling with multiple SIMD accumulators is a step in that direction, typically reducing rounding error.) – Unterwalden 7/9, 2021 at 21:8

@rxantos: but yes, if you could get it for free (in terms of performance), 80-bit temporary precision is nice for many computations where you aren't intentionally compensating for it, and where the double-rounding problem (to 80-bit and then later to 64-bit) doesn't outweigh the benefits. See also randomascii.wordpress.com/2012/03/21/… re: intermediate precision especially in MSVC, also Did any compiler fully use Intel x87 80-bit floating point? – Unterwalden 7/9, 2021 at 21:10
