How do Java runtimes targeting pre-SSE2 processors implement the basic floating-point operations?

How does (or did) a Java runtime targeting an Intel processor without SSE2 deal with floating-point denormals when strictfp is set?

Even when the 387 FPU is set for 53-bit precision, it keeps an oversized exponent range that:

  1. forces underflow/overflow to be detected at each intermediate result, and
  2. makes it difficult to avoid double-rounding of denormals.

Strategies include re-computing, with emulated floating-point, any operation that produced a denormal result, or applying a permanent exponent offset along the lines of this technique to equip OCaml with 63-bit floats, borrowing a bit from the exponent in order to avoid double rounding.

In any case, I see no way to avoid at least one conditional branch per floating-point operation, unless the operation can be statically determined not to underflow/overflow. How the exceptional (overflow/underflow) cases are dealt with is part of my question, but it cannot be separated from the question of the representation (the permanent-exponent-offset strategy seems to mean that only overflows need to be checked for, for instance).
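To make the per-operation branch concrete, here is a rough Java sketch of the “recompute with emulated floating-point” strategy: do the fast operation, then fall back to a slow software path only when the result leaves the range where the fast path can be trusted. It is purely illustrative, not what any JVM actually does, and the fallback assumes that BigDecimal.doubleValue() performs a correctly rounded conversion.

import java.math.BigDecimal;

public strictfp class RescueMul {
 // Fast multiply, with a fallback that recomputes the exact product in
 // software when the hardware result underflows or overflows. When the
 // 53-bit-rounded result is a finite normal double, the oversized exponent
 // range cannot have changed it, so only the two extreme cases need the
 // slow path.
 static double mul(double a, double b) {
  double fast = a * b;
  if (Double.isNaN(fast) || Double.isInfinite(a) || Double.isInfinite(b)) {
   return fast;                 // NaN or infinite inputs: nothing to rescue
  }
  double mag = Math.abs(fast);
  if (mag >= Double.MIN_NORMAL && mag <= Double.MAX_VALUE) {
   return fast;                 // finite, normal: already correctly rounded
  }
  // Underflow (subnormal or zero, sign of zero glossed over) or overflow:
  // form the exact product (every finite double is an exact BigDecimal)
  // and round it once.
  return new BigDecimal(a).multiply(new BigDecimal(b)).doubleValue();
 }

 public static void main(String[] args) {
  System.out.println(Double.toHexString(mul(0x1.8p-1000, 0x1.8p-60)));
 }
}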

Towery answered 28/8, 2013 at 19:24 Comment(3)
@ChrisJester-Young Thanks for helping make the question clearer.Towery
I don't know the answer to your question. If you have such a machine, though, you can pass the flags -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly to see what code it generates.Watterson
@Watterson I don't even know if I have a Java runtime installed. I only have a perverse interest in the emulation of double precision with the 387. Also, someone once gave me a reference to a graduate student's thesis on this very topic, which I forgot to archive and now cannot find again. It was not you, was it?Towery

It looks to me, from a very trivial test case, like the JVM round-trips every double computation through memory to get the rounding it wants. It also seems to do something weird with a couple of magic constants. Here's what it did for me for a simple "compute 2^n naively" program:

0xb1e444b0: fld1
0xb1e444b2: jmp    0xb1e444dd         ;*iload
                                      ; - fptest::calc@9 (line 6)
0xb1e444b7: nop
0xb1e444b8: fldt   0xb523a2c8         ;   {external_word}
0xb1e444be: fmulp  %st,%st(1)
0xb1e444c0: fmull  0xb1e44490         ;   {section_word}
0xb1e444c6: fldt   0xb523a2bc         ;   {external_word}
0xb1e444cc: fmulp  %st,%st(1)
0xb1e444ce: fstpl  0x10(%esp)
0xb1e444d2: inc    %esi               ; OopMap{off=51}
                                      ;*goto
                                      ; - fptest::calc@22 (line 6)
0xb1e444d3: test   %eax,0xb3f8d100    ;   {poll}
0xb1e444d9: fldl   0x10(%esp)         ;*goto
                                      ; - fptest::calc@22 (line 6)
0xb1e444dd: cmp    %ecx,%esi
0xb1e444df: jl     0xb1e444b8         ;*if_icmpge
                                      ; - fptest::calc@12 (line 6)

I believe 0xb523a2c8 and 0xb523a2bc are _fpu_subnormal_bias1 and _fpu_subnormal_bias2 from the hotspot source code. _fpu_subnormal_bias1 looks to be 0x03ff8000000000000000 and _fpu_subnormal_bias2 looks to be 0x7bff8000000000000000. _fpu_subnormal_bias1 has the effect of scaling the smallest normal double to the smallest normal long double; if the FPU rounds to 53 bits, the "right thing" will happen.
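Decoding those 80-bit hex patterns under the standard x87 extended layout (1 sign bit, 15 exponent bits biased by 16383, 64-bit significand with an explicit integer bit) supports that reading:

0x03ff8000000000000000: sign 0, exponent 0x03ff = 1023,  significand 1.0  =>  2^(1023-16383)  = 2^-15360
0x7bff8000000000000000: sign 0, exponent 0x7bff = 31743, significand 1.0  =>  2^(31743-16383) = 2^+15360

and indeed 2^-1022 (the smallest normal double) times 2^-15360 is 2^-16382, the smallest normal long double. The two constants are reciprocals, so multiplying by both is a no-op except for where the rounding happens in between.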

I'd speculate that the seemingly-pointless test instruction is there so that the thread can be interrupted by marking that page unreadable in the event that a GC is necessary.

Here's the Java code:

import java.io.*;
public strictfp class fptest {
 public static double calc(int k) {
  double a = 2.0;
  double b = 1.0;
  for (int i = 0; i < k; i++) {
   b *= a;
  }
  return b;
 }
 public static double intest() {
  double d = 0;
  for (int i = 0; i < 4100; i++) d += calc(i);
  return d;
 }
 public static void main(String[] args) throws Exception {
  for (int i = 0; i < 100; i++)
   System.out.println(intest());
 }
}
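For what it's worth, a listing like the one above can be reproduced by compiling this class and running it with the diagnostic flags mentioned in the comments (the hsdis disassembler plugin has to be installed for -XX:+PrintAssembly to produce output):

javac fptest.java
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly fptest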

Digging further, the code for these operations is in plain sight in the OpenJDK sources, in hotspot/src/cpu/x86/vm/x86_32.ad. Relevant snippets:

instruct strictfp_mulD_reg(regDPR1 dst, regnotDPR1 src) %{
  predicate( UseSSE<=1 && Compile::current()->has_method() && Compile::current()
->method()->is_strict() );
  match(Set dst (MulD dst src));
  ins_cost(1);   // Select this instruction for all strict FP double multiplies

  format %{ "FLD    StubRoutines::_fpu_subnormal_bias1\n\t"
            "DMULp  $dst,ST\n\t"
            "FLD    $src\n\t"
            "DMULp  $dst,ST\n\t"
            "FLD    StubRoutines::_fpu_subnormal_bias2\n\t"
            "DMULp  $dst,ST\n\t" %}
  opcode(0xDE, 0x1); /* DE C8+i or DE /1*/
  ins_encode( strictfp_bias1(dst),
              Push_Reg_D(src),
              OpcP, RegOpc(dst),
              strictfp_bias2(dst) );
  ins_pipe( fpu_reg_reg );
%}
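With the bias values decoded above, the emitted sequence for a strict multiply amounts to the following (pseudocode; every step is rounded according to the FPU's 53-bit precision setting):

dst = dst * 2^-15360   // bias1: a result that would be a double subnormal now
                       // lands in the long-double subnormal range
dst = dst * src        // the actual multiply
dst = dst * 2^+15360   // bias2: undoes the scaling; the store to a 64-bit
                       // double that follows (fstpl in the disassembly above)
                       // handles the final rounding and any overflow to infinity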

instruct strictfp_divD_reg(regDPR1 dst, regnotDPR1 src) %{
  predicate (UseSSE<=1);
  match(Set dst (DivD dst src));
  predicate( UseSSE<=1 && Compile::current()->has_method() && Compile::current()
->method()->is_strict() );
  ins_cost(01);

  format %{ "FLD    StubRoutines::_fpu_subnormal_bias1\n\t"
            "DMULp  $dst,ST\n\t"
            "FLD    $src\n\t"
            "FDIVp  $dst,ST\n\t"
            "FLD    StubRoutines::_fpu_subnormal_bias2\n\t"
            "DMULp  $dst,ST\n\t" %}
  opcode(0xDE, 0x7); /* DE F8+i or DE /7*/
  ins_encode( strictfp_bias1(dst),
              Push_Reg_D(src),
              OpcP, RegOpc(dst),
              strictfp_bias2(dst) );
  ins_pipe( fpu_reg_reg );
%}

I see nothing for addition and subtraction, but I'd bet they just do an add/subtract with the FPU in 53-bit mode and then round-trip the result through memory. I'm a little curious whether there's a tricky overflow case that they get wrong, but I'm not curious enough to find out.

Watterson answered 28/8, 2013 at 21:23 Comment(8)
Seeing the code, without fully understanding it yet, suggests a kind of variation of the “exponent offset” method. The code must be doing something a bit like this: 1) multiply one of the arguments by 2^-K1 so that the result of the extended-exponent multiplication is a denormal with the same number of effective digits as the result would normally have with the standard exponent; 2) do the product; 3) multiply the result by 2^K2 so that the result is either exact, or +inf (exactly when the product should overflow with the standard exponent); 4) multiply by 2^(K1-K2). Now, where is the third constant?Towery
Wait, in your method calc one of the operands of the multiplication is rather obviously constant. This one can be pre-biased in a way such that only two multiplications remain to ensure correct results for underflows and overflows.Towery
Yes, but the JVM doesn't do that. I think what's going on is: (1) the FPU is in 53-bit mode; (2) one weird constant scales double subnormals down into the long-double subnormal range and the other scales them back, thus handling underflow; (3) the round trip through a double in memory takes care of exponent overflow. So I guess you don't get double rounding for multiplication.Watterson
Avoiding double rounding is necessary. The objective is to make the value of a floating point expression independent of software or hardware implementation. Doing double rounding in some cases on some machines would defeat that purpose.Undue
Ah, yes, the multiplication by 2^K2 in my scheme can be replaced by a round-trip through memory. And maybe it is for the best, because 2^K2 is just a bit too large to be represented as an 80-bit float: two actual floating-point multiplications would have been necessary to implement the “multiply by 2^K2” step.Towery
@PatriciaShanahan: Yes, but the question is whether Java actually bothers to get it right. My money would have been on "no" before I started digging.Watterson
“I'm a little curious whether there's a tricky overflow case that they get wrong”: That's easy. If long_dbl1 and long_dbl2 are representable as doubles, then they are multiples of 2^-1074, hence long_dbl1 + long_dbl2 is a multiple of 2^-1074, hence a round-trip through memory is enough to convert it to the nearest double (where long_dbl? represents a value with 53-bit significand and extended exponent).Towery
@PascalCuoq: At the other end of the spectrum, though, you've got 0x1.fffffffffffff8p+1023 and larger rounding to infinity. (You're right that the round-trip takes care of it, though, as long as the FPU is in 53-bit mode. I wasn't thinking too clearly when I wrote that.)Watterson
