How do Java runtimes targeting pre-SSE2 processors implement the basic floating-point operations?

How does (or did) a Java runtime targeting an Intel processor without SSE2 deal with floating-point denormals when strictfp is set?

Even when the 387 FPU is set for 53-bit precision, it keeps an oversized exponent range that:

  1. forces underflow/overflow to be detected at each intermediate result, and
  2. makes it difficult to avoid double-rounding of denormals.

Strategies include re-computing, with emulated floating-point, any operation that produced a denormal result, or applying a permanent exponent offset along the lines of this technique to equip OCaml with 63-bit floats, borrowing a bit from the exponent in order to avoid double rounding.

In any case, I see no way to avoid at least one conditional branch per floating-point operation, unless the operation can be statically determined not to underflow/overflow. How the exceptional (overflow/underflow) cases are dealt with is part of my question, but it cannot be separated from the question of the representation (the permanent-exponent-offset strategy seems to mean that only overflows need to be checked for, for instance).
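To make the per-operation branch concrete, here is a rough Java sketch of the “recompute with emulated floating-point” strategy: do the fast operation, then fall back to a slow software path only when the result leaves the range where the fast path can be trusted. It is purely illustrative, not what any JVM actually does, and the fallback assumes that BigDecimal.doubleValue() performs a correctly rounded conversion.

import java.math.BigDecimal;

public strictfp class RescueMul {
 // Fast multiply, with a fallback that recomputes the exact product in
 // software when the hardware result underflows or overflows. When the
 // 53-bit-rounded result is a finite normal double, the oversized exponent
 // range cannot have changed it, so only the two extreme cases need the
 // slow path.
 static double mul(double a, double b) {
  double fast = a * b;
  if (Double.isNaN(fast) || Double.isInfinite(a) || Double.isInfinite(b)) {
   return fast;                 // NaN or infinite inputs: nothing to rescue
  }
  double mag = Math.abs(fast);
  if (mag >= Double.MIN_NORMAL && mag <= Double.MAX_VALUE) {
   return fast;                 // finite, normal: already correctly rounded
  }
  // Underflow (subnormal or zero, sign of zero glossed over) or overflow:
  // form the exact product (every finite double is an exact BigDecimal)
  // and round it once.
  return new BigDecimal(a).multiply(new BigDecimal(b)).doubleValue();
 }

 public static void main(String[] args) {
  System.out.println(Double.toHexString(mul(0x1.8p-1000, 0x1.8p-60)));
 }
}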

Towery answered 28/8, 2013 at 19:24 Comment(3)
@ChrisJester-Young Thanks for helping make the question clearer.Towery
I don't know the answer to your question. If you have such a machine, though, you can pass the flags -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly to see what code it generates.Watterson
@Watterson I don't even know if I have a Java runtime installed. I only have a perverse interest in the emulation of double precision with the 387. Also, someone once gave me a reference to a graduate student's thesis on this very topic, which I forgot to archive and now cannot find again. It was not you, was it?Towery

It looks to me, from a very trivial test case, like the JVM round-trips every double computation through memory to get the rounding it wants. It also seems to do something weird with a couple of magic constants. Here's what it did for me for a simple "compute 2^n naively" program:

0xb1e444b0: fld1
0xb1e444b2: jmp    0xb1e444dd         ;*iload
                                      ; - fptest::calc@9 (line 6)
0xb1e444b7: nop
0xb1e444b8: fldt   0xb523a2c8         ;   {external_word}
0xb1e444be: fmulp  %st,%st(1)
0xb1e444c0: fmull  0xb1e44490         ;   {section_word}
0xb1e444c6: fldt   0xb523a2bc         ;   {external_word}
0xb1e444cc: fmulp  %st,%st(1)
0xb1e444ce: fstpl  0x10(%esp)
0xb1e444d2: inc    %esi               ; OopMap{off=51}
                                      ;*goto
                                      ; - fptest::calc@22 (line 6)
0xb1e444d3: test   %eax,0xb3f8d100    ;   {poll}
0xb1e444d9: fldl   0x10(%esp)         ;*goto
                                      ; - fptest::calc@22 (line 6)
0xb1e444dd: cmp    %ecx,%esi
0xb1e444df: jl     0xb1e444b8         ;*if_icmpge
                                      ; - fptest::calc@12 (line 6)

I believe 0xb523a2c8 and 0xb523a2bc are _fpu_subnormal_bias1 and _fpu_subnormal_bias2 from the hotspot source code. _fpu_subnormal_bias1 looks to be 0x03ff8000000000000000 and _fpu_subnormal_bias2 looks to be 0x7bff8000000000000000. _fpu_subnormal_bias1 has the effect of scaling the smallest normal double to the smallest normal long double; if the FPU rounds to 53 bits, the "right thing" will happen.
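Decoding those 80-bit hex patterns under the standard x87 extended layout (1 sign bit, 15 exponent bits biased by 16383, 64-bit significand with an explicit integer bit) supports that reading:

0x03ff8000000000000000: sign 0, exponent 0x03ff = 1023,  significand 1.0  =>  2^(1023-16383)  = 2^-15360
0x7bff8000000000000000: sign 0, exponent 0x7bff = 31743, significand 1.0  =>  2^(31743-16383) = 2^+15360

and indeed 2^-1022 (the smallest normal double) times 2^-15360 is 2^-16382, the smallest normal long double. The two constants are reciprocals, so multiplying by both is a no-op except for where the rounding happens in between.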

I'd speculate that the seemingly-pointless test instruction is there so that the thread can be interrupted by marking that page unreadable in the event that a GC is necessary.

Here's the Java code:

import java.io.*;
public strictfp class fptest {
 public static double calc(int k) {
  double a = 2.0;
  double b = 1.0;
  for (int i = 0; i < k; i++) {
   b *= a;
  }
  return b;
 }
 public static double intest() {
  double d = 0;
  for (int i = 0; i < 4100; i++) d += calc(i);
  return d;
 }
 public static void main(String[] args) throws Exception {
  for (int i = 0; i < 100; i++)
   System.out.println(intest());
 }
}
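For what it's worth, a listing like the one above can be reproduced by compiling this class and running it with the diagnostic flags mentioned in the comments (the hsdis disassembler plugin has to be installed for -XX:+PrintAssembly to produce output):

javac fptest.java
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly fptest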

Digging further, the code for these operations is in plain sight in the OpenJDK sources, in hotspot/src/cpu/x86/vm/x86_32.ad. Relevant snippets:

instruct strictfp_mulD_reg(regDPR1 dst, regnotDPR1 src) %{
  predicate( UseSSE<=1 && Compile::current()->has_method() && Compile::current()
->method()->is_strict() );
  match(Set dst (MulD dst src));
  ins_cost(1);   // Select this instruction for all strict FP double multiplies

  format %{ "FLD    StubRoutines::_fpu_subnormal_bias1\n\t"
            "DMULp  $dst,ST\n\t"
            "FLD    $src\n\t"
            "DMULp  $dst,ST\n\t"
            "FLD    StubRoutines::_fpu_subnormal_bias2\n\t"
            "DMULp  $dst,ST\n\t" %}
  opcode(0xDE, 0x1); /* DE C8+i or DE /1*/
  ins_encode( strictfp_bias1(dst),
              Push_Reg_D(src),
              OpcP, RegOpc(dst),
              strictfp_bias2(dst) );
  ins_pipe( fpu_reg_reg );
%}
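With the bias values decoded above, the emitted sequence for a strict multiply amounts to the following (pseudocode; every step is rounded according to the FPU's 53-bit precision setting):

dst = dst * 2^-15360   // bias1: a result that would be a double subnormal now
                       // lands in the long-double subnormal range
dst = dst * src        // the actual multiply
dst = dst * 2^+15360   // bias2: undoes the scaling; the store to a 64-bit
                       // double that follows (fstpl in the disassembly above)
                       // handles the final rounding and any overflow to infinity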

instruct strictfp_divD_reg(regDPR1 dst, regnotDPR1 src) %{
  predicate (UseSSE<=1);
  match(Set dst (DivD dst src));
  predicate( UseSSE<=1 && Compile::current()->has_method() && Compile::current()
->method()->is_strict() );
  ins_cost(01);

  format %{ "FLD    StubRoutines::_fpu_subnormal_bias1\n\t"
            "DMULp  $dst,ST\n\t"
            "FLD    $src\n\t"
            "FDIVp  $dst,ST\n\t"
            "FLD    StubRoutines::_fpu_subnormal_bias2\n\t"
            "DMULp  $dst,ST\n\t" %}
  opcode(0xDE, 0x7); /* DE F8+i or DE /7*/
  ins_encode( strictfp_bias1(dst),
              Push_Reg_D(src),
              OpcP, RegOpc(dst),
              strictfp_bias2(dst) );
  ins_pipe( fpu_reg_reg );
%}

I see nothing for addition and subtraction, but I'd bet they just do an add/subtract with the FPU in 53-bit mode and then round-trip the result through memory. I'm a little curious whether there's a tricky overflow case that they get wrong, but I'm not curious enough to find out.

Watterson answered 28/8, 2013 at 21:23 Comment(8)
Seeing the code, without fully understanding it yet, suggests a kind of variation of the “exponent offset” method. The code must be doing something a bit like this: 1) multiply one of the arguments by 2^-K1 so that the result of the extended-exponent multiplication is a denormal with the same number of effective digits as the result would normally have with the standard exponent; 2) do the product; 3) multiply the result by 2^K2 so that the result is either exact, or +inf (exactly when the product should overflow with the standard exponent); 4) multiply by 2^(K1-K2). Now, where is the third constant?Towery
Wait, in your method calc one of the operands of the multiplication is rather obviously constant. This one can be pre-biased in a way such that only two multiplications remain to ensure correct results for underflows and overflows.Towery
Yes, but the JVM doesn't do that. I think what's going on is: (1) the FPU is in 53-bit mode; (2) one weird constant scales double subnormals down into the long-double subnormal range and the other scales them back, thus handling underflow; (3) the round trip through a double in memory takes care of exponent overflow. So I guess you don't get double rounding for multiplication.Watterson
Avoiding double rounding is necessary. The objective is to make the value of a floating point expression independent of software or hardware implementation. Doing double rounding in some cases on some machines would defeat that purpose.Undue
Ah, yes, the multiplication by 2^K2 in my scheme can be replaced by a round-trip through memory. And maybe it is for the best, because 2^K2 is just a bit too large to be represented as an 80-bit float: two actual floating-point multiplications would have been necessary to implement the “multiply by 2^K2” step.Towery
@PatriciaShanahan: Yes, but the question is whether Java actually bothers to get it right. My money would have been on "no" before I started digging.Watterson
“I'm a little curious whether there's a tricky overflow case that they get wrong”: That's easy. If long_dbl1 and long_dbl2 are representable as doubles, then they are multiples of 2^-1074, hence long_dbl1 + long_dbl2 is a multiple of 2^-1074, hence a round-trip through memory is enough to convert it to the nearest double (where long_dbl? represents a value with 53-bit significand and extended exponent).Towery
@PascalCuoq: At the other end of the spectrum, though, you've got 0x1.fffffffffffff8p+1023 and larger rounding to infinity. (You're right that the round-trip takes care of it, though, as long as the FPU is in 53-bit mode. I wasn't thinking too clearly when I wrote that.)Watterson
