Is it possible to get 0 by subtracting two unequal floating point numbers?

Asked 12/2, 2015 at 9:55 Answered 18/2, 2015 at 18:15

Solved floating-point double floating-accuracy ieee-754

132

Is it possible to get division by 0 (or infinity) in the following example?

public double calculation(double a, double b)
{
     if (a == b)
     {
         return 0;
     }
     else
     {
         return 2 / (a - b);
     }
}

In normal cases it will not, of course. But what if a and b are very close, can (a-b) result in being 0 due to precision of the calculation?

Note that this question is for Java, but I think it will apply to most programming languages.

Tirewoman answered 12/2, 2015 at 9:55 Comment(18)

You have all the code.... Have you tried running it? – Gilding 12/2, 2015 at 9:57

I would have to try all combinations of doubles, that will take a while :) – Tirewoman 12/2, 2015 at 10:1

Can you not use java.lang.Double.MIN_VALUE to compare how small the difference is and force the result? Or even java.lang.Double.NEGATIVE_INFINITY? – Kenzi 12/2, 2015 at 10:2

@Tirewoman sounds like a time to use JUnit Testing to me! – Gilding 12/2, 2015 at 10:2

System.out.println(Math.sqrt(4) == 2.0000001f); prints true. and System.out.println(Math.sqrt(4) == 2.000001f); prints false. So you can get zero by subtructing two unequal floating point. – Burgrave 12/2, 2015 at 10:4

@bluebrain, my guess is that your your literal number 2.000 etc contains to many decimals to be represented by a float. So the last ones will not be represented by the actual used number in the comparison. – Tirewoman 12/2, 2015 at 10:8

@Tirewoman probably. 'you can't really guarantee that the number you assign to the float or double is exact' – Burgrave 12/2, 2015 at 10:11

Just note that returning 0 in that case can lead to hard-to-debug ambiguity, so make sure you really want to return 0 instead of throwing an exception or returning a NaN. – Amylo 12/2, 2015 at 14:45

@m0skit0, good point, the above is a simplified situation from the case I encountered. – Tirewoman 12/2, 2015 at 14:57

@bluebrain Except you are (with IEEE 754 at least). For an IEEE 754 floating point number with x bits for the mantissa (precision), you're guaranteed at least floor(log10(2^x)) decimal digits of precision. – Centroclinal 12/2, 2015 at 18:57

@ColeJohnson: Unless the float is denormal – Orvas 13/2, 2015 at 0:44

@MooingDuck Which have almost no purpose. – Centroclinal 13/2, 2015 at 1:2

I think this can error: A = +INF, B = -INF. Also, what about A = -double.max_value and B large enough not to be lost in the roundoff? – Coan 13/2, 2015 at 5:12

@bluebrain I don't understand what your println-sqrt- example is trying to show. There is no subtraction in there. – Currant 13/2, 2015 at 13:36

@bluebrain the guarantees about floats are very exact indeed. Your statement should be "you can't predict what will happen if you don't learn the rules." – Alumroot 14/2, 2015 at 6:0

have fun: docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html – Carminacarminative 14/2, 2015 at 9:46

Note: +0.0 - -0.0 is +0.0 and -0.0 == +0.0 though they are not the same number and print differently. – Redbreast 18/2, 2015 at 7:1

if you want to be sure, check (a-b) == 0 instead a==b – Chef 18/2, 2015 at 11:53

132

In Java, a - b is never equal to 0 if a != b. This is because Java mandates IEEE 754 floating point operations which support denormalized numbers. From the spec:

In particular, the Java programming language requires support of IEEE 754 denormalized floating-point numbers and gradual underflow, which make it easier to prove desirable properties of particular numerical algorithms. Floating-point operations do not "flush to zero" if the calculated result is a denormalized number.

If an FPU works with denormalized numbers, subtracting unequal numbers can never produce zero (unlike multiplication), also see this question.

For other languages, it depends. In C or C++, for example, IEEE 754 support is optional.

That said, it is possible for the expression 2 / (a - b) to overflow, for example with a = 5e-308 and b = 4e-308.

Jemmy answered 12/2, 2015 at 10:44 Comment(6)

However OP wants to know about 2/(a-b). Can this be guaranteed to be finite? – Endgame 12/2, 2015 at 10:47

Thanks for the answer, I added a link to wikipedia for the explanation of denormalized numbers. – Tirewoman 12/2, 2015 at 10:57

@Endgame See my edit. The division actually can overflow. – Jemmy 12/2, 2015 at 11:3

@Endgame (a,b) = (3,1) => 2/(a-b) = 2/(3-1) = 2/2 = 1 Whether this is true with IEEE floating point, I don't know – Centroclinal 12/2, 2015 at 18:59

A small correction: IEEE-754 is not required of any existing C++ standard, but C99 does demand IEEE-754 compliance. – Millpond 17/2, 2015 at 22:45

@DrewDormann IEEE 754 is also optional for C99. See Annex F of the standard. – Jemmy 18/2, 2015 at 10:40

As a workaround, what about the following?

public double calculation(double a, double b) {
     double c = a - b;
     if (c == 0)
     {
         return 0;
     }
     else
     {
         return 2 / c;
     }
}

That way you don't depend on IEEE support in any language.

Underhung answered 12/2, 2015 at 12:54 Comment(11)

Avoid the problem and simplify the test all at once. Me like. – Truck 12/2, 2015 at 16:24

-1 If a=b, you shouldn't be returning 0. Dividing by 0 in IEEE 754 gets you infinity, not an exception. You're avoiding the problem, so returning 0 is a bug waiting to happen. Consider 1/x + 1. If x=0, that'd result in 1, not the correct value: infinity. – Centroclinal 12/2, 2015 at 19:5

@ColeJohnson the correct answer is not infinity either (unless you specify which side the limit comes from, right side = +inf, left side = -inf, unspecified = undefined or NaN). – Headcheese 12/2, 2015 at 20:10

@NickT I'm using the logic that +/+=+ and +/-=-. Considering 0 can be positive and negative in IEEE 754, I'd say for a positive numerator, the sign of infinity depends on the sign of 0. – Centroclinal 12/2, 2015 at 22:33

-1 because the question is not about the best way to do this division, it's about whether division by zero is even possible in the example. This is answering a question which was not asked. – Capriole 13/2, 2015 at 1:36

@ChrisHayes: This is a valid answer to the question recognizing that the question may be an XY problem: meta.stackexchange.com/questions/66377/what-is-the-xy-problem – Grammarian 13/2, 2015 at 2:17

This too can return infinity. If c is denormalized 2/c can be greater than the maximum double. – Endgame 13/2, 2015 at 8:13

@ColeJohnson I was just trying to mimic the code of the OP, not inferring what he wanted as a result of the code but giving as result what he wanted. – Underhung 13/2, 2015 at 8:38

Absolutely true @ChrisHayes, that's why I start with as a workaround... but of course you're in your right of not wanting workarounds as responses – Underhung 13/2, 2015 at 8:39

@ColeJohnson Returning 0 is not really the issue. This is what the OP does in the question. You could put an exception or whatever is appropriate for the situation in that part of the block. If you don't like returning 0, that should a criticism of the question. Certainly, doing as the OP did doesn't warrant a downvote to the answer. This question has nothing to do with further computations after the given function completes. For all you know, the program's requirements necessitate returning 0. – Leeway 13/2, 2015 at 17:30

@ColeJohnson I think you miss the point, the question is not about what should be returned in case of a div by 0 (the OP want it to be zero, that it). The question is about how to detect a div by zero. – Landa 14/2, 2015 at 18:18

You wouldn't get a division by zero regardless of the value of a - b, since floating point division by 0 doesn't throw an exception. It returns infinity.

Now, the only way a == b would return true is if a and b contain the exact same bits. If they differ by just the least significant bit, the difference between them will not be 0.

EDIT :

As Bathsheba correctly commented, there are some exceptions:

"Not a number compares" false with itself but will have identical bit patterns.
-0.0 is defined to compare true with +0.0, and their bit patterns are different.

So if both a and b are Double.NaN, you will reach the else clause, but since NaN - NaN also returns NaN, you will not be dividing by zero.

Prostate answered 12/2, 2015 at 9:59 Comment(7)

Eran; not strictly true. "Not a number compares" false with itself but will have identical bit patterns. Also -0.0 is defined to compare true with +0.0, and their bit patterns are different. – Fabricate 12/2, 2015 at 10:0

@Fabricate I didn't consider these special cases. Thanks for the comment. – Prostate 12/2, 2015 at 10:1

@Eran, very good point that division by 0 will return infinity in a floating point. Added it to the question. – Tirewoman 12/2, 2015 at 10:6

@Prostate : if a is +-0.0 and b is also +-0.0 then devision will give NaN not infinity. – Zerk 12/2, 2015 at 12:54

@Zerk but the division wouldn't take place in this case, since a == b would return true. – Prostate 12/2, 2015 at 12:58

@Eran: i agree here division will not take place, but i was pointing this. – Zerk 12/2, 2015 at 13:0

Actually you could get a FP exception for division by zero, it's an option defined by the IEEE-754 standard, although it's probably not what most people would mean with "exception" ;) – Species 12/2, 2015 at 20:32

There is no case where a division by zero can happen here.

The SMT Solver Z3 supports precise IEEE floating point arithmetic. Let's ask Z3 to find numbers a and b such that a != b && (a - b) == 0:

(set-info :status unknown)
(set-logic QF_FP)
(declare-fun b () (FloatingPoint 8 24))
(declare-fun a () (FloatingPoint 8 24))
(declare-fun rm () RoundingMode)
(assert
(and (not (fp.eq a b)) (fp.eq (fp.sub rm a b) +zero) true))
(check-sat)

The result is UNSAT. There are no such numbers.

The above SMTLIB string also allows Z3 to pick an arbitrary rounding mode (rm). This means that the result holds for all possible rounding modes (of which there are five). The result also includes the possibility that any of the variables in play might be NaN or infinity.

a == b is implemented as fp.eq quality so that +0f and -0f compare equal. The comparison with zero is implemented using fp.eq as well. Since the question is aimed at avoiding a division by zero this is the appropriate comparison.

If the equality test was implemented using bitwise equality, +0f and -0f would have been a way to make a - b zero. An incorrect previous version of this answer contains mode details about that case for the curious.

Z3 Online does not yet support the FPA theory. This result was obtained using the latest unstable branch. It can be reproduced using the .NET bindings as follows:

var fpSort = context.MkFPSort32();
var aExpr = (FPExpr)context.MkConst("a", fpSort);
var bExpr = (FPExpr)context.MkConst("b", fpSort);
var rmExpr = (FPRMExpr)context.MkConst("rm", context.MkFPRoundingModeSort());
var fpZero = context.MkFP(0f, fpSort);
var subExpr = context.MkFPSub(rmExpr, aExpr, bExpr);
var constraintExpr = context.MkAnd(
        context.MkNot(context.MkFPEq(aExpr, bExpr)),
        context.MkFPEq(subExpr, fpZero),
        context.MkTrue()
    );

var smtlibString = context.BenchmarkToSMTString(null, "QF_FP", null, null, new BoolExpr[0], constraintExpr);

var solver = context.MkSimpleSolver();
solver.Assert(constraintExpr);

var status = solver.Check();
Console.WriteLine(status);

Using Z3 to answer IEEE float questions is nice because it is hard to overlook cases (such as NaN, -0f, +-inf) and you can ask arbitrary questions. No need to interpret and cite specifications. You can even ask mixed float and integer questions such as "is this particular int log2(float) algorithm correct?".

Phio answered 12/2, 2015 at 13:43 Comment(1)

Can you please add a link to SMT Solver Z3 and a link to an online interpreter? While this answer seems totally legit, someone can think that these results are wrong. – Kelsiekelso 15/2, 2015 at 0:36

The supplied function can indeed return infinity:

public class Test {
    public static double calculation(double a, double b)
    {
         if (a == b)
         {
             return 0;
         }
         else
         {
             return 2 / (a - b);
         }
    }    

    /**
     * @param args
     */
    public static void main(String[] args) {
        double d1 = Double.MIN_VALUE;
        double d2 = 2.0 * Double.MIN_VALUE;
        System.out.println("Result: " + calculation(d1, d2)); 
    }
}

The output is Result: -Infinity.

When the result of the division is to big to be stored in a double, infinity is returned even if the denominator is non-zero.

Sula answered 12/2, 2015 at 17:8 Comment(0)

In a floating-point implementation that conforms to IEEE-754, each floating-point type can hold numbers in two formats. One ("normalized") is used for most floating-point values, but the second-smallest number it can represent is only a tiny bit larger than the smallest, and so the difference between them is not representable in that same format. The other ("denormalized") format is used only for very small numbers that are not representable in the first format.

Circuitry to handle the denormalized floating-point format efficiently is expensive, and not all processors include it. Some processors offer a choice between either having operations on really small numbers be much slower than operations on other values, or having the processor simply regard numbers which are too small for normalized format as zero.

The Java specifications imply that implementations should support denormalized format, even on machines where doing so would make code run more slowly. On the other hand, it's possible that some implementations might offer options to allow code to run faster in exchange for slightly sloppy handling of values which would for most purposes be way too small to matter (in cases where values are too small to matter, it can be annoying having calculations with them take ten times as long as calculations that do matter, so in many practical situations flush-to-zero is more useful than slow-but-accurate arithmetic).

Verjuice answered 12/2, 2015 at 16:58 Comment(0)

In olden times before IEEE 754, it was quite possible that a != b didn't imply a-b != 0 and vice versa. That was one of the reasons to create IEEE 754 in the first place.

With IEEE 754 it is almost guaranteed. C or C++ compilers are allowed to do an operation with higher precision than needed. So if a and b are not variables but expressions, then (a + b) != c doesn't imply (a + b) - c != 0, because a + b could be calculated once with higher precision, and once without higher precision.

Many FPUs can be switched to a mode where they don't return denormalized numbers but replace them with 0. In that mode, if a and b are tiny normalised numbers where the difference is smaller than the smallest normalised number but greater than 0, a != b also doesn't guarantee a == b.

"Never compare floating-point numbers" is cargo cult programming. Among the people who have the mantra "you need an epsilon", most have no idea how to choose that epsilon properly.

Aerodonetics answered 14/2, 2015 at 19:14 Comment(0)

I can think of a case where you might be able to cause this to happen. Here's an analogous sample in base 10 - really, this would happen in base 2, of course.

Floating point numbers are stored more or less in scientific notation - that is, instead of seeing 35.2, the number being stored would be more like 3.52e2.

Imagine for the sake of convenience that we have a floating point unit that operates in base 10 and has 3 digits of accuracy. What happens when you subtract 9.99 from 10.0?

1.00e2-9.99e1

Shift to give each value the same exponent

1.00e2-0.999e2

Round to 3 digits

1.00e2-1.00e2

Uh oh!

Whether this can happen ultimately depends on the FPU design. Since the range of exponents for a double is very large, the hardware has to round internally at some point, but in the case above, just 1 extra digit internally will prevent any problem.

Caucasus answered 12/2, 2015 at 14:34 Comment(8)

The registers holding the aligned operands for subtraction are required to hold extra two bits, called "guard bits", to deal with this situation. In the scenario where the subtraction would cause a borrow from the most significant bit, either the smaller operand's magnitude must exceed half that of the larger operand (implying that it can only have one extra bit of precision) or else the result must be at least half the magnitude of the smaller operand (implying that it will only need only one more bit, plus information sufficient to ensure correct rounding). – Verjuice 12/2, 2015 at 16:54

“Whether this can happen ultimately depends on the FPU design” No, it cannot happen because the Java definition says it can't. The FPU design does not have anything to do with it. – Petronilapetronilla 13/2, 2015 at 1:57

@PascalCuoq: Correct me if I'm wrong, but strictfp is not enabled, it's possible for calculations to yield values which are too small for double but will fit in an extended-precision floating-point value. – Verjuice 13/2, 2015 at 16:4

@Verjuice The absence of strictfp only influences the values of “intermediate results”, and I am quoting from docs.oracle.com/javase/specs/jls/se7/html/jls-15.html#jls-15.4 . a and b are double variables, not intermediate results, so their values are double-precision values, thus are multiples of 2^-1074. The subtraction of these two double-precision values is consequently a multiple of 2^-1074, so the wider exponent range does change the property that the difference is 0 iff a == b. – Petronilapetronilla 13/2, 2015 at 16:16

@Verjuice This makes sense - you'd only need one extra bit to do this. – Caucasus 16/2, 2015 at 10:45

@PascalCuoq: Call m the smallest positive double. If two numbers are both representable as double, the difference will be a multiple of m. Then certainly double d=m*1.125; would set d to m, but I'm not sure if m==m*1.125; would be guaranteed to return true. – Verjuice 17/2, 2015 at 16:29

@Verjuice I know this, but the equality in the question is variable == variable, not expression == expression. A variable is not an “intermediate result”. – Petronilapetronilla 17/2, 2015 at 17:48

@PascalCuoq: The sample code is of that form, but I did not interpret that to mean that was the only case of interest. In cases where a subexpression is used exactly twice, it's pretty common for it to be repeated rather than assigned to a variable that would serve no other purpose, so if (a-b*c != 0) doSomething(a-b*c); would not seem implausible. – Verjuice 17/2, 2015 at 18:5

You shouldn't ever compare floats or doubles for equality; because, you can't really guarantee that the number you assign to the float or double is exact.

To compare floats for equality sanely, you need to check if the value is "close enough" to the same value:

if ((first >= second - error) || (first <= second + error)

Jehanna answered 12/2, 2015 at 10:0 Comment(19)

"Shouldn't ever" is a bit strong, but generally this is good advice. – Bane 12/2, 2015 at 10:1

It is not strong enough. Learned hard way. – Jehanna 12/2, 2015 at 10:1

While you are true, abs(first - second) < error (or <= error) is easier and more concise. – Advise 12/2, 2015 at 10:2

Thanks for the advise, I do know this, the question comes mostly from trying to understand floating point numbers better. You are very right on saying to avoid if you can't be 100% sure. – Tirewoman 12/2, 2015 at 10:3

While true in most cases (not all), doesn't really answer the question. – Haleyhalf 12/2, 2015 at 10:56

"You shouldn't ever compare floats or doubles for equality; because, you can't really guarantee that the number you assign to the float or double is exact" - if I just want to avoid NaN or infinity then checking to make sure a number isn't 0.0 is good enough as long as we follow the IEEE-754 spec (denormals and all). – Species 12/2, 2015 at 20:36

Testing floating-point numbers for equality is quite often useful. There is nothing sane about comparing with an epsilon that has not been carefully chosen, and even less sane about comparing with an epsilon when one is testing for equality. – Mciver 13/2, 2015 at 0:24

@tmyklebu: By "quite often" I presume you mean "very rarely". I wouldn't consider a use case that occurs in 0.001% (or less) of all code to be "often" – Grammarian 13/2, 2015 at 2:20

@tmyklebu, the value of your comments about 'quite often useful' is very close to 'an epsilon that has not been carefully chosen' as long as they are not supported by a single example. – Jehanna 13/2, 2015 at 20:26

@tmyklebu: Then I'd have to say that the value of testing floating point numbers for equality is very rarely useful. There is nothing sane about testing floating point numbers for equality except in a very, very, very (very) small number of use cases (if you can give me an example, that would be very useful). – Grammarian 14/2, 2015 at 2:9

@aviad: One clear example of where equality comparison is useful is when you're building a hash table with doubles as keys. Another is when you're trying to test whether a given simple function that takes double input and produces double output and has a tight enough contract works correctly. Another is exactly what's in this question; you can test whether you're evaluating a function at one of its singularities so you can handle that case appropriately. The uses of equality comparison of floating-point numbers are more diverse and not limited to these three. – Mciver 14/2, 2015 at 9:25

@tmyklebu, your clear example with hash table with double keys is not very common (otherwise it would be really awkward), because the 1st thing that will happen in hashCode function... shall I tell you? Ok... here is something you might not know: public int hashCode() { long bits = doubleToLongBits(value); return (int)(bits ^ (bits >>> 32)); } – Jehanna 14/2, 2015 at 16:2

@aviad: How doubles are hashed is implementation-dependent and not always at odds with equality comparison. You gave the hashCode method for Java's Double class, which is not the same as its double primitive. – Mciver 14/2, 2015 at 17:45

If you sort an array on a floating-point key, I can guarantee that your code won't work if you try to use tricks comparing floating-point numbers with an epsilon. Because the guarantee that a == b and b == c implies a == c isn't there anymore. For hash tables, the exact same problem. When equality isn't transitive, your algorithms just break. – Aerodonetics 14/2, 2015 at 19:4

@tmyklebu, this is how it is implemented in Java (haven't checked ALL implementations, but familiar with many of them). What hashCode method will be called by default on double key? Have you heard about auto-boxing? – Jehanna 14/2, 2015 at 20:33

@aviad: Have you heard about programming languages other than Java? – Mciver 14/2, 2015 at 20:47

If, in this question, the numerator were one, then the only valid error value (that is, one that would only fall into the "return zero instead of INF" branch if it would otherwise return INF) would be zero. So, this question itself is a good example of the "quite often" where an epsilon error value makes no sense. – Roue 15/2, 2015 at 4:14

@tmyklebu, yes. Few of them. However,the question is about Java. as it is stated in the last line. quote: 'Note that this question is for Java....' – Jehanna 15/2, 2015 at 11:1

@DewiMorgan: I think you need to pick a number smaller than 1 for that. Something like 2^(-23) ought to work. The smallest subnormal is somewhat smaller than the biggest subnormal is big. – Mciver 15/2, 2015 at 14:34

Based on @malarres response and @Taemyr comment, here is my little contribution:

public double calculation(double a, double b)
{
     double c = 2 / (a - b);

     // Should not have a big cost.
     if (isnan(c) || isinf(c))
     {
         return 0; // A 'whatever' value.
     }
     else
     {
         return c;
     }
}

My point is to says: the easyest way to know if the result of the division is nan or inf is actualy to perform the division.

Landa answered 14/2, 2015 at 18:31 Comment(0)

Division by zero is undefined, since the limit from positive numbers tend to infinity, the limited from negative numbers tend to negative infinity.

Not sure if this is C++ or Java since there is no language tag.

double calculation(double a, double b)
{
     if (a == b)
     {
         return nan(""); // C++

         return Double.NaN; // Java
     }
     else
     {
         return 2 / (a - b);
     }
}

Ossuary answered 18/2, 2015 at 4:48 Comment(0)

The core problem is that computer representation of a double (aka float, or real number in mathematical language) is wrong when you have "too much" decimal, for instance when you deal with double that can't be written as a numerical value (pi or the result of 1/3).

So a==b can't be done with any double value of a and b, how to you deal with a==b when a=0.333 and b=1/3 ? Depending of your OS vs FPU vs number vs language versus count of 3 after 0, you will have true or false.

Anyway if you do "double value calculation" on a computer, you have to deal with accuracy, so instead of doing a==b, you have to do absolute_value(a-b)<epsilon, and epsilon is relative to what you are modeling at that time in your algorithm. You can't have an epsilon value for all of your double comparison.

In brief, when you type a==b, you have a mathemical expression that can't be translated on a computer (for any floating point number).

PS: hum, everything I answer here is yet more or less in others responses and comments.

Interrelate answered 18/2, 2015 at 18:15 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags