Finding the maximum of a floating point counter
Asked Answered
C

3

8

My apologies if this has been asked before, but I cannot find it.

I was wondering if there is a way to calculate the point at which a single precision floating point number that is used as a counter will reach a 'maximum' (the point at which it is no longer able to add another value due to loss of precision).

For example, if I continuously add 0.1f to a float I will eventually reach a point where the value does not change:

const float INCREMENT = 0.1f;
float value = INCREMENT;
float prevVal = 0.0f;

do {
  prevVal = value;
  value += INCREMENT;
} while (value != prevVal);

cout << value << endl;

On GCC this outputs 2.09715e+06

Is there a way to compute this mathematically for different values of INCREMENT? I believe it should in theory be when the exponent portion of the float requires a shift beyond 23 bits, resulting in losing the mantissa and simply adding 0.

Chamfron answered 7/11, 2018 at 16:6 Comment(5)
Have you checked std::numeric_limit to see if it have something you could use?Windowsill
This seems to be asking how to make a float act like an int. If that's the case, the answer is simple: use int.Finnigan
@Someprogrammerdude I've just been looking and there's nothing out of the box, unless I'm missing it. Could probably come up with some hackery using a combination of its members thoughSpacecraft
The question is valid but in this particular use case I would agree that you should be using int (with a 10x multiplier)Spacecraft
Agreed that in a practical scenario it is more reasonable to use an integer and simply use fixed-point math, but I was curious if there was a way to determine this.Chamfron
J
1

Given some positive y used as an increment, the smallest X for which adding y does not produce a result greater than X is the least power of 2 not less than y divided by half the “epsilon” of the floating-point format. It can be calculated by:

Float Y = y*2/std::numeric_limits<Float>::epsilon();
int e;
std::frexp(Y, &e);
Float X = std::ldexp(.5, e);
if (X < Y) X *= 2;

A proof follows. I assume IEEE-754 binary floating-point arithmetic using round-to-nearest-ties-to-even.

When two numbers are added in IEEE-754 floating-point arithmetic, the result is the exact mathematical result rounded to the nearest representable value in a selected direction.

A note about notation: Text in source code format represents floating-point values and operations. Other text is mathematical. So x+y is the exact mathematical sum of x and y, x is x in floating-point format, and x+y is the result of adding x and y in a floating-point operation. Also, I will use Float for the floating-point type in C++.

Given a floating-point number x, consider adding a positive value y using floating-point arithmetic, x+y. Under what conditions will the result exceed x?

Let x1 be the next value greater than x representable in the floating-point format, and let xm be the midpoint between x and x1. If the mathematical value of x+y is less than xm, then the floating-point calculation x+y rounds down, so it produces x. If x+y is greater than xm, either it rounds up and produces x1, or it produces some greater number because y is large enough to move the sum beyond x1. If x+y equals xm, the result is whichever of x or x1 has an even low digit. For reasons we will see, this is always x in the situations relevant to this question, so the calculation rounds down.

Therefore, x+y produces a result greater than x if and only if x+y exceeds xm, meaning that y exceeds half the distance from x to x1. Note that the distance from x to x1 is the value of 1 in the low digit of the significand of x.

In a binary floating-point format with p digits in its significand, the position value of the low digit is 21−p times the position value of the high digit. For example, if x is 2e, the highest bit in its significand represents 2e, and the lowest bit represents 2e+1−p.

The question asks, given a y, what is the least x for which x+y does not produce a result greater than x? It is the least x for which y does not exceed half the value of the low digit of the significand of x.

Let 2e be the position value of the high bit of the significand of x. Then y ≤ ½•2e+1−p = 2ep, so y•2p ≤ 2e.

Therefore, given some positive y, the least x for which x+y does not produce a result greater than x has its leading bit, 2e, equal to or exceeding y•2p. And in fact it must be exactly 2e because all other floating-point numbers whose leading bit has position value 2e have other bits set in their significands, so they are greater. 2e is the least number for which the leading bit represents 2e.

Therefore, x is the least power of two that equals or exceeds y•2p.

In C++, std::numeric_limits<Float>::epsilon() (from the <limits> header) is the step from 1 to the next representable value, meaning it is 21−p. So y•2p equals y*2/std::numeric_limits<Float>::epsilon(). (This operation is exact unless it overflows to ∞.)

Let’s assign this to a variable:

Float Y = y*2/std::numeric_limits<Float>::epsilon();

We can find the position value represented by the highest bit of Y’s significand by using frexp (from the <cmath> header) to extract the exponent from the floating-point representation of Y and ldexp (also <cmath>) to apply that exponent to a new significand (.5 because of the scale that frexp and ldexp use):

int e;
std::frexp(Y, &e);
Float X = std::ldexp(.5, e);

Then X is a power of two, and it is less than or equal to Y. It is in fact the greatest power of two not greater than Y, since the next greater power of 2, 2X, is greater than Y. However, we want the least power of two not less than Y. We can find this with:

if (X < Y) X *= 2;

The resulting X is the number sought by the question.

Jocasta answered 7/11, 2018 at 18:32 Comment(6)
What about denormal numbers?Kwei
@EOF: Y is never subnormal, since, even if y is the smallest representable positive value, multiplying it by 2/epsilon produces a normal value. And whether y is subnormal is irrelevant, as its value partakes in x+y without regard to whether it is normal or subnormal. The only exceptional case is when overflow to infinity occurs while calculating Y or X, and that is okay because then the greatest finite number is too small to be X, so ∞ is the correct answer.Jocasta
This is a great write-up, but I found some of it a bit confusing. The part about finding "the position value represented by the highest bit of Y's significand" does not make sense to me. For example with a y of 0.1f we end up with a Y of 1677721.6. The significand of this is 5033165, and an exponent of 147. X ends up as 1048576, which is essentially the same exponent (147) with significand of 0.Chamfron
@CplClegg: 147 is the encoding of the exponent, not the exponent. The actual exponent is 24. The position value of a bit in the significand is the amount by which a number changes if that bit is changed. For example, if the exponent is 24, changing the leading bit of the significand changes the value by 2^24, so its position value is 2^24. Changing the lowest bit of the significand changes the value by 2^1, so its position value is 2^1. (Note that, just as the exponent is encoded, the actual 24-bit significand is encoded with 23 in a significand field and 1 derived from the exponent field.)Jocasta
Yes, I gave the encoded value. The real exponent is 20, not 24 (147 - 127). Also, the leading bit (I assume this is the same as the most significant bit) would change the value of Y by 2^19, not 2^20. An easier example would be a Y of 1.0. This has an exponent of 0, toggling the most significant bit in the mantissa/significand would add 0.5 or 2^-1. Am I off somewhere?Chamfron
@CplClegg: Sorry, I was looking at the wrong number. Yes, for .1, Y is about 1,677,721.6. The value of its leading bit is 2^20. For 1.0, the value of its leading bit is 2^0. The significand is the full significand, the s in 1.0 = + s • 2^0—all 24 bits. The 23 bits in the significand field are not the full significand, just the last 23 bits. The significand field only encodes (most of) the significand; it is not the significand.Jocasta
J
1

Yes it is possible. there is std::numeric_limits::epsilon() which defines smallest value which can increase value 1.0.

Using this you can calculate this limit for any number.

In C there is DBL_EPSILON

So in your case this goes like this:

template<class T>
auto maximumWhenAdding(T delta) -> T
{
    static_assert(std::is_floating_point_v<T>, "Works only for floating points.");
    int power2= std::ilogb(delta);
    float roudedDelta = ldexp(T { 1.0 }, power2);
    if (roudedDelta != delta) {
        roudedDelta *= 2;
    }

    return 2 * roudedDelta / std::numeric_limits<T>::epsilon();
}

live example C++

Note in live test examples delta fails to increase maxForDelta, but subtraction is successful, so this is exactly what you need.

Jeaninejeanlouis answered 7/11, 2018 at 16:12 Comment(10)
but using epsilon he can calculate value he needs.Jeaninejeanlouis
This does not give the correct answer. I am not sure how I would answer my question using epsilon, as it is only the smallest number that can be added to 1.0f, which doesn't seem relevant to finding the point at which value += 0.1f stops working.Chamfron
in original answer there was rounding issue. Now it is fixed and code providing correctness added.Jeaninejeanlouis
This works for most numbers, but does not work for numbers that have a mantissa of zero. For example 1.0f yields 33554432, which is a power of 2 off. Also, I am unclear as to why this works.Chamfron
Subtracting numeric_limits::epsilon() from delta prior to using frexp seems to solve the case 1.0fChamfron
The opening paragraph in this answer is wrong. epsilon is the distance between 1 and the next representable value. That is not the smallest value that can be added to 1 to increase it. Adding any number greater than ½ of epsilon will produce a result greater than 1.Jocasta
@EricPostpischil you are right, but still this is a good track. Just all corner cases have to be taken into account. I will polish that tomorrow.Jeaninejeanlouis
The code in this answer produces an incorrect result when delta is a power of two.Jocasta
@MarekR -- "just all corner cases have to be taken into account" (emphasis added) -- that's always the issue in floating-point math: there are many corner cases. <g>Finnigan
@PeteBecker That certainly is the case. Note that I didn't ask how to do this programmatically, though I appreciate the effort. Simple math is fine. The code for this ends up being complex if you are trying to deal with a delta that can be any value.Chamfron
J
1

Given some positive y used as an increment, the smallest X for which adding y does not produce a result greater than X is the least power of 2 not less than y divided by half the “epsilon” of the floating-point format. It can be calculated by:

Float Y = y*2/std::numeric_limits<Float>::epsilon();
int e;
std::frexp(Y, &e);
Float X = std::ldexp(.5, e);
if (X < Y) X *= 2;

A proof follows. I assume IEEE-754 binary floating-point arithmetic using round-to-nearest-ties-to-even.

When two numbers are added in IEEE-754 floating-point arithmetic, the result is the exact mathematical result rounded to the nearest representable value in a selected direction.

A note about notation: Text in source code format represents floating-point values and operations. Other text is mathematical. So x+y is the exact mathematical sum of x and y, x is x in floating-point format, and x+y is the result of adding x and y in a floating-point operation. Also, I will use Float for the floating-point type in C++.

Given a floating-point number x, consider adding a positive value y using floating-point arithmetic, x+y. Under what conditions will the result exceed x?

Let x1 be the next value greater than x representable in the floating-point format, and let xm be the midpoint between x and x1. If the mathematical value of x+y is less than xm, then the floating-point calculation x+y rounds down, so it produces x. If x+y is greater than xm, either it rounds up and produces x1, or it produces some greater number because y is large enough to move the sum beyond x1. If x+y equals xm, the result is whichever of x or x1 has an even low digit. For reasons we will see, this is always x in the situations relevant to this question, so the calculation rounds down.

Therefore, x+y produces a result greater than x if and only if x+y exceeds xm, meaning that y exceeds half the distance from x to x1. Note that the distance from x to x1 is the value of 1 in the low digit of the significand of x.

In a binary floating-point format with p digits in its significand, the position value of the low digit is 21−p times the position value of the high digit. For example, if x is 2e, the highest bit in its significand represents 2e, and the lowest bit represents 2e+1−p.

The question asks, given a y, what is the least x for which x+y does not produce a result greater than x? It is the least x for which y does not exceed half the value of the low digit of the significand of x.

Let 2e be the position value of the high bit of the significand of x. Then y ≤ ½•2e+1−p = 2ep, so y•2p ≤ 2e.

Therefore, given some positive y, the least x for which x+y does not produce a result greater than x has its leading bit, 2e, equal to or exceeding y•2p. And in fact it must be exactly 2e because all other floating-point numbers whose leading bit has position value 2e have other bits set in their significands, so they are greater. 2e is the least number for which the leading bit represents 2e.

Therefore, x is the least power of two that equals or exceeds y•2p.

In C++, std::numeric_limits<Float>::epsilon() (from the <limits> header) is the step from 1 to the next representable value, meaning it is 21−p. So y•2p equals y*2/std::numeric_limits<Float>::epsilon(). (This operation is exact unless it overflows to ∞.)

Let’s assign this to a variable:

Float Y = y*2/std::numeric_limits<Float>::epsilon();

We can find the position value represented by the highest bit of Y’s significand by using frexp (from the <cmath> header) to extract the exponent from the floating-point representation of Y and ldexp (also <cmath>) to apply that exponent to a new significand (.5 because of the scale that frexp and ldexp use):

int e;
std::frexp(Y, &e);
Float X = std::ldexp(.5, e);

Then X is a power of two, and it is less than or equal to Y. It is in fact the greatest power of two not greater than Y, since the next greater power of 2, 2X, is greater than Y. However, we want the least power of two not less than Y. We can find this with:

if (X < Y) X *= 2;

The resulting X is the number sought by the question.

Jocasta answered 7/11, 2018 at 18:32 Comment(6)
What about denormal numbers?Kwei
@EOF: Y is never subnormal, since, even if y is the smallest representable positive value, multiplying it by 2/epsilon produces a normal value. And whether y is subnormal is irrelevant, as its value partakes in x+y without regard to whether it is normal or subnormal. The only exceptional case is when overflow to infinity occurs while calculating Y or X, and that is okay because then the greatest finite number is too small to be X, so ∞ is the correct answer.Jocasta
This is a great write-up, but I found some of it a bit confusing. The part about finding "the position value represented by the highest bit of Y's significand" does not make sense to me. For example with a y of 0.1f we end up with a Y of 1677721.6. The significand of this is 5033165, and an exponent of 147. X ends up as 1048576, which is essentially the same exponent (147) with significand of 0.Chamfron
@CplClegg: 147 is the encoding of the exponent, not the exponent. The actual exponent is 24. The position value of a bit in the significand is the amount by which a number changes if that bit is changed. For example, if the exponent is 24, changing the leading bit of the significand changes the value by 2^24, so its position value is 2^24. Changing the lowest bit of the significand changes the value by 2^1, so its position value is 2^1. (Note that, just as the exponent is encoded, the actual 24-bit significand is encoded with 23 in a significand field and 1 derived from the exponent field.)Jocasta
Yes, I gave the encoded value. The real exponent is 20, not 24 (147 - 127). Also, the leading bit (I assume this is the same as the most significant bit) would change the value of Y by 2^19, not 2^20. An easier example would be a Y of 1.0. This has an exponent of 0, toggling the most significant bit in the mantissa/significand would add 0.5 or 2^-1. Am I off somewhere?Chamfron
@CplClegg: Sorry, I was looking at the wrong number. Yes, for .1, Y is about 1,677,721.6. The value of its leading bit is 2^20. For 1.0, the value of its leading bit is 2^0. The significand is the full significand, the s in 1.0 = + s • 2^0—all 24 bits. The 23 bits in the significand field are not the full significand, just the last 23 bits. The significand field only encodes (most of) the significand; it is not the significand.Jocasta
C
0

Marek's Answer is pretty close, and a decent way to find it using a program (that is more efficient than the one I originally posted). However, I don't necessarily need the answer in a program form, just a mathematical one.

From what I can tell, the answer comes down to the exponent of the delta used, and the number of mantissa bits. We need to round to the nearest power of 2, which is kind of complicated. Basically if the mantissa is 0, we do nothing, otherwise we add 1 to the exponent. So, assuming we now have the delta as a power of 2, represented as 1.0 x 2exp, and a mantissa of N bits, the maximum value is 1.0 x 2(N + exp). Note that FLT_EPSILON in C is equal to 1.0 x 2-N. So we can also find this by dividing our nearest power of 2 by FLT_EPSILON.

For a delta of 0.1, the nearest power of 2 is 0.125, or 1.0 x 2-3. Therefore we want 1.0 x 2(23 + (-3)) or 1.0 x 221 which is equal to 2097152.

Chamfron answered 7/11, 2018 at 18:55 Comment(1)
Wrote this before Eric's answer was posted. I've selected his as its a more thorough answer that I think gets to the root of the question.Chamfron

© 2022 - 2024 — McMap. All rights reserved.