How do you find a float's nearest non-equal value? [duplicate]

Asked 15/9, 2014 at 13:56 Answered 15/9, 2014 at 15:25

Solved c floating-point floating-accuracy

A float (a.k.a. single) value is a 4-byte value, and supposed to represent any real-valued number. Because of the way it is formatted and the finite number of bytes it is made off, there is a minimum value and a maximum value it can represent, and it has a finite precision, depending on it's own value.

I would like to know if there is a way to get the closest possible value above or below some reference value, given the finite precision of a float. With integers, this is trivial: one simply adds or subtracts 1. But with a float, you can't simply add or subtract the minimum float value and expect it to be different from your original value. I.e.

float FindNearestSmaller (const float a)
{
    return a - FLT_MIN; /* This doesn't necessarily work */
}

In fact, the above will almost never work. In the above case, the return will generally still equal a, as the FLT_MIN is far beyond the precision of a. You can easily try this out for yourself: it works for e.g. 0.0f, or for very small numbers of order FLT_MIN, but not for anything between 0 and 100.

So how would you get the value that is closest but smaller or larger than a, given floating point precision?

Note: Though i am mainly interested in a C/C++ answer, I assume the answer will be applicable for most programming languages.

Morpho answered 15/9, 2014 at 13:56 Comment(6)

Really curious to see an answer to this. – Clough 15/9, 2014 at 13:58

memcpy into an integer, add one, memcpy back into float ? – Cyanogen 15/9, 2014 at 13:58

@BartFriederichs: no, that won't work in all cases. – Kajdan 15/9, 2014 at 14:0

Also, stackoverflow.com/a/7408487 – Mansur 15/9, 2014 at 14:6

with float one simply adds one or removes one to the mantissa (and normalizes) to find the next nearest – Heuristic 15/9, 2014 at 14:52

There are already nextafter for you to use #16322965 #11160274 #10160579 – Everest 15/9, 2014 at 15:24

The standard way to find a floating-point value's neighbors is the function nextafter for double and nextafterf for float. The second argument gives the direction. Remember that infinities are legal values in IEEE 754 floating-point, so you can very well call nextafter(x, +1.0/0.0) to get the value immediately above x, and this will work even for DBL_MAX (whereas if you wrote nextafter(x, DBL_MAX), it would return DBL_MAX when applied for x == DBL_MAX).

Two non-standard ways that are sometimes useful are:

access the representation of the float/double as an unsigned integer of the same size, and increment or decrement this integer. The floating-point format was carefully designed so that for positive floats, and respectively for negative floats, the bits of the representation, seen as an integer, evolve monotonously with the represented float.
change the rounding mode to upward, and add the smallest positive floating-point number. The smallest positive floating-point number is also the smallest increment that there can be between two floats, so this will never skip any float. The smallest positive floating-point number is FLT_MIN * FLT_EPSILON.

For the sake of completeness, I will add that even without changing the rounding mode from its “to nearest” default, multiplying a float by (1.0f + FLT_EPSILON) produces a number that is either the immediate neighbor away from zero, or the neighbor after that. It is probably the cheapest if you already know the sign of the float you wish to increase/decrease and you don't mind that it sometimes does not produce the immediate neighbor. Functions nextafter and nextafterf are specified in such a way that a correct implementation on the x86 must test for a number of special values and FPU states, and is thus rather costly for what it does.

To go towards zero, multiply by 1.0f - FLT_EPSILON.

This doesn't work for 0.0f, obviously, and generally for the smaller denormalized numbers.

The values for which multiplying by 1.0f + FLT_EPSILON advance by 2 ULPS are just below a power of two, specifically in the interval [0.75 * 2^p … 2^p). If you don't mind doing a multiplication and an addition, x + (x * (FLT_EPSILON * 0.74)) should work for all normal numbers (but still not for zero nor for all the small denormal numbers).

Facer answered 15/9, 2014 at 14:10 Comment(0)

Look at the "nextafter" function, which is part of Standard C (and probably C++, but I didn't check).

Sherleysherline answered 15/9, 2014 at 13:59 Comment(2)

linux.die.net/man/3/nextafter – Kajdan 15/9, 2014 at 14:0

There's a std::nextafter and std::nexttoward starting in c++11. – Metaphosphate 15/9, 2014 at 14:10

I tried it out on my machine. And all three approaches:
1. adding with 1 and memcopying
2. adding FLT_EPSILON
3. multiplying by (1.0f + FLT_EPSILON)
seems to give the same answer.

see the result here
bash-3.2$ cc float_test.c -o float_test; ./float_test 1.023456 10
Original num: 1.023456
int added = 1.023456 01-eps added = 1.023456 mult by 01*(eps+1) = 1.023456
int added = 1.023456 02-eps added = 1.023456 mult by 02*(eps+1) = 1.023456
int added = 1.023456 03-eps added = 1.023456 mult by 03*(eps+1) = 1.023456
int added = 1.023456 04-eps added = 1.023456 mult by 04*(eps+1) = 1.023456
int added = 1.023457 05-eps added = 1.023457 mult by 05*(eps+1) = 1.023457
int added = 1.023457 06-eps added = 1.023457 mult by 06*(eps+1) = 1.023457
int added = 1.023457 07-eps added = 1.023457 mult by 07*(eps+1) = 1.023457
int added = 1.023457 08-eps added = 1.023457 mult by 08*(eps+1) = 1.023457
int added = 1.023457 09-eps added = 1.023457 mult by 09*(eps+1) = 1.023457
int added = 1.023457 10-eps added = 1.023457 mult by 10*(eps+1) = 1.023457

Code

#include <float.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>

int main(int argc, char *argv[])
{

    if(argc != 3) {
        printf("Usage: <binary> <floating_pt_num> <num_iter>\n");
        exit(0);
    }

    float f = atof(argv[1]);
    int count = atoi(argv[2]);

    assert(count > 0);

    int i;
    int num;
    float num_float;

    printf("Original num: %f\n", f);
    for(i=1; i<=count; i++) {
        memcpy(&num, &f, 4);
        num += i;
        memcpy(&num_float, &num, 4);
        printf("int added = %f \t%02d-eps added = %f \tmult by %2d*(eps+1) = %f\n", num_float, i, f + i*FLT_EPSILON, i, f*(1.0f + i*FLT_EPSILON));
    }

    return 0;
}

Niello answered 15/9, 2014 at 15:25 Comment(11)

%.9f is better than %f to show what is happening, and %a is even better. – Facer 15/9, 2014 at 17:58

Suggest testing against numbers on just either side of a power-of-2 (and away from 1.0, maybe -) like -1023.9 and -1024.1. Confident significant differences will be shown. – Rebuttal 15/9, 2014 at 18:14

@chux well that is expected. With larger numbers multiplying by 1+FLT_EPSILON will give a larger drift. Also the add-by-1 integer method will also affect the mantissa by a larger amount; in some cases like 2^30 changing the exponent as well. But adding by FLT_EPSILON still produces consistent results. Actually now the changes are even slower because we are changing bigger numbers; so effect of small FLT_EPSILON is shadowed. – Niello 15/9, 2014 at 18:33

f + i*FLT_EPSILON will clearly not work properly for values of f larger than 4. No one said that f*(1.0f + i*FLT_EPSILON) worked for i larger than 1. Interestingly, the num += i; method does work for arbitrary values of i as long as the float does not overflow. – Facer 15/9, 2014 at 18:34

@PascalCuoq Actually, num += i won't give you the correct answer when the test number gets bigger as I already explained that adding 1 will have a magnified effect on the test number. For instance, with 256 as test number, I get 256.000031 as the next number which can't be correct as there are at least 30 other visible floats in between. – Niello 15/9, 2014 at 18:46

The next larger float after 256.0f is 256.0000305...f. Ref: binary32 Adding 1 should have a magnified effect as f is larger. Disagree that there are at least 30 visible floats in between. – Rebuttal 15/9, 2014 at 18:56

"Adding 1" will work for typical float (binary32) from +0.0f to FLT_MAX. – Rebuttal 15/9, 2014 at 19:3

What makes you think that there are “at least 30 other visible floats” between 256 and 256.000031? What are the bit representations of these floats? What do they print when passed to printf("%.9e\n", (float) X)? – Facer 15/9, 2014 at 19:6

@chux corrected the priting issue. though i the test results were from updated code; so they were correct. BTW can't I declare float f = 256.00001 ? – Niello 15/9, 2014 at 19:7

@chux yes you are right... printing 256.000001 prints 256.00000 and printing 256.00002 prints 256.000031. Thanks for clarifying to me. – Niello 15/9, 2014 at 19:13

@Nikhil Vidhani Note: Using to printf("%f", f) to print the next floating point number is problematic for large and small float due to the non-exponential notation and limited number of decimal places. To get a clearer idea of the nature of the next float, use printf("%.*e", FLT_DECIMAL_DIG - 1, f). This will print FLT_DECIMAL_DIG digits, guaranteed by the C spec just precise enough to show the difference between any two adjacent floats. – Rebuttal 15/9, 2014 at 19:22

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags