Exact value of a floating-point number as a rational

I'm looking for a method to convert the exact value of a floating-point number to a rational quotient of two integers, i.e. a / b, where b is not larger than a specified maximum denominator b_max. If the exact value cannot be represented with b <= b_max, the result should fall back to the best approximation that still satisfies b <= b_max.

Hold on. There are a lot of questions/answers here about the best rational approximation of a truncated real number which is represented as a floating-point number. However, I'm interested in the exact value of a floating-point number, which is itself a rational number with a different representation. More specifically, the mathematical set of floating-point numbers is a subset of the rational numbers; in the case of the IEEE 754 binary floating-point standard, it is a subset of the dyadic rationals. In any case, any finite floating-point number can be converted to a rational quotient of two finite-precision integers a / b.

So, for example, assuming the IEEE 754 single-precision binary floating-point format, the rational equivalent of float f = 1.0f / 3.0f is not 1 / 3 but 11184811 / 33554432. This is the exact value of f, which is a member of the mathematical set of IEEE 754 single-precision binary floating-point numbers.
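
For illustration only, a quick check of that claim (my own sketch, assuming IEEE 754 binary32 float and round-to-nearest):

#include <stdio.h>

int main(void)
{
    /* Under the stated assumptions, the float nearest to 1/3 is exactly
     * 11184811 / 2^25, so the comparison below prints 1. */
    float f = 1.0f / 3.0f;
    printf("%d\n", f == 11184811.0f / 33554432.0f);
    printf("%.17g\n", (double)f);   /* 0.3333333432674408 */
    return 0;
}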

Based on my experience, a binary-search traversal of the Stern-Brocot tree is not useful here, since that is better suited to approximating the value of a floating-point number when it is interpreted as a truncated real rather than as an exact rational.

Possibly, continued fractions are the way to go.

Another problem here is integer overflow. Suppose we want to represent the rational as the quotient of two int32_t values, with the maximum denominator b_max = INT32_MAX. We cannot simply rely on a stopping criterion like b > b_max, because the intermediate products can overflow before the check is even reached. So the algorithm must either never overflow, or it must detect overflow.
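
One common way around this is to rearrange the comparison so it uses a division instead of a product; a minimal sketch (names are illustrative, assuming non-negative values, k1 > 0, and k0 <= md):

#include <stdint.h>

/* Returns nonzero when a * k1 + k0 would exceed md, without forming the
 * product. Same truth value as a * k1 + k0 > md, but cannot overflow. */
static int next_denominator_too_big(int64_t a, int64_t k1, int64_t k0, int64_t md)
{
    return a > (md - k0) / k1;
}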

What I have found so far is an algorithm from Rosetta Code which is based on continued fractions, but its source mentions it is "still not quite complete". Some basic tests gave good results, but I cannot confirm its overall correctness, and I think it can easily overflow.

// https://rosettacode.org/wiki/Convert_decimal_number_to_rational#C

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <stdint.h>

/* f : number to convert.
 * num, denom: returned parts of the rational.
 * md: max denominator value.  Note that machine floating point number
 *     has a finite resolution (10e-16 ish for 64 bit double), so specifying
 *     a "best match with minimal error" is often wrong, because one can
 *     always just retrieve the significand and return that divided by 
 *     2**52, which is in a sense accurate, but generally not very useful:
 *     1.0/7.0 would be "2573485501354569/18014398509481984", for example.
 */
void rat_approx(double f, int64_t md, int64_t *num, int64_t *denom)
{
    /*  a: continued fraction coefficients. */
    int64_t a, h[3] = { 0, 1, 0 }, k[3] = { 1, 0, 0 };
    int64_t x, d, n = 1;
    int i, neg = 0;

    if (md <= 1) { *denom = 1; *num = (int64_t) f; return; }

    if (f < 0) { neg = 1; f = -f; }

    while (f != floor(f)) { n <<= 1; f *= 2; }
    d = f;

    /* continued fraction and check denominator each step */
    for (i = 0; i < 64; i++) {
        a = n ? d / n : 0;
        if (i && !a) break;

        x = d; d = n; n = x % n;

        x = a;
        if (k[1] * a + k[0] >= md) {
            x = (md - k[0]) / k[1];
            if (x * 2 >= a || k[1] >= md)
                i = 65;
            else
                break;
        }

        h[2] = x * h[1] + h[0]; h[0] = h[1]; h[1] = h[2];
        k[2] = x * k[1] + k[0]; k[0] = k[1]; k[1] = k[2];
    }
    *denom = k[1];
    *num = neg ? -h[1] : h[1];
}
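
For reference, a small driver of my own (assuming it is compiled together with the rat_approx definition above) recovers the exact value of float 1/3:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int64_t num, den;
    float f = 1.0f / 3.0f;                  /* exactly 11184811 / 33554432 */
    rat_approx((double)f, INT32_MAX, &num, &den);
    printf("%lld / %lld\n", (long long)num, (long long)den);
    return 0;
}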
Devastation asked 2/7, 2018 at 19:00
itu.dk/~sestoft/bachelor/IEEE754_article.pdfPersistence
@πάνταῥεῖ I know this article. How does it help here?Devastation
A base-2 representation can only express a subset of all possible rational fractions.Persistence
@πάνταῥεῖ Yes, but that subset still contains valid rational fractions.Devastation
Well, if you could have an arbitrary number of bits for the representation you might be able to represent all of them. Maybe I didn't understand what you actually want to do?Persistence
I said "This is the exact value of f, which is a number from the mathematical set of IEEE 754 single-precision binary floating-point numbers." The mathematical set of floating-point numbers is a subset of rational numbers. All IEEE 754 floating-point numbers can be represented as a quotient of two integers. There is no need for infinite bits. I already wrote a Wolfram Mathematica function to do that. Unfortunately that is not portable to C++.Devastation
I guess I don’t understand the problem: get the significand, get the exponent. For a positive exponent the exact value is the significand shifted left, divided by one; otherwise it is the significand divided by 2 shifted by the negative exponent, with some small faffing about to account for the digits of the significand. Obviously you’ll need big integers to avoid overflow.Algonquin
You may be interested in Python's Fraction.limit_denominator, which does exactly this. Of course, that doesn't help with the overflow problem in C, but the algorithm may be useful. Specifically, in Python, for a (Python) float x, Fraction(x).limit_denominator(b_max) gives what you want.Schnabel
@MarkDickinson Thanks, I will definitely look into it!Devastation
You should focus on the algorithm IMO. Precision of calculations can be extended (at the cost of memory and runtime) through library facilities such as boost::multiprecision.Oops
@DietmarKühl The significand in IEEE 754 is encoded as a binary fraction. Converting that binary fraction to decimal naively is a complete disaster because you have to add k rationals with different denominators, i.e. you have to perform a lot of multiplication and addition (not to mention applying the exponent at the end), which leads to overflow very very soon. Of course you could simplify in each step using the greatest common divisor, but that would make the algorithm very slow. This approach also cannot provide the closest approx if the rational cannot be represented as b <= b_max.Devastation
The significand is an integer. You’ll only need to multiply it by a power of 2 to get the exact value. Clearly, you will need big integers to prevent overflow when representing values with interesting exponents. Converting these to a decimal representation is a separate question and can be done by appropriate divisions by suitable powers of 10 (you can keep using 10 if you are so inclined). Since you asked for an algorithm: that’ll do the trick. Since floating-point values are fixed-sized you don’t need a variable-size implementation, which makes things a bit more efficient.Algonquin
The greatest common divisor won’t help you much, as the denominator is a power of 2. It will at best shift off a few zero bits. You’ll need a large integer representation to get the exact value.Algonquin
@DietmarKühl There is no guarantee that the integers used to represent the quotient a / b will be arbitrary-precision. If they are not, then I am interested in the best approximation which still satisfies b <= b_max, where b_max is the largest value of that integer type. This is why an iterative refinement algorithm is desirable.Devastation
chux has given an answer that may be suitable for IEEE-754 binary formats (I have not examined it yet). For completeness, I will point out that C provides FLT_RADIX, which defines the radix used for the floating-point formats. Multiplying or dividing by FLT_RADIX should be exact (and note the scalbn function), so, given some floating-point value x that is not an integer, one can find the desired numerator by multiplying by FLT_RADIX until the result is an integer, and the denominator is FLT_RADIX to the power of the number of multiplications.Setter
@EricPostpischil Great approach, but what about denormal numbers? Will this work for them as well? You can also encounter too big numbers during the multiplication by FLT_RADIX.Devastation
@plasmacel: Multiplying denormal numbers should work fine. Multiplying by FLT_RADIX until a number is an integer will not overflow in any normal format. Theoretically, one could define a floating-point format such that the exponent is limited to a weird range like -20 to -10 instead of a more normal -126 to 127, but, in practice, exponent ranges allow the significand to be scaled to an integer.Setter
@EricPostpischil I guess this method also guarantees the canonical form of the rational, i.e. it cannot be further simplified by the greatest common divisor since the GCD is 1.Devastation
@EricPostpischil At least if the radix is a prime number. Otherwise the computed numerator and denominator could be simplified by the greatest common divisor. Think about radix 10.Devastation
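
A minimal sketch of the repeated multiplication by FLT_RADIX suggested in the comments above (my own illustration, not from the answer below; it assumes x is finite and that the growing denominator stays representable, which fails for the smallest subnormals that the answer below handles explicitly):

#include <float.h>
#include <math.h>

/* Scale x by FLT_RADIX until it becomes a whole number; the denominator is
 * then FLT_RADIX raised to the number of multiplications performed. */
static void exact_ratio(double x, double *num, double *den)
{
  *den = 1.0;
  while (x != floor(x)) {
    x *= FLT_RADIX;       /* exact scaling by the radix for binary formats */
    *den *= FLT_RADIX;
  }
  *num = x;
}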

All finite double values are rational numbers, as the OP stated.

Use frexp() to break the number into its fraction and exponent. The end result still needs to use double for the whole-number numerator and denominator due to range requirements. Some numbers are too small (x smaller than 1.0/pow(2.0, DBL_MAX_EXP)), and infinity and not-a-number are further issues.

The frexp functions break a floating-point number into a normalized fraction and an integral power of 2. ... interval [1/2, 1) or zero ...
C11 §7.12.6.4 2/3

#include <math.h>
#include <float.h>
#include <stdio.h>  // for printf() in split_test()

_Static_assert(FLT_RADIX == 2, "TBD code for non-binary FP");

// Return error flag
int split(double x, double *numerator, double *denominator) {
  if (!isfinite(x)) {
    *numerator = *denominator = 0.0;
    if (x > 0.0) *numerator = 1.0;
    if (x < 0.0) *numerator = -1.0;
    return 1;
  }
  int bdigits = DBL_MANT_DIG;
  int expo;
  *denominator = 1.0;
  // frexp() yields a fraction in [0.5, 1); scaling by 2^DBL_MANT_DIG turns it
  // into a whole number that a double holds exactly.
  *numerator = frexp(x, &expo) * pow(2.0, bdigits);
  expo -= bdigits;
  if (expo > 0) {
    // Large magnitudes: scale the whole-number numerator up; denominator stays 1.
    *numerator *= pow(2.0, expo);
  }
  else if (expo < 0) {
    expo = -expo;
    if (expo >= DBL_MAX_EXP-1) {
      // Very small values: the full denominator would overflow, so cap it at
      // the largest representable power of 2 and flag an error when the
      // numerator then drops below 1 (no longer a whole number).
      *numerator /= pow(2.0, expo - (DBL_MAX_EXP-1));
      *denominator *= pow(2.0, DBL_MAX_EXP-1);
      return fabs(*numerator) < 1.0;
    } else {
      *denominator *= pow(2.0, expo);
    }
  }

  // Reduce the fraction: the denominator is a power of 2, so the only
  // possible common factors are 2s.
  while (*numerator && fmod(*numerator,2) == 0 && fmod(*denominator,2) == 0) {
    *numerator /= 2.0;
    *denominator /= 2.0;
  }
  return 0;
}

void split_test(double x) {
  double numerator, denominator;
  int err = split(x, &numerator, &denominator);
  printf("e:%d x:%24.17g n:%24.17g d:%24.17g q:%24.17g\n", 
      err, x, numerator, denominator, numerator/ denominator);
}

int main(void) {
  volatile float third = 1.0f/3.0f;
  split_test(third);
  split_test(0.0);
  split_test(0.5);
  split_test(1.0);
  split_test(2.0);
  split_test(1.0/7);
  split_test(DBL_TRUE_MIN);
  split_test(DBL_MIN);
  split_test(DBL_MAX);
  return 0;
}

Output

e:0 x:      0.3333333432674408 n:                11184811 d:                33554432 q:      0.3333333432674408
e:0 x:                       0 n:                       0 d:        9007199254740992 q:                       0
e:0 x:                     0.5 n:                       1 d:                       2 q:                     0.5
e:0 x:                       1 n:                       1 d:                       1 q:                       1
e:0 x:                       2 n:                       2 d:                       1 q:                       2
e:0 x:     0.14285714285714285 n:        2573485501354569 d:       18014398509481984 q:     0.14285714285714285
e:1 x: 4.9406564584124654e-324 n:  4.4408920985006262e-16 d: 8.9884656743115795e+307 q: 4.9406564584124654e-324
e:0 x: 2.2250738585072014e-308 n:                       2 d: 8.9884656743115795e+307 q: 2.2250738585072014e-308
e:0 x: 1.7976931348623157e+308 n: 1.7976931348623157e+308 d:                       1 q: 1.7976931348623157e+308

Leave the b_max consideration for later.


More expedient code is possible by replacing pow(2.0, expo) with ldexp(1, expo) (@gammatester) or exp2(expo) (@Bob__).

The while (*numerator && fmod(*numerator,2) == 0 && fmod(*denominator,2) == 0) reduction loop could also use some performance improvements. But first, let us get the functionality as needed.
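
For illustration, a tiny standalone sketch of the ldexp() form (my own, not a drop-in patch to the code above); it shows the exact power-of-two scaling and the inverse recombination mentioned in the comments:

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
  int expo;
  double x = 1.0 / 7;
  /* ldexp(1.0, n) forms an exact power of two for the scaling step ... */
  double num = frexp(x, &expo) * ldexp(1.0, DBL_MANT_DIG);
  expo -= DBL_MANT_DIG;
  /* ... and ldexp(num, expo) merges the parts back into the original x. */
  printf("%.17g * 2^%d = %.17g\n", num, expo, ldexp(num, expo));
  return 0;
}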

Epithelium answered 2/7, 2018 at 19:44
Nice! I'll do some tests soon before I accept it. Do you have any idea how to handle "too small" values?Devastation
@Devastation The limitations and even type of b_max warrant further thought. To more precisely render an integer type solution, I'd start with intmax_t num, uintmax_t den; for potentially some more range.Epithelium
@Devastation "too small" --> I think the question could change to "integer / power-of-2" and then range issues are far less of a concern (except then how to code infinity and NaN, hmmm) .Epithelium
Those too small values, more precisely values smaller than 1.0 / pow(2, DBL_MAX_EXP), are actually denormal I think.Devastation
@Devastation "too small values" are de-normal (or sub-normal) for binary64, yet this answer does not assume that format as C does not specify that. It is just the most common format and is true there.Epithelium
Is it possible for en.cppreference.com/w/c/numeric/math/exp2 to be implemented more efficiently or with more accuracy than a generic pow(2.0,x)?Cumin
@gammatester ldexp() would be useful for the inverse operation, merging the whole number numerator and the log2(denominator) to get back the original x.Epithelium
pow(2, exp) == ldexp(1, exp)Petersburg
@Cumin Yes, exp2() would be better in general. A good compiler will analyze code such as pow(2.0, expo) and emit the same code as exp2(expo).Epithelium
@Petersburg Yes, good idea to use ldexp(1.0, expo) for pow(2, expo).Epithelium
@chux If I understand correctly, numbers smaller than 1.0 / pow(2, MAX_EXP) (which are "too small") result in a numerator which is a negative power of 2.0. Such cases could be resolved by multiplying both the numerator and denominator by 1.0 / numerator. Or do those cases already have the maximum representable denominator?Devastation
@Devastation Yes. Using double for the denominator, it can only get so big before "infinity". Thus the test if (expo >= DBL_MAX_EXP-1). Perhaps this code could be refined a bit here, but the point being at some point, denominator is limited to how high it can go. Yes, we hit the maximum representable power-of-2 denominator.Epithelium
@chux Then this is not a real issue here. I assume that if a fixed-precision integer type is used instead of an arbitrary-precision, then converting this double to that type (clamping to the maximum value of the fixed integer type) gives the best possible approximation for that type. If an arbitrary-precision integer type is used instead, then a workaround can be easily made for this issue.Devastation
@chux Is my presumption true that any floating-point number can be represented by a quotient of two floating-point numbers of the same type? And another optimization idea: wouldn't division by the greatest common divisor (GCD, implemented by the Euclidean algorithm) be more efficient than the while loop at the end? I think it would converge faster.Devastation
Let us continue this discussion in chat.Epithelium
In C++ code, I suggest replacing DBL_MANT_DIG by std::numeric_limits<double>::digits and DBL_MAX_EXP by std::numeric_limits<double>::max_exponentPointdevice
In C++ code, consider using boost::rational<long>, which normalizes numerator and denominator within its constructor (the double values must be explicitly converted to long, though), thus there is no need for while (*numerator && ...Pointdevice
@TomášKolárik Although std::numeric_limits<double>::digits would make for a better C++ answer, the OP, using <*.h> headers, seems more interested in a C-like answer. Feel free to post a more idiomatic C++ answer. boost:: is not part of standard C++ AFAIK, so its inclusion drifts away from standard C++. Again, you could answer with your keen knowledge of boost.Epithelium
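
A minimal sketch of the Euclidean-GCD reduction raised in the comments above (illustrative only; it assumes both parts are finite, non-negative whole numbers exactly representable in double, and note that for a binary format the common factor is always a power of two):

#include <math.h>

/* Euclidean GCD on whole-number doubles. */
static double gcd_double(double a, double b)
{
  while (b != 0.0) {
    double r = fmod(a, b);
    a = b;
    b = r;
  }
  return a;
}

/* Possible use inside split(), replacing the factor-of-2 stripping loop:
 *   double g = gcd_double(fabs(*numerator), *denominator);
 *   if (g > 1.0) { *numerator /= g; *denominator /= g; }
 */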
