Efficient way to round double precision numbers to a lower precision given in number of bits

In C#, I want to round doubles to a lower precision so that I can store them in buckets of varying size in an associative array. Unlike ordinary rounding, I want to round to a number of significant bits. Thus large numbers will change much more in absolute terms than small numbers, but they will tend to change by the same proportion. So if I want to round to 10 binary digits, I find the ten most significant bits and zero out all the lower bits, possibly adding a small number first for rounding up.

I prefer "half-way" numbers be rounded up.

If it were an integer type, here is a possible algorithm (a rough C# sketch follows the list):

  1. Find: H, the zero-based index of the most significant bit that is set.
  2. Compute: B = H - P, 
       where P is the number of significant binary digits of precision to keep
       and B is the binary digit at which rounding starts, where B = 0 is the ones place, 
       B = 1 is the twos place, etc. 
  3. Add: x = x + 2^B 
       This will force a carry if necessary (we round halfway values up).
  4. Zero out: x = x - (x mod 2^(B+1)). 
       This clears the B place and all lower digits.
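
For illustration, here is a rough C# sketch of those four steps. The names RoundToSignificantBits and HighestSetBit are placeholders of my own, not library calls; HighestSetBit is sketched after the next paragraph, and the corner case where the carry in step 3 overflows 64 bits is ignored.

// Rough sketch of the integer algorithm above (placeholder names, not a library API).
public static ulong RoundToSignificantBits(ulong x, int precision)
{
    if (x == 0) return 0;
    int H = HighestSetBit(x);         // Step 1: zero-based index of the highest set bit.
    int B = H - precision;            // Step 2: bit position at which rounding occurs.
    if (B < 0) return x;              // Already fits in the requested number of bits.
    x += 1UL << B;                    // Step 3: add 2^B so halfway values carry upward.
    x &= ~((1UL << (B + 1)) - 1UL);   // Step 4: clear bit B and everything below it.
    return x;
}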

The problem is finding an efficient way to find the highest set bit. If I were using integers, I could use one of the well-known bit hacks to find the MSB. I do not want to call Round(Log2(x)) if I can help it. This function will be called many millions of times.
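
For reference, here is one such bit hack ported to C# (a plain binary search on the bits; the method name is my own placeholder). On newer runtimes, something like System.Numerics.BitOperations.LeadingZeroCount could be used instead, where available.

// Sketch: zero-based index of the highest set bit (-1 for zero), found by successive halving.
private static int HighestSetBit(ulong v)
{
    if (v == 0) return -1;
    int msb = 0;
    if ((v >> 32) != 0) { msb += 32; v >>= 32; }
    if ((v >> 16) != 0) { msb += 16; v >>= 16; }
    if ((v >> 8) != 0)  { msb += 8;  v >>= 8; }
    if ((v >> 4) != 0)  { msb += 4;  v >>= 4; }
    if ((v >> 2) != 0)  { msb += 2;  v >>= 2; }
    if ((v >> 1) != 0)  { msb += 1; }
    return msb;
}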

Note: I have read this SO question:

What is a good way to round double-precision values to a (somewhat) lower precision?

It works for C++. I am using C#.

UPDATE:

This is the code (modified from what the answerer supplied) as I am using it:

/// <summary>
/// Round numbers to a specified number of significant binary digits.
/// 
/// For example, to 3 places, numbers from zero to seven are unchanged, because they only require 3 binary digits,
/// but larger numbers lose precision:
/// 
///      8    1000 => 1000   8
///      9    1001 => 1010  10
///     10    1010 => 1010  10
///     11    1011 => 1100  12
///     12    1100 => 1100  12
///     13    1101 => 1110  14
///     14    1110 => 1110  14
///     15    1111 =>10000  16
///     16   10000 =>10000  16
///     
/// This differs from ordinary rounding in that the rounding position is specified as a distance
/// (in binary digits) to the right of the highest set bit, rather than as a fixed distance from the binary point.
/// </summary>
/// <param name="d">Number to be rounded.</param>
/// <param name="digits">Number of binary digits of precision to preserve. </param>
public static double AdjustPrecision(this double d, int digits)
{
    // TODO: Not sure if this will work for both normalized and denormalized doubles. Needs more research.
    var shift = 53 - digits; // IEEE 754 doubles have 53 bits of significand, but one bit is "implied" and not stored.
    ulong significandMask = (0xffffffffffffffffUL >> shift) << shift;
    var local_d = d;
    unsafe
    {
        // double -> fixed point (sorta)
        ulong toLong = *(ulong*)(&local_d);
        // mask off your least-sig bits
        var modLong = toLong & significandMask;
        // fixed point -> float (sorta)
        local_d = *(double*)(&modLong);
    }
    return local_d;
}
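
A quick sanity check with a few values of my own: because this version only masks bits off, it truncates toward zero rather than rounding to nearest, so any value whose discarded low bits are nonzero comes out lower than the rounded table above shows.

// My own spot checks (assumes the extension method above is in scope):
double a = 9.0.AdjustPrecision(3);    // 1001 -> 1000 =  8 (truncated; the table above shows 10)
double b = 15.0.AdjustPrecision(3);   // 1111 -> 1110 = 14 (truncated; the table above shows 16)
double c = 10.0.AdjustPrecision(3);   // 1010 -> 1010 = 10 (already exact in 3 significant bits)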

UPDATE 2: Dekker's Algorithm

I derived this from Dekker's algorithm, thanks to the other respondent. It rounds to the closest value, instead of truncating as the above code does, and it uses only safe code:

// PowersOfTwoPlusOne[k] = 2^k + 1 (k = 0 is special-cased below): multiplying by it makes Dekker's split round away the low k bits.
private static double[] PowersOfTwoPlusOne;

static NumericalAlgorithms()
{
    PowersOfTwoPlusOne = new double[54];
    for (var i = 0; i < PowersOfTwoPlusOne.Length; i++)
    {
        if (i == 0)
            PowersOfTwoPlusOne[i] = 1; // Special case.
        else
        {
            long two_to_i_plus_one = (1L << i) + 1L;
            PowersOfTwoPlusOne[i] = (double)two_to_i_plus_one;
        }
    }
}

public static double AdjustPrecisionSafely(this double d, int digits)
{
    // Dekker's split: multiplying by 2^(53 - digits) + 1 and subtracting back
    // forces the low (53 - digits) bits of d to be rounded away (round-to-nearest, ties to even).
    double t = d * PowersOfTwoPlusOne[53 - digits];
    double adjusted = t - (t - d);
    return adjusted;
}
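
A few spot checks of my own: as the answer below points out, this rounds ties to even rather than upward, so a couple of values differ from the table in the first update.

// My own spot checks (assumes the extension method above is in scope):
double r1 = 9.0.AdjustPrecisionSafely(3);   //  8: halfway between 8 and 10, so the tie goes to the even significand
double r2 = 11.0.AdjustPrecisionSafely(3);  // 12: this tie also goes to the even significand (matches the table)
double r3 = 10.0.AdjustPrecisionSafely(3);  // 10: already exact in 3 significant bits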

UPDATE 3: TIMINGS

I ran a test and found that Dekker's algorithm is better than TWICE as fast!

Number of calls in test: 100,000,000
Unsafe Time = 1.922 (sec)
Safe Time = 0.799 (sec)

Deidradeidre answered 11/1, 2013 at 19:41 Comment(4)
After reading What is a good way to round double-precision values to a (somewhat) lower precision?, what have you come up with in terms of code? Why not try what you have read, and if it needs to be refactored or optimized, report back here with results and/or errors. If you already have code, post it along with your original question.Bluet
You might need to be specific as to what the problem is in implementing the SO post you've linked to.Merca
Of course. The other post is for C++, and calls the ldexp and frexp functions, for which I do not know an analog in C#.Deidradeidre
Aren't Scalb(y, n) and Logb(x) the equivalent of scalb and ilogb? If so, they have the capabilities of frexp and ldexp with a slightly different API...Doorstone

Dekker’s algorithm will split a floating-point number into high and low parts. If there are s bits in the significand (53 in IEEE 754 64-bit binary), then *x0 receives the high s-b bits, which is what you requested, and *x1 receives the remaining bits, which you may discard. In the code below, Scale should have the value 2^b. If b is known at compile time, e.g., the constant 43, you can replace Scale with 0x1p43. Otherwise, you must produce 2^b in some way.

This requires round-to-nearest mode. IEEE 754 arithmetic suffices, but other reasonable arithmetic may be okay too. It rounds ties to even, which is not what you requested (ties upward). Is that necessary?

This assumes that x * (Scale + 1) does not overflow. The operations must be evaluated in double precision (not greater).

void Split(double *x0, double *x1, double x)
{
    double d = x * (Scale + 1);   // x*Scale + x, computed with a single rounding.
    double t = d - x;             // Approximately x*Scale.
    *x0 = d - t;                  // x with its low b bits rounded away: the high part.
    *x1 = x - *x0;                // The low part: exactly the bits that were rounded off.
}
Biscay answered 11/1, 2013 at 20:2 Comment(14)
Rounding to even will be OK if everything else works. I will look at Dekker's algorithm as you suggest.Deidradeidre
Your note led me to a paper called "ACCURATE FLOATING-POINT SUMMATION" by Rump, Ogita, and Oishi. Pure magic!Deidradeidre
I may eventually go with this approach, once I understand how it works!Deidradeidre
It works thusly: First, we want x*Scale, because this has a high bit where we want it to make a later rounding happen at the right spot. But we actually want a slightly greater value that allows us to subtract x and still have that same high bit. So we calculate x*Scale+x, which is x*(Scale+1). That is d. Next, observe that d-(d-x) would be exactly x if computed exactly. However, d-x is performed in floating point, so it is rounded. d is greater than x, so the low bits of x are below the significand. So d-(d-x) gives us x with those low bits rounded away.Biscay
I tested this algorithm, and it rounds to the closest value, which is better than the unsafe code I used. I am about to test the speed.Deidradeidre
Your method (Dekker's algorithm) is better than twice as fast as the unsafe code. Great answer.Deidradeidre
I am not aware of Dekker's algorithm. Do you have a source that discusses it?Triboelectricity
The problem is that there is another Dekker's algorithm: en.wikipedia.org/wiki/Dekker%27s_algorithmTriboelectricity
1) double d = x * (Scale + 1); needs additional code to handle potential overflow (alluded to in the answer). 2) double d = x * (Scale + 1); may incur a rounding error which may be important on edge cases.Gerhardine
@chux: x * (Scale + 1) is intended to round, with the subsequent operations separating out the change caused by that rounding. Do you perceive a case where this is incorrect? Can you show an example with specific values?Biscay
@EricPostpischil I failed to find a failing case, aside from overflow issues. Excuse me while I have a slice of humble pieGerhardine
I implemented this algorithm in Java and it failed on one pathological case I tried, namely 1.026606072791026E308, which has a binary representation of 0 11111111110 0010010001100011000110011111011100100100111000111011, with b = 29. I'll see if I can reproduce the problem in, and port my solution to, .NET.Longsighted
@KevinJin: As stated in the answer, the code shown assumes that x * (Scale + 1) does not overflow. To handle cases close to the upper limit of the format, you can simply scale down, e.g., by multiplying by 2**-100, apply the algorithm, and scale back up, e.g., by multiplying by 2**100. Note that the correct result may still be infinity, if the number was such that rounding it up at the selected position causes it to overflow the finite range.Biscay
I see. There probably is some constant threshold at which the number first needs to be scaled down and then scaled back up. I'll see if I can find that threshold so that I can add a conditional branch that will make this solution work for the entire representable range.Longsighted

Interesting...never heard of a need for this, but I think you can "do it" via some funky unsafe code...

void Main()
{
    // how many bits you want "saved"
    var maxBits = 20;

    // create a mask like 111...1000...0 (binary) where the number of 1s == maxBits
    var shift = (sizeof(int) * 8) - maxBits;
    var maxBitsMask = (0xffffffff >> shift) << shift;

    // some floats
    var floats = new []{ 1.04125f, 2.19412347f, 3.1415926f};
    foreach (var f in floats)
    {
        var localf = f;
        unsafe
        {
            // float -> fixed point (sorta)
            int toInt = *(int*)(&localf);
            // mask off your least-sig bits
            var modInt = toInt & maxBitsMask;
            // fixed point -> float (sorta)
            localf = *(float*)(&modInt);
        }
        Console.WriteLine("Was {0}, now {1}", f, localf);
    }
}

And with doubles:

void Main()
{
    var maxBits = 50;
    var shift = (sizeof(long) * 8) - maxBits;
    var maxBitsMask = (0xffffffffffffffff >> shift) << shift;
    var doubles = new []{ 1412.04125, 22.19412347, 3.1415926};
    foreach (var d in doubles)
    {
        var local = d;
        unsafe
        {
            var toLong = *(ulong*)(&local);
            var modLong = toLong & maxBitsMask;
            local = *(double*)(&modLong);
        }
        Console.WriteLine("Was {0}, now {1}", d, local);
    }
}

Aww...I got unaccepted. :)

For completeness, here's a version using Jeppe's "unsafe-free" approach:

void Main()
{
    var maxBits = 50;
    var shift = (sizeof(long) * 8) - maxBits;
    var maxBitsMask = (long)((0xffffffffffffffff >> shift) << shift);
    var doubles = new []{ 1412.04125, 22.19412347, 3.1415926};
    foreach (var d in doubles)
    {
        var asLong = BitConverter.DoubleToInt64Bits(d);
        var modLong = asLong & maxBitsMask;
        var local = BitConverter.Int64BitsToDouble(modLong);
        Console.WriteLine("Was {0}, now {1}", d, local);
    }
}
Wiedmann answered 11/1, 2013 at 20:10 Comment(13)
Intriguing. You use floats. Does anything have to change for doubles? And does this work for both normalized and unnormalized doubles?Deidradeidre
@PaulChernoch Oh, as for normalized/unnormalized, not a bloody clue. :)Wiedmann
I figured out the double thing. It works! I do not know about normalized/unnormalized yet, but until I can do more research, I will use this. Thanks! (BTW: I use shift = 53 - digits, since IEEE 754 has 53 bits of mantissa, even if one is implied.)Deidradeidre
@PaulChernoch No problem - it should be relatively fast, too - I mean, it's all just bit-shifts and casts in theory...Wiedmann
@JerKimball: Unfortunately, it is not just bit shifts and casts. It is also a transfer of the data from wherever the processor keeps floating-point objects to where the processor keeps integer objects. In many modern processors, those are separate registers, and the transfer requires either a special move instruction or a round-trip through memory. That is fine for an occasional operation, but it can be a problem when it must be performed many times in a high-performance application. That is why a solution using only floating-point operations may be appealing.Biscay
@EricPostpischil You are, of course, correct - I wonder how apt it would be to use the GPU for something like this...overkill?Wiedmann
@JerKimball: No, not overkill, provided the data is already in the GPU. There are various applications that need to process a lot of data quickly, and doing things like this is worth it. It still surprises me that, although computers are so fast today, people still have massive amounts of data they want to process even faster, even in consumer applications.Biscay
Instead of using unsafe code, you can use the static methods BitConverter.DoubleToInt64Bits and BitConverter.Int64BitsToDouble to convert between double (System.Double) and long (System.Int64). You can then do the bit-wise AND on the long value in "safe" context, of course.Yahairayahata
@jeppe-stig-nielsen good call. I'd be interested as to how they compare performance-wise, given the original questionWiedmann
As always, if you want to know about performance, measure for yourself. I wouldn't be surprised if DoubleToInt64Bits were implemented almost like your code above. Actually, I just found the source: public static unsafe long DoubleToInt64Bits(double value) { return *((long *)&value); }Yahairayahata
@JeppeStigNielsen Hah - fair enough; only idly commenting since I wasn't in front of a computer. :)Wiedmann
When I have time, I will investigate using BitConverter.DoubleToInt64Bits. The implementation I show above is very fast, but my boss prefers I not use unsafe code. (In some of my tests, I call it 32 million times when performing a convolution over a single insurance policy. Typically, the application will process hundreds of thousands of policies. So you can see, it needs to be REALLY fast.)Deidradeidre
@PaulChernoch @JeppeStigNielsen Added the BitConverter route example for completeness (momentary twinge of sadness as I got 'unaccepted' due to other way being faster;) )Wiedmann
