What is a good way to round double-precision values to a (somewhat) lower precision?

My problem is that I have to use a third-party function/algorithm which takes an array of double-precision values as input, but is apparently sensitive to very small changes in the input data. However, for my application I have to get identical results for inputs that are (almost) identical! In particular, I have two test input arrays which are identical up to the 5th position after the decimal point, and still I get different results. So whatever causes the "problem" must lie after the 5th position after the decimal point.

Now my idea was to round the input to a slightly lower precision in order to get identical results from inputs that are very similar, yet not 100% identical. Therefore I am looking for a good/efficient way to round double-precision values to a slightly lower precision. So far I am using this code to round to the 9th position after the decimal point:

double x = original_input();
x = double(qRound(x * 1000000000.0)) / 1000000000.0;

Here qRound() is the normal double-to-integer rounding function from Qt. This code works, and it indeed resolved my problem with the two "problematic" test sets. But: is there a more efficient way to do this?
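For reference, a portable sketch of the same idea without Qt (the function name is my own). Note that qRound() returns an int, so the scaled value can overflow for inputs beyond roughly +/-2.1; std::llround avoids that by rounding to a 64-bit integer:

```cpp
#include <cmath>

// Sketch: round to 9 decimal places, like the qRound() version above,
// but via std::llround so the scaled value fits a 64-bit integer.
double roundTo9(double x) {
    const double scale = 1000000000.0;  // 10^9
    return static_cast<double>(std::llround(x * scale)) / scale;
}
```

This keeps the decimal (base-10) semantics of the original snippet; the answers below discuss base-2 alternatives.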

Also, what bothers me: rounding to the 9th position after the decimal point might be reasonable for input data in the -100.0 to 100.0 range (as is the case with my current input data). But it may be too much (i.e., too much precision loss) for input data in the -0.001 to 0.001 range, for example. Unfortunately I don't know what range my input values will be in for other cases...

After all, I think what I need is a function which does the following: cut off, by proper rounding, a given double-precision value X to at most L-N positions after the decimal point, where L is the number of positions after the decimal point that double precision can store (represent) for the given value, and N is fixed, like 3. This means that for "small" values we would allow more positions after the decimal point than for "large" values. In other words, I would like to round the 64-bit floating-point value to a (somewhat) smaller precision, like 60 bits or 56 bits, and then store it back in a 64-bit double value.

Does this make sense to you? And if so, can you suggest a way to do this (efficiently) in C++?

Thanks in advance!

Bryson answered 4/1, 2013 at 2:7 Comment(6)
Do you want to round it base-10, or is base-2 fine too? (Saritasarkaria)
Hi. I think base-2 would be okay too, as long as it adapts to the input. (Bryson)
The idea is fundamentally flawed. All numbers are "almost identical" in the sense that 1.00 is almost identical to 1.01 and 1.01 is almost identical to 1.02, etc. Thus, if f(1.00) == f(1.01) and f(1.01) == f(1.02), then also f(1.00) == f(1.02), and also f(1.00) == f(1E7). (Herniorrhaphy)
MSalters, I understand what you mean. Sure, actually what I do is "quantize" the input values into a number of "bins", where each bin covers a certain range of input values (with the bins getting "wider" for "large" values). Finally I replace each value with its bin's mean value. It can still happen that two values are very close, but one happens to be just "left" of a boundary and the other just "right" of it. Probably a case I need to live with. Or do you have better suggestions to deal with that? (Bryson)
Rounding does not solve the problem you describe. I will use decimal for easy illustration. Suppose you have two results you expect to be identical, 10.22 and 10.24, and you round them to three digits, obtaining 10.2 and 10.2. They are identical, and things are fine. However, if the results were 10.24 and 10.26, then rounding them produces 10.2 and 10.3, and they are not identical. Rounding will not make close but non-identical results identical in the absence of some other specification, such as that all the results are near the centers of the rounding intervals, not near the borders. (Huxley)
Nonetheless, if you wish to round, I show an efficient way to do this, Dekker's algorithm, in this answer. (Huxley)

If you take a look at the double bit layout, you can see how to combine it with a bit of bitwise magic to implement fast (binary) rounding to arbitrary precision. You have the following bit layout:

SEEEEEEEEEEEFFFFFFFFFFF.......FFFFFFFFFF

where S is the sign bit, the Es are exponent bits, and the Fs are fraction bits. You can make a bitmask like this:

11111111111111111111111.......1111000000

and bitwise-and (&) the two together. The result is a rounded version of the original input:

SEEEEEEEEEEEFFFFFFFFFFF.......FFFF000000

And you can control how much data is chopped off by changing the number of trailing zeros. More zeros = more rounding; fewer = less. You also get the other effect that you want: small input values are affected proportionally less than large input values, since the "place" each bit corresponds to is determined by the exponent.

Hope that helps!

Caveat: This is technically truncation rather than true rounding (all values will become closer to zero, regardless of how close they are to the other possible result), but hopefully it's just as useful in your case.
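To address the question in the comments below about how to do the bit operation legally in C++: a minimal sketch using std::memcpy, which is the well-defined way to reinterpret a double's bits (the function name and the zeroBits parameter are my own):

```cpp
#include <cstdint>
#include <cstring>

// Sketch: clear the lowest 'zeroBits' fraction bits of a double,
// i.e. bitwise-and with the mask described above. memcpy avoids the
// undefined behaviour of pointer casts and the union hack.
double truncateBits(double v, int zeroBits) {
    std::uint64_t bits;
    std::memcpy(&bits, &v, sizeof bits);            // double -> integer bits
    bits &= ~((std::uint64_t(1) << zeroBits) - 1);  // zero the trailing bits
    std::memcpy(&v, &bits, sizeof bits);            // integer bits -> double
    return v;
}
```

Endianness is not an issue here, because the value is copied into a single 64-bit integer rather than inspected byte by byte.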

Telephotography answered 4/1, 2013 at 2:36 Comment(3)
Thanks, Xavier. This might be a solution. But how exactly do I do that bit operation in C++ code? Cast the "double" value to an "unsigned char*" pointer? Do I need to care about endianness? (Bryson)
@Bryson - That's probably the best way to do it. I looked around a bit, and there's nothing particularly convenient... There's also the "union hack" - people around here love to point out that it's technically undefined behaviour, but it should work on every mainstream compiler. (Telephotography)
The only legal way of doing the cast is a memcpy from the double to an integer (and back). Type punning and the union hack are both undefined behavior. The memcpy version could be wrapped in a byte_cast<> template. (Bo)

Thanks for the input so far.

However, after some more searching, I came across the frexp() and ldexp() functions! They give me access to the mantissa and exponent of a given double value, and can also convert a mantissa + exponent back to a double. Now I just need to round the mantissa.

double value = original_input();
static const double FACTOR = 32.0;  // larger FACTOR = more precision kept
int exponent;
double temp = round(frexp(value, &exponent) * FACTOR);  // frexp's mantissa is in [0.5, 1)
value = ldexp(temp / FACTOR, exponent);

I don't know if this is efficient at all, but it gives reasonable results (original input on the left, rounded value on the right):

0.000010000000000   0.000009765625000
0.000010100000000   0.000010375976563
0.000010200000000   0.000010375976563
0.000010300000000   0.000010375976563
0.000010400000000   0.000010375976563
0.000010500000000   0.000010375976563
0.000010600000000   0.000010375976563
0.000010700000000   0.000010986328125
0.000010800000000   0.000010986328125
0.000010900000000   0.000010986328125
0.000011000000000   0.000010986328125
0.000011100000000   0.000010986328125
0.000011200000000   0.000010986328125
0.000011300000000   0.000011596679688
0.000011400000000   0.000011596679688
0.000011500000000   0.000011596679688
0.000011600000000   0.000011596679688
0.000011700000000   0.000011596679688
0.000011800000000   0.000011596679688
0.000011900000000   0.000011596679688
0.000012000000000   0.000012207031250
0.000012100000000   0.000012207031250
0.000012200000000   0.000012207031250
0.000012300000000   0.000012207031250
0.000012400000000   0.000012207031250
0.000012500000000   0.000012207031250
0.000012600000000   0.000012817382813
0.000012700000000   0.000012817382813
0.000012800000000   0.000012817382813
0.000012900000000   0.000012817382813
0.000013000000000   0.000012817382813
0.000013100000000   0.000012817382813
0.000013200000000   0.000013427734375
0.000013300000000   0.000013427734375
0.000013400000000   0.000013427734375
0.000013500000000   0.000013427734375
0.000013600000000   0.000013427734375
0.000013700000000   0.000013427734375
0.000013800000000   0.000014038085938
0.000013900000000   0.000014038085938
0.000014000000000   0.000014038085938
0.000014100000000   0.000014038085938
0.000014200000000   0.000014038085938
0.000014300000000   0.000014038085938
0.000014400000000   0.000014648437500
0.000014500000000   0.000014648437500
0.000014600000000   0.000014648437500
0.000014700000000   0.000014648437500
0.000014800000000   0.000014648437500
0.000014900000000   0.000014648437500
0.000015000000000   0.000015258789063
0.000015100000000   0.000015258789063
0.000015200000000   0.000015258789063
0.000015300000000   0.000015869140625
0.000015400000000   0.000015869140625
0.000015500000000   0.000015869140625
0.000015600000000   0.000015869140625
0.000015700000000   0.000015869140625
0.000015800000000   0.000015869140625
0.000015900000000   0.000015869140625
0.000016000000000   0.000015869140625
0.000016100000000   0.000015869140625
0.000016200000000   0.000015869140625
0.000016300000000   0.000015869140625
0.000016400000000   0.000015869140625
0.000016500000000   0.000017089843750
0.000016600000000   0.000017089843750
0.000016700000000   0.000017089843750
0.000016800000000   0.000017089843750
0.000016900000000   0.000017089843750
0.000017000000000   0.000017089843750
0.000017100000000   0.000017089843750
0.000017200000000   0.000017089843750
0.000017300000000   0.000017089843750
0.000017400000000   0.000017089843750
0.000017500000000   0.000017089843750
0.000017600000000   0.000017089843750
0.000017700000000   0.000017089843750
0.000017800000000   0.000018310546875
0.000017900000000   0.000018310546875
0.000018000000000   0.000018310546875
0.000018100000000   0.000018310546875
0.000018200000000   0.000018310546875
0.000018300000000   0.000018310546875
0.000018400000000   0.000018310546875
0.000018500000000   0.000018310546875
0.000018600000000   0.000018310546875
0.000018700000000   0.000018310546875
0.000018800000000   0.000018310546875
0.000018900000000   0.000018310546875
0.000019000000000   0.000019531250000
0.000019100000000   0.000019531250000
0.000019200000000   0.000019531250000
0.000019300000000   0.000019531250000
0.000019400000000   0.000019531250000
0.000019500000000   0.000019531250000
0.000019600000000   0.000019531250000
0.000019700000000   0.000019531250000
0.000019800000000   0.000019531250000
0.000019900000000   0.000019531250000
0.000020000000000   0.000019531250000
0.000020100000000   0.000019531250000

Seems like what I was looking for after all:

http://img833.imageshack.us/img833/9055/clipboard09.png

Now I just need to find a good FACTOR value for my function...

Any comments or suggestions?

Bryson answered 4/1, 2013 at 14:42 Comment(1)
I need to do exactly the same thing, also in base 2, but in C#. Hopefully I can find similar ways of fiddling with double-precision bits in C# as in C++. (Excelsior)

The business scenario is not evident from the question; still, I suspect you are trying to check whether the values are within an acceptable range. Rather than comparing with ==, you can check whether the second value is within a certain percentage range (say +/- 0.001%) of the first.

If the range percentage cannot be fixed (i.e., it varies with the precision length; say, for 2 decimal places 0.001 percent is fine, but for 4 decimals 0.000001 percent is needed), then you can derive it from 1/mantissa.
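A sketch of the fixed-percentage variant of this idea (names and the exact tolerance formula are my own choices):

```cpp
#include <algorithm>
#include <cmath>

// Sketch: treat a and b as equal when their difference is within
// relTol (a fraction, e.g. 1e-5 for 0.001%) of the larger magnitude.
bool nearlyEqual(double a, double b, double relTol) {
    double scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= relTol * scale;
}
```

Scaling the tolerance by the larger magnitude makes the comparison behave consistently for both very small and very large inputs, which is the adaptivity the question asks for.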

Rambler answered 7/6, 2013 at 10:32 Comment(0)

I know that this question is quite old, but I also searched for an approach to round double values to a lower precision. Maybe this answer helps someone out there.

Imagine a floating point number in binary representation. For example 1101.101. The bits 1101 represent the integral part of the number and are weighted with 2^3, 2^2, 2^1, 2^0 from left to right. The bits 101 on the fractional part are weighted with 2^-1, 2^-2, 2^-3, which equals 1/2, 1/4, 1/8.

So what decimal error do you produce when you cut off your number two bits after the decimal point? It is 0.125 in this example, since that bit is set. If the bit were not set, the error would be 0. So the error is <= 0.125.

Now think in a more general way: if you had an infinitely long mantissa, the sum over the fractional part converges to 1 (see here). In fact, you only have 52 bits (see here), so the sum is "almost" 1. So cutting off all fractional bits will cause an error of <= 1, which is not really a surprise! (Keep in mind that your integral part also occupies mantissa space! But if you assume a number like 1.5, which is 1.1 in binary representation, your mantissa only stores the part after the decimal point.)

Since cutting off all fractional bits causes an error of <= 1, cutting off all but the first bit right of the decimal point causes an error of <= 1/2 because this bit is weighted with 2^-1. Keeping a further bit decreases your error to <= 1/4.

This can be described by a function f(x) = 1/2^(52-x) where x is the number of cut off bits counted from the right side and y = f(x) is an upper bound of your resulting error.

Rounding to two places after the decimal point means "grouping" numbers by common hundredths. This can be done with the above function: 1/100 >= 1/2^(52-x). This means that your resulting error is bounded by a hundredth when cutting off x bits. Solving this inequality for x yields 52 - log2(100) >= x, where 52 - log2(100) is about 45.36. This means that cutting off no more than 45 bits ensures a "precision" of two decimal(!) positions after the floating point.

In general, your mantissa consists of an integral and a fractional part. Let's call their lengths i and f. Positive exponents describe i; moreover, 52 = f + i holds. The solution of the above inequality then changes to 52 - i - log2(10^n) >= x, because once your fractional part is exhausted, you have to stop cutting off the mantissa! (Here n is the decimal precision.)

Applying logarithm rules, you can compute the number of maximum allowed bits to cut off like this:

x = f - (uint16_t) ceil(n / 0.3010299956639812);

where the constant represents log10(2). Truncation can then be done with:

mantissa >>= x;
mantissa <<= x;

If x is larger than f, remember to shift only by f. Otherwise, you will affect your mantissa's integral part.
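Putting the steps above together, here is a self-contained sketch, assuming IEEE-754 binary64 doubles (function and variable names are my own):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstring>

// Sketch: clear as many trailing mantissa bits as possible while
// still preserving n decimal places, per the derivation above.
double cutMantissa(double v, int n) {
    if (v == 0.0 || !std::isfinite(v)) return v;
    int e = std::ilogb(v);                 // unbiased binary exponent
    int i = std::max(e + 1, 0);            // bits taken by the integral part
    int f = 52 - i;                        // remaining fractional bits
    if (f <= 0) return v;                  // nothing safe to cut
    // x = f - ceil(n / log10(2)), i.e. f - ceil(log2(10^n))
    int x = f - static_cast<int>(std::ceil(n / 0.3010299956639812));
    if (x <= 0) return v;                  // required precision uses all bits
    std::uint64_t bits;
    std::memcpy(&bits, &v, sizeof bits);   // well-defined type reinterpretation
    bits >>= x;                            // shift the x lowest bits out...
    bits <<= x;                            // ...and back in as zeros
    std::memcpy(&v, &bits, sizeof bits);
    return v;
}
```

For example, cutMantissa(3.14159..., 2) computes e = 1, i = 2, f = 50, x = 43, so 43 of the 52 mantissa bits are cleared while the result still agrees with the input to two decimal places.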

Worrisome answered 15/5, 2019 at 11:16 Comment(0)
