Do floats, doubles, and long doubles have a guaranteed minimum precision?

Asked 2/6, 2015 at 5:10 Answered 2/6, 2015 at 16:31

Solved c++ floating-point language-lawyer floating-point-precision minimum

From my previous question "Is floating point precision mutable or invariant?" I received a response which said,

C provides DBL_DIG, DBL_DECIMAL_DIG, and their float and long double counterparts. DBL_DIG indicates the minimum relative decimal precision. DBL_DECIMAL_DIG can be thought of as the maximum relative decimal precision.

I looked these macros up. They are found in the header <cfloat>. The cplusplus reference page lists macros for float, double, and long double.

Here are the macros for minimum precision values.

FLT_DIG 6 or greater

DBL_DIG 10 or greater

LDBL_DIG 10 or greater

If I took these macros at face value, I would assume that a float has a minimum decimal precision of 6, while a double and long double have a minimum decimal precision of 10. However, being a big boy, I know that some things may be too good to be true.

Therefore, I would like to know: do floats, doubles, and long doubles have a guaranteed minimum decimal precision, and is this minimum decimal precision given by the values of the macros above?

If not, why?

Note: Assume we are using programming language C++.

Sabbatical answered 2/6, 2015 at 5:10 Comment(13)

What does "decimal precision" mean? – Bally 2/6, 2015 at 15:39

@Bally refer to the first link in the question above. – Sabbatical 2/6, 2015 at 15:50

I also can't make any sense of the question in your first link. What statement do you want to make that involves FLT_DIG? Also, you know that these are typically radix-2, not radix-10 formats, right? – Bally 2/6, 2015 at 16:14

@Bally The question in my first link is asking if the decimal precision of floating point values can vary or if they are always the same. This precision is found to change when decimal values are converted to binary and back to decimal again due to rounding error. – Sabbatical 2/6, 2015 at 16:37

OK. What is "the decimal precision"? Any discussion needs to start with a definition of the term, and you haven't yet given one that makes sense outside the context of a decimal floating-point system. – Bally 2/6, 2015 at 16:51

@Bally I believe decimal precision could be interpreted as two different things. As of now, I have found the term being used interchangeably. It's been used to define the number of significant digits in a decimal number. However, the way I'm using it is to define the number of significant digits in a decimal number that have no loss of significance or no loss in the original decimal value after said value has been changed to binary and back to decimal again. – Sabbatical 2/6, 2015 at 16:57

OK. The first definition doesn't make sense for binary floating-point. You should put the second definition into the question so that readers can know exactly what you're talking about. That aligns with the meaning of FLT_DIG and friends as I know it, but I'll leave it to someone else to pull the relevant quotation out of the standard. – Bally 2/6, 2015 at 17:0

Well actually, I said it wrong. The first definition "the number of significant digits in a decimal number" has the same meaning as the second definition "have no loss of significance" because significant numbers are significant in the first place. When you perform operations or get rounding error, the extra or different numbers you get are not significant according to the significant figures definition here. en.wikipedia.org/wiki/Significant_figures includes all digits except spurious digits introduced. – Sabbatical 2/6, 2015 at 17:4

So what I really meant is that decimal precision could also mean the number of digits including non-significant digits in a decimal value. – Sabbatical 2/6, 2015 at 17:5

You want the maximum n such that an n-digit decimal number can be roundtripped through a binary floating-point number without changing its value, right? I have no idea what "significant digits" have to do with anything. – Bally 2/6, 2015 at 17:5

That's correct. And I believe I answered my own question correctly below according to my specific compiler implementations. Please correct me if I'm wrong. – Sabbatical 2/6, 2015 at 17:7

@WanderingFool: For round-trip precision data for the case where the C types are bound to the IEEE-754 specified types, see this blog entry by a knowledgeable author: exploringbinary.com/… – Phonation 2/6, 2015 at 23:28

Another related question that goes into more detail about floating-point precision limits is, Is the most significant decimal digits precision that can be converted to binary and back to decimal without loss of significance 6 or 7.225?. – Sabbatical 15/6, 2015 at 19:44

If std::numeric_limits<F>::is_iec559 is true, then the guarantees of the IEEE 754 standard apply to floating point type F.

Otherwise (and anyway), minimum permitted values of symbols such as DBL_DIG are specified by the C standard, which, indisputably for the library, “is incorporated into [the C++] International Standard by reference”, as quoted from C++11 §17.5.1.5/1.

Edit: As noted by TC in a comment here,

“<climits> and <cfloat> are normatively incorporated by §18.3.3 [c.limits]; the minimum values are specified in turn in §5.2.4.2.2 of the C standard”

Unfortunately for the formal view, first of all, that quote from C++11 is from section 17.5, which is only informative, not normative. And secondly, the wording in the C standard stating that the values specified there are minimums is also in an informative section (the C99 standard's Annex E), not a normative one. So while it can be regarded as an in-practice guarantee, it's not a formal guarantee.

~~One strong indication that the in-practice minimum precision for float is 6 decimal digits, that no implementation will give less:~~

output operations default to precision 6, and this is normative text.

~~Disclaimer: It may be that there is additional wording that provides guarantees that I didn't notice. Not very likely, but possible.~~

Urbina answered 2/6, 2015 at 5:48 Comment(2)

<climits> and <cfloat> are normatively incorporated by §18.3.3 [c.limits]; the minimum values are specified in turn in §5.2.4.2.2 of the C standard. – Christalchristalle 2/6, 2015 at 6:2

@T.C.: Thanks! Updated, & removed the disclaimer (no longer necessary). :) – Urbina 2/6, 2015 at 13:51

Do floats, doubles, and long doubles have guaranteed minimum decimal precision, and is this minimum decimal precision the values of the macros given above?

I can't find any place in the standard that guarantees any minimal values for decimal precision.

The following quote from http://en.cppreference.com/w/cpp/types/numeric_limits/digits10 might be useful:

Example

An 8-bit binary type can represent any two-digit decimal number exactly, but 3-digit decimal numbers 256..999 cannot be represented. The value of digits10 for an 8-bit type is 2 (8 * std::log10(2) is 2.41)

The standard 32-bit IEEE 754 floating-point type has a 24 bit fractional part (23 bits written, one implied), which may suggest that it can represent 7 digit decimals (24 * std::log10(2) is 7.22), but relative rounding errors are non-uniform and some floating-point values with 7 decimal digits do not survive conversion to 32-bit float and back: the smallest positive example is 8.589973e9, which becomes 8.589974e9 after the roundtrip. These rounding errors cannot exceed one bit in the representation, and digits10 is calculated as (24-1)*std::log10(2), which is 6.92. Rounding down results in the value 6.

However, the C standard specifies the minimum values that need to be supported. From the C Standard:

5.2.4.2.2 Characteristics of floating types

...

9 The values given in the following list shall be replaced by constant expressions with implementation-defined values that are greater or equal in magnitude (absolute value) to those shown, with the same sign

...

-- number of decimal digits, q, such that any floating-point number with q decimal digits can be rounded into a floating-point number with p radix b digits and back again without change to the q decimal digits,

...

FLT_DIG 6
DBL_DIG 10
LDBL_DIG 10

Kachine answered 2/6, 2015 at 5:32 Comment(7)

Re "the standard could guarantee a precision of 2 for an 8-bit representations of floats", well that would conflict with limits (informally) required by the C standard, as well as with (normative) required default precision 6 for output. – Urbina 2/6, 2015 at 5:59

With a sign bit and 7 bits of mantissa (to get the 2 digits precision), there would be no space for an exponent, leading the whole idea of a floating-point number ad absurdum. – Lustreware 2/6, 2015 at 9:51

@Cheersandhth.-Alf, does it imply that a conforming implementation must have at least a 32-bit representation for floats? – Kachine 2/6, 2015 at 14:50

@RSahu: The required minimum 6 decimal digits precision (via FLT_DIG) means a required minimum of 10^6 distinct values. Which is about 2^20. So that's 20 bits just for the mantissa. Then you need an exponent, which appears to have minimum some 74 values or thereabouts, which needs 7 bits. Then a sign bit, and then we have a minimum of 28 bits in total. I'd say 32, yes. – Urbina 2/6, 2015 at 15:15

@Cheersandhth.-Alf isn't 2^20 21 bits, since 2^0 is the first bit in binary? – Sabbatical 2/6, 2015 at 15:46

@WanderingFool: 20 bits gives 2^20 possible bit value patterns. When each pattern stands for a unique value, then that's 2^20 values. If those values are numbered 0, 1, 2, and so on, then the integer value 2^20 isn't among those values, but would follow right after the last one. – Urbina 2/6, 2015 at 16:4

My mistake. I confused value for bit. – Sabbatical 2/6, 2015 at 16:13

To be more specific: since my compiler uses the IEEE 754 standard, a float is guaranteed 6 to 9 significant decimal digits of precision and a double 15 to 17. Also, since a long double on my compiler is the same size as a double, it too has 15 to 17 significant decimal digits.

These ranges can be verified from IEEE 754 single-precision binary floating-point format: binary32 and IEEE 754 double-precision binary floating-point format: binary64 respectively.

Sabbatical answered 2/6, 2015 at 16:31 Comment(0)

-1

The C++ Standard says nothing specific about limits on floating point types. You may interpret the incorporation of the C Standard "by reference" as you wish, but if you take the limits as specified there (N1570), section 5.2.4.2.2 subpoint 15:

EXAMPLE 1 The following describes an artificial floating-point representation that meets the minimum requirements of this International Standard, and the appropriate values in a header for type float:
FLT_RADIX 16
FLT_MANT_DIG 6
FLT_EPSILON 9.53674316E-07F
FLT_DECIMAL_DIG 9
FLT_DIG 6
FLT_MIN_EXP -31
FLT_MIN 2.93873588E-39F
FLT_MIN_10_EXP -38
FLT_MAX_EXP +32
FLT_MAX 3.40282347E+38F
FLT_MAX_10_EXP +38

By this section, float, double and long double have these properties at the least.

Refugee answered 2/6, 2015 at 6:4 Comment(0)
