Fixed-width Floating-Point Numbers in C/C++

Asked 26/8, 2009 at 0:55 Answered 20/9, 2023 at 10:18

int is usually 32 bits, but in the standard, int is not guaranteed to have a constant width. So if we want a 32 bit int we include stdint.h and use int32_t.

Is there an equivalent of this for floats? I realize it's a bit more complicated with floats since they aren't stored in a homogeneous fashion, i.e. sign, exponent, significand. I just want a double that is guaranteed to be stored in 64 bits with 1 sign bit, an 11-bit exponent, and a 52/53-bit significand (depending on whether you count the hidden bit).

Kwok answered 26/8, 2009 at 0:55 Comment(9)

Typically the number of bits for an int is needed because it is encoding some sort of object flags. Why do you need assurances on the precision of your floats? In most cases I've seen people tend to overstate the importance of the sizes of random variables. Often you'd be better off using the machine's default word size than trying to squeeze out 3 bytes of memory or arbitrarily using 32 bit values. – Tello 26/8, 2009 at 1:9

@Andrew-Khosravian I'm writing a scripting language, and I'd like to be able to make type-size guarantees to my users. That makes code written in my scripting language more portable. – Kwok 26/8, 2009 at 1:35

Portability is fine, but you need to draw the line somewhere - after all, you're probably not expecting your scripting language to run on a PDP-11. Very few platforms do not support IEEE 754, and if that is supported, then it is a reasonable assumption that double is indeed 64 bits (since it's a double-precision floating point value) - and on the off chance that it is not, build in a sanity check so users can report it and you can handle that platform separately. If the platform doesn't support IEEE 754, you're not going to get that representation anyway unless you implement it yourself. – Barracks 26/8, 2009 at 2:50

int is guaranteed to be at least 16 bits, and long int at least 32 bits (although it's actually defined in terms of the range of values representable) - so if you want a variable that can store any integer from -2147483647 to 2147483647, long int is fine. – Afrikah 26/8, 2009 at 7:16

@Afrikah And if I want a variable that can store exactly 32 bits cross-platform? int32_t. – Kwok 26/8, 2009 at 15:53

@Michael Madsen I'm probably going to end up using the builtin double and handling problems when I come to them as you suggest. But I can't accept comments, only answers... – Kwok 26/8, 2009 at 15:55

I'll copy the comment into my answer, and answer your C++ question as well. – Barracks 26/8, 2009 at 16:54

Imagist: Sure, but if you're using it to provide a data type to your scripting language then there's not much practical difference (remembering that signed integer overflow, even in an int32_t, is undefined behaviour - so "exactly 32 bits" doesn't give you any properties you can take advantage of in this context). – Afrikah 27/8, 2009 at 14:31

Possible duplicate of Fixed-size floating point types – Worldshaking 6/5, 2019 at 8:23

According to the current C99 draft standard, annex F, that should be double. Of course, this is assuming your compilers meet that part of the standard.

For C++, I've checked the 0x draft and a draft for the 1998 version of the standard, but neither seems to specify anything about representation like that part of the C99 standard, beyond a bool in numeric_limits that specifies whether IEEE 754/IEC 559 is used on that platform, like Josh Kelley mentions.

Very few platforms do not support IEEE 754, though - it generally does not pay off to design another floating-point format since IEEE 754 is well-defined and works quite nicely - and if that is supported, then it is a reasonable assumption that double is indeed 64 bits (IEEE 754-1985 calls that format double-precision, after all, so it makes sense).

On the off chance that double isn't double-precision, build in a sanity check so users can report it and you can handle that platform separately. If the platform doesn't support IEEE 754, you're not going to get that representation anyway unless you implement it yourself.
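The sanity check described above could look roughly like the following compile-time test. This is only a sketch: the helper name double_is_binary64 is made up for illustration, and the listed properties are the ones I would expect from an IEEE 754 double-precision type, not an exhaustive proof of conformance.

```cpp
#include <climits>   // CHAR_BIT
#include <limits>    // std::numeric_limits

// Hypothetical helper: true when double looks like IEEE 754 binary64.
constexpr bool double_is_binary64() {
    return std::numeric_limits<double>::is_iec559          // implementation claims IEC 559 / IEEE 754
        && sizeof(double) * CHAR_BIT == 64                 // 64 bits of storage
        && std::numeric_limits<double>::radix == 2         // binary format
        && std::numeric_limits<double>::digits == 53       // 52 stored significand bits + hidden bit
        && std::numeric_limits<double>::max_exponent == 1024   // consistent with an 11-bit exponent
        && std::numeric_limits<double>::min_exponent == -1021;
}

// Stops the build on a platform where the assumption does not hold,
// so that platform can be handled separately instead of misbehaving at run time.
static_assert(double_is_binary64(),
              "double is not IEEE 754 binary64 on this platform");
```

If the assertion ever fires, the error message gives users something concrete to report.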

Barracks answered 26/8, 2009 at 1:11 Comment(3)

I disagree that IEEE 754 works quite nicely; it is well entrenched, so there is not much that can be done about it. I do agree that what you want is double and you want to add a sanity check that will fail if someone finds a compiler that has a double that is the wrong size. – Basia 28/8, 2009 at 20:3

@dwelch: I'm not saying it doesn't have its share of issues, or that it is always the best choice, but unless you have a need to be extremely precise or otherwise have very specialized needs when it comes to floating-point, IEEE 754 tends to do the trick, without being exceptionally slow. – Barracks 28/8, 2009 at 21:51

Agreed, the format is fine; it is all the rules on rounding and exceptions that are the problem, making it so that few if any implementations are correct. Something between IEEE 754 and the TI DSP format would be ideal, as the TI DSP format has zero features (but is super fast and easy to implement and get right). – Basia 29/8, 2009 at 16:12

While I don't know of a type that guarantees a particular size and format, you do have a few options in C++. You can use the <limits> header and its std::numeric_limits class template to query the properties of a given type: std::numeric_limits<T>::digits tells you the number of bits in the mantissa (counting the hidden bit), and std::numeric_limits<T>::is_iec559 should tell you whether the type follows the IEEE format. (For sample code that manipulates IEEE numbers at the bit level, see the FloatingPoint class template in Google Test's gtest-internal.h.)
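For a rough idea of what bit-level access looks like once those checks pass, here is a small sketch. It is not the Google Test code; DoubleFields and split_double are hypothetical names, and it assumes double is a 64-bit IEEE 754 type whose byte order matches that of std::uint64_t on the same machine (true on common platforms).

```cpp
#include <cstdint>   // std::uint64_t
#include <cstring>   // std::memcpy
#include <limits>    // std::numeric_limits

// Assumption: double is IEEE 754 binary64 and the same size as uint64_t.
static_assert(std::numeric_limits<double>::is_iec559 &&
              sizeof(double) == sizeof(std::uint64_t),
              "this sketch assumes a 64-bit IEEE 754 double");

// Hypothetical holder for the three fields of a binary64 value.
struct DoubleFields {
    std::uint64_t sign;      // 1 bit
    std::uint64_t exponent;  // 11 bits, biased by 1023
    std::uint64_t fraction;  // 52 stored bits (hidden bit not included)
};

// Splits a double into its raw sign, exponent and fraction fields.
inline DoubleFields split_double(double value) {
    std::uint64_t bits = 0;
    // memcpy is the portable way to read an object representation;
    // casting the pointer instead would break the aliasing rules.
    std::memcpy(&bits, &value, sizeof bits);
    DoubleFields f;
    f.sign     = bits >> 63;
    f.exponent = (bits >> 52) & 0x7FFu;
    f.fraction = bits & ((std::uint64_t{1} << 52) - 1);
    return f;
}
```

For example, split_double(1.0) yields sign 0, exponent 1023 (the bias) and fraction 0.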

Lifton answered 26/8, 2009 at 1:19 Comment(0)

The other issue is the representation of floating-point numbers. This is usually based on the hardware on which you are running (but not always). Most systems use the IEEE 754 floating-point standard, but others can have their own formats as well (an example would be a VAX computer).

Wikipedia explanation of IEEE 754: http://en.wikipedia.org/wiki/IEEE_754-2008

Conch answered 26/8, 2009 at 1:15 Comment(0)

There's no variation in float/double that I'm aware of. Float has been 32 bits for ages and double has been 64. Floating-point semantics are pretty complicated, but constants describing them do exist in

#include <limits>

boost.numeric.bounds is a simpler interface if you don't need everything in std::numeric_limits

Countersign answered 26/8, 2009 at 1:17 Comment(2)

This isn't true across platforms. – Kwok 26/8, 2009 at 1:48

I've seen one compiler (LCC?) that made both float and double 64-bit types. – Marijn 3/7, 2010 at 1:37

Since the C++23 standard (ISO/IEC 14882:2024), there is the <stdfloat> header.

cppreference.com mentions the following:

namespace std {
  #if defined(__STDCPP_FLOAT16_T__)
    using float16_t  = /* implementation-defined */;
  #endif
  #if defined(__STDCPP_FLOAT32_T__)
    using float32_t  = /* implementation-defined */;
  #endif
  #if defined(__STDCPP_FLOAT64_T__)
    using float64_t  = /* implementation-defined */;
  #endif
  #if defined(__STDCPP_FLOAT128_T__)
    using float128_t = /* implementation-defined */;
  #endif
  #if defined(__STDCPP_BFLOAT16_T__)
    using bfloat16_t = /* implementation-defined */;
  #endif
}

I am not aware of a C equivalent. However, you could try something like what is suggested here: https://mcmap.net/q/502734/-where-are-the-fixed-width-floating-types
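As a sketch of how the aliases above might be used while still building on older toolchains: the namespace script, the alias float64 and the macro SCRIPT_HAVE_STDFLOAT are invented for this example. It picks std::float64_t when the header and the __STDCPP_FLOAT64_T__ macro are both available, otherwise falls back to double, and checks the result either way.

```cpp
#include <climits>   // CHAR_BIT
#include <limits>    // std::numeric_limits

// Include <stdfloat> only when the toolchain ships it (C++23 and later).
#if defined(__has_include)
#  if __has_include(<stdfloat>)
#    include <stdfloat>
#    define SCRIPT_HAVE_STDFLOAT 1   // hypothetical macro for this sketch
#  endif
#endif

namespace script {  // hypothetical namespace for the scripting language
#if defined(SCRIPT_HAVE_STDFLOAT) && defined(__STDCPP_FLOAT64_T__)
using float64 = std::float64_t;  // the optional C++23 fixed-width type
#else
using float64 = double;          // fallback, verified by the assertion below
#endif
}  // namespace script

// Whichever type was chosen, insist that it is a 64-bit IEEE 754 type.
static_assert(std::numeric_limits<script::float64>::is_iec559 &&
              sizeof(script::float64) * CHAR_BIT == 64 &&
              std::numeric_limits<script::float64>::digits == 53,
              "script::float64 is not IEEE 754 binary64 on this platform");
```

Note that std::float64_t, where it exists, is a distinct type from double, so overload resolution and implicit conversions can differ between the two branches.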

Thermaesthesia answered 20/9, 2023 at 10:18 Comment(0)

Unfortunately, that's not guaranteed either. You have to check numeric_limits< T > in <limits>.

But then again, I've never heard of an implementation where a double wasn't 64 bits long. If you wanted to just assume, you'd probably get away with it.

Eucalyptus answered 26/8, 2009 at 1:8 Comment(0)


One of the biggest problems with these kinds of "fixed-width types" is that it's so easy to get it wrong. You probably didn't want a 32-bit integer. What's the point? What you did want is an integer type that can store values up to at least 2^31 - 1. That's long int. You don't even need <stdint.h> for that.

Similarly, your scripting language can implement an FP type that will work as long as the underlying C++ float is at least 32 bits. Note that this still doesn't give you precise behavior. I'm fairly certain C++ doesn't guarantee -1.0/-3.0==1.0/3.0

Emblematize answered 26/8, 2009 at 10:32 Comment(4)

No, I definitely do want a 32-bit integer. There are many languages that make guarantees about the sizes of their basic types (C# and Java are two examples). "The point", as you call it, is consistent behavior on all platforms. – Kwok 26/8, 2009 at 15:44

Sorry, but you are missing the point. You want your scripting language to have a precisely 32-bit type. That in no way requires you to use a C++ type having exactly 32 bits. You probably also want guaranteed overflow, guaranteed behavior of %, etc. So you're going to reimplement the math anyway. It's trivial then to make this math modulo 2^32. – Emblematize 27/8, 2009 at 8:40

No, that's int_least32_t, not long int. – Marijn 3/7, 2010 at 1:47

@dan04: check the specs. char is at least 8 bits, int at least 16, and long at least 32. – Emblematize 5/7, 2010 at 7:0
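For what it's worth, the "math modulo 2^32" idea from the comments above can be sketched like this. The function names are hypothetical; the approach leans on the fact that unsigned arithmetic wraps by definition, so signed overflow (undefined behaviour) never occurs. It does use the <cstdint> fixed-width types for brevity, which assumes the platform provides them.

```cpp
#include <cstdint>   // std::int32_t, std::uint32_t, std::uint64_t
#include <limits>    // std::numeric_limits

// Maps a 32-bit pattern to the two's-complement value it represents,
// without relying on implementation-defined narrowing conversions.
inline std::int32_t to_signed(std::uint32_t u) {
    if (u <= 0x7FFFFFFFu) {
        return static_cast<std::int32_t>(u);
    }
    // u is in [2^31, 2^32): subtract 2^31 first, then shift into the negative range.
    return static_cast<std::int32_t>(u - 0x80000000u)
         + std::numeric_limits<std::int32_t>::min();
}

// Addition that wraps modulo 2^32 instead of overflowing.
inline std::int32_t wrap_add_i32(std::int32_t a, std::int32_t b) {
    const std::uint32_t ua = static_cast<std::uint32_t>(a);  // signed -> unsigned is modular
    const std::uint32_t ub = static_cast<std::uint32_t>(b);
    return to_signed(static_cast<std::uint32_t>(ua + ub));
}

// Multiplication that wraps modulo 2^32; the 64-bit intermediate avoids
// any overflow if uint32_t happens to promote to a wider signed int.
inline std::int32_t wrap_mul_i32(std::int32_t a, std::int32_t b) {
    const std::uint64_t product =
        static_cast<std::uint64_t>(static_cast<std::uint32_t>(a)) *
        static_cast<std::uint32_t>(b);
    return to_signed(static_cast<std::uint32_t>(product));
}
```

With these, adding 1 to the largest 32-bit value wraps to the smallest one, which is the Java/C#-style behaviour the question is after.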
