How are floating point numbers stored in memory?

Asked 4/10, 2011 at 7:47 Answered 15/6, 2015 at 15:35

I've read that they're stored in the form of mantissa and exponent.

I've read this document but I could not understand anything.

Preempt answered 4/10, 2011 at 7:47 Comment(3)

The document you linked to explains it rather clearly. What specifically do you find hard to understand? – Moneymaking 4/10, 2011 at 7:54

@MichaelBorgwardt No, it ISN'T clear. It explains how the exponent is stored AFTER introducing a problem that needs this explanation (But what if the number is zero? ... Oh dear). It's like those crime stories where the trick is that they didn't show you all the information but the main character in the story knows it all. – Nepos 4/10, 2011 at 8:18

FWIW, this is pretty helpful with understanding when paired with the accepted answer: h-schmidt.net/FloatConverter/IEEE754.html – Tranquillity 12/5, 2021 at 19:3

To understand how they are stored, you must first understand what they are and what kind of values they are intended to handle.

Unlike integers, a floating-point value is intended to represent extremely small values as well as extremely large ones. For normal 32-bit floating-point values, this corresponds to magnitudes in the range from 1.175494351 * 10^-38 to 3.40282347 * 10^+38.

Clearly, using only 32 bits, it's not possible to store every digit in such numbers.

When it comes to the representation, you can see all normal floating-point numbers as a value in the range 1.0 to (almost) 2.0, scaled with a power of two. So:

1.0 is simply 1.0 * 2^0,
2.0 is 1.0 * 2^1, and
-5.0 is -1.25 * 2^2.
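The decomposition above can be tried out with a small sketch. The names Normalized and normalize are made up for illustration, and the code assumes the input is finite and non-zero. It relies on std::frexp, which returns a mantissa with magnitude in the range 0.5 to (almost) 1.0, so the result is rescaled by a factor of two.

```cpp
#include <cmath>

// Hypothetical helper type: x == mantissa * 2^exponent.
struct Normalized {
    double mantissa;  // signed, magnitude in [1.0, 2.0)
    int exponent;     // power of two
};

// Assumes x is finite and non-zero.
Normalized normalize(double x) {
    int e = 0;
    // std::frexp gives a magnitude in [0.5, 1.0), so shift one
    // factor of two from the exponent into the mantissa.
    double m = std::frexp(x, &e);
    return Normalized{m * 2.0, e - 1};
}
```

For example, normalize(-5.0) gives a mantissa of -1.25 and an exponent of 2, matching the last line of the list.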

So, what is needed to encode this, as efficiently as possible? What do we really need?

The sign of the expression.
The exponent.
The value in the range 1.0 to (almost) 2.0. This is known as the "mantissa" or the significand.

This is encoded as follows, according to the IEEE-754 floating-point standard.

The sign is a single bit.
The exponent is stored as an unsigned integer; for 32-bit floating-point values, this field is 8 bits. A field value of 1 represents the smallest exponent and "all ones - 1" (254) the largest. (0 and "all ones" are used to encode special values, see below.) The stored value is the true exponent plus a bias (127 in the 32-bit case), so a field value of 127 represents an exponent of zero.
When looking at the mantissa (the value between 1.0 and (almost) 2.0), one sees that every possible value starts with a "1" (in both the decimal and the binary representation). This means that there is no point in storing it. The rest of the binary digits are stored in an integer field; in the 32-bit case this field is 23 bits.
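As a rough illustration of the three fields, the sketch below pulls them out of a 32-bit float with shifts and masks. FloatFields and decode are hypothetical names, and the code assumes that float is the IEEE-754 32-bit format on the platform in question.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical struct holding the three raw fields of a 32-bit float.
struct FloatFields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, biased by 127
    std::uint32_t fraction;  // 23 bits, without the implicit leading 1
};

FloatFields decode(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch expects a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // copy the bytes to inspect them
    return FloatFields{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}
```

For -5.0f (that is, -1.25 * 2^2) this yields sign 1, exponent field 129 (2 + 127) and fraction field 0x200000 (0.25 * 2^23).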

In addition to the normal floating-point values, there are a number of special values:

Zero is encoded with both exponent and mantissa as zero. The sign bit is used to represent "plus zero" and "minus zero". A minus zero is useful when the result of an operation is extremely small, but it's still important to know from which direction the result was approached.
Plus and minus infinity -- represented using an "all ones" exponent and a zero mantissa field.
Not a Number (NaN) -- represented using an "all ones" exponent and a non-zero mantissa.
Denormalized numbers -- numbers smaller than the smallest normal number. Represented using a zero exponent field and a non-zero mantissa. What is special about these numbers is that the precision (i.e. the number of significant digits a value can contain) drops as the value becomes smaller, simply because leading zeros take up room in the mantissa.
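The rules in this list can be written down as a small classifier over the raw 32-bit pattern. This is only a sketch: Kind and classify are names invented here, and the sign bit is deliberately ignored.

```cpp
#include <cstdint>

enum class Kind { Zero, Denormal, Normal, Infinity, NaN };

// Classifies a raw 32-bit pattern using the exponent and mantissa fields.
// The sign bit (bit 31) does not affect the category.
Kind classify(std::uint32_t bits) {
    std::uint32_t exponent = (bits >> 23) & 0xFFu;
    std::uint32_t fraction = bits & 0x7FFFFFu;
    if (exponent == 0) {
        return fraction == 0 ? Kind::Zero : Kind::Denormal;
    }
    if (exponent == 0xFFu) {
        return fraction == 0 ? Kind::Infinity : Kind::NaN;
    }
    return Kind::Normal;
}
```

With this, 0x80000000 (minus zero) is classified as Zero, 0x7f800000 as Infinity, and 0x00000001 as Denormal.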

Finally, the following is a handful of concrete examples (all values are in hex):

1.0 : 3f800000
-1234.0 : c49a4000
100000000000000000000000.0 : 65a96816
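These patterns can be reproduced in code. In the sketch below, float_bits reads the pattern of an existing float, and assemble packs a pattern by hand from a sign, an unbiased exponent and a 23-bit fraction. Both names are hypothetical, and the code assumes a 32-bit IEEE-754 float.

```cpp
#include <cstdint>
#include <cstring>

// Returns the raw bit pattern of a 32-bit float.
std::uint32_t float_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch expects a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Hypothetical helper: packs sign, unbiased exponent and 23-bit fraction
// into the encoding of a normal number.
std::uint32_t assemble(std::uint32_t sign, int exponent, std::uint32_t fraction) {
    std::uint32_t biased = static_cast<std::uint32_t>(exponent + 127);
    return (sign << 31) | (biased << 23) | (fraction & 0x7FFFFFu);
}
```

Taking -1234.0 as the worked case: it equals -1.205078125 * 2^10, and 0.205078125 * 2^23 is 0x1a4000, so assemble(1, 10, 0x1a4000) gives 0xc49a4000, the same value float_bits(-1234.0f) returns.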

Dorren answered 4/10, 2011 at 8:39 Comment(2)

is inf, -inf, NaN, zero just notation or does it have any meaning. Cuz having all one exponent and zero mantissa actually equates to 1 right? And all one exponent and non zero mantissa actually returns a legit value... – Shela 1/8, 2022 at 1:24

@Akshay, Inf, -Inf, NaN, Zero, and -Zero all have distinct different encodings, and when applied to floating-point operations, the behaviour is well defined. In fact, NaN:s can be encoded in many different ways, where an implementation can impose meaning on the different variants. (The notation depends on the programming language that you are using.) – Dorren 5/8, 2022 at 9:0

In layman's terms, it's essentially scientific notation in binary. The formal standard (with details) is IEEE 754.

Hook answered 4/10, 2011 at 7:51 Comment(2)

+1 but a Wiki isn't a formal standard, it's at most the explanation of a formal standard :-) :-) – Nepos 4/10, 2011 at 7:53

And C doesn't require IEEE floating-point. – Anyaanyah 4/10, 2011 at 8:2

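#include <cstring>

// Bit-field order is implementation-defined; this order matches common
// little-endian compilers (e.g. GCC on x86).
typedef struct {
    unsigned int mantissa_low:32;
    unsigned int mantissa_high:20;
    unsigned int exponent:11;
    unsigned int sign:1;
} tDoubleStruct;

int main() {
    double a = 1.2;
    tDoubleStruct b;
    std::memcpy(&b, &a, sizeof b);  // copy the bytes; a reinterpret_cast here would break strict aliasing
    return 0;
}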

This is an example of how the memory is laid out if the compiler uses IEEE 754 double precision, which is the usual format for a C double, shown here for a little-endian system (e.g. Intel x86).

That is the layout expressed as a C-style bit-field struct; the Wikipedia article on double precision explains it in more detail.
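Because the order of bit-fields is implementation-defined, a sketch that avoids them may be easier to port. The version below extracts the same three fields from a double with shifts and masks on a 64-bit integer. DoubleFields and decode_double are illustrative names, and the code assumes that double is the 64-bit IEEE-754 format.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical struct with the three raw fields of a 64-bit double.
struct DoubleFields {
    std::uint64_t sign;      // 1 bit
    std::uint64_t exponent;  // 11 bits, biased by 1023
    std::uint64_t fraction;  // 52 bits, without the implicit leading 1
};

DoubleFields decode_double(double d) {
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "this sketch expects a 64-bit double");
    std::uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);  // copy the bytes to inspect them
    return DoubleFields{bits >> 63,
                        (bits >> 52) & 0x7FFu,
                        bits & 0xFFFFFFFFFFFFFull};
}
```

For 1.2 this gives sign 0, exponent field 1023 and fraction field 0x3333333333333.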

Venus answered 4/10, 2011 at 7:53 Comment(3)

That's one possibility, but not the only one. – Anyaanyah 4/10, 2011 at 8:2

Lindydancer mentioned that the significand has one more bit than is stored (except the value 0.0) in the IEEE spec. This is because the significand is normalised, that is, shifted left after a computation (and the exponent decreased) until its m.s. bit is 1. This is so that as many significant bits as possible are stored. But, since the m.s. bit is known to be a 1, it is not stored; the significand is normalised one bit further, and an extra "virtual" bit of storage is obtained. The significand is not 52 but 53 bits. – Indicatory 17/7, 2019 at 19:20

The C++ structure just shows the memory layout of a double on a little-endian system. The memory layout stores only 52 bits. – Venus 18/7, 2019 at 15:26

There are a number of different floating-point formats. Most of them share a few common characteristics: a sign bit, some bits dedicated to storing an exponent, and some bits dedicated to storing the significand (also called the mantissa).

The IEEE floating-point standard attempts to define a single format (or rather set of formats of a few sizes) that can be implemented on a variety of systems. It also defines the available operations and their semantics. It's caught on quite well, and most systems you're likely to encounter probably use IEEE floating-point. But other formats are still in use, as well as not-quite-complete IEEE implementations. The C standard provides optional support for IEEE, but doesn't mandate it.

Anyaanyah answered 4/10, 2011 at 8:1 Comment(0)

The mantissa represents the most significant bits of the number.

The exponent represents how many shifts are to be performed on the mantissa in order to get the actual value of the number.

The encoding specifies how the sign of the mantissa and the sign of the exponent are represented (the sign of the exponent basically determines whether the shift is to the left or to the right).
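The "shift" view can be sketched with std::ldexp, which multiplies a value by a power of two. The helper name apply_shift is made up for this example, and treating the mantissa as a plain integer is a simplification of the actual encoding.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: scales an integer mantissa by 2^shift.
// A positive shift moves the bits left, a negative one moves them right.
double apply_shift(std::uint32_t mantissa, int shift) {
    return std::ldexp(static_cast<double>(mantissa), shift);
}
```

For example, apply_shift(0xA00000, -21) is 5.0: the 24-bit pattern 1010 0000 ... 0000 shifted right by 21 places.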

The document you refer to specifies IEEE encoding, the most widely used.

Britannia answered 4/10, 2011 at 8:13 Comment(0)

I have found the article you referenced quite illegible (and I DO know a little how IEEE floats work). I suggest you try with the Wiki version of the explanation. It's quite clear and has various examples:

http://en.wikipedia.org/wiki/Single_precision and http://en.wikipedia.org/wiki/Double_precision

Nepos answered 4/10, 2011 at 8:21 Comment(0)

It is implementation defined, although IEEE-754 is the most common by far.

To be sure that IEEE-754 is used:

in C, use #ifdef __STDC_IEC_559__
in C++, use the std::numeric_limits<float>::is_iec559 constants
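A minimal sketch of the C++ check, assuming it is enough to test float and double (uses_ieee754 is a name invented here):

```cpp
#include <limits>

// True when both float and double report IEC 559 (IEEE 754) conformance.
constexpr bool uses_ieee754() {
    return std::numeric_limits<float>::is_iec559 &&
           std::numeric_limits<double>::is_iec559;
}
```

Since the function is constexpr, it can also be used in a static_assert to stop compilation on a platform that does not report IEEE-754 conformance.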

I've written some guides on IEEE-754 at:

Brownstone answered 15/6, 2015 at 15:35 Comment(0)
