I am reading "accelerated C++". I found one sentence which states "sometimes double
is faster in execution than float
in C++". After reading sentence I got confused about float
and double
working. Please explain this point to me.
Depends on what the native hardware does.
If the hardware is (or is like) x86 with legacy x87 math, float and double are both extended (for free) to an internal 80-bit format, so both have the same performance (except for cache footprint / memory bandwidth).
If the hardware implements both natively, like most modern ISAs (including x86-64 where SSE2 is the default for scalar FP math), then usually most FPU operations are the same speed for both. Double division and sqrt can be slower than float, as well as of course being significantly slower than multiply or add. (Float being smaller can mean fewer cache misses. And with SIMD, twice as many elements per vector for loops that vectorize).
If the hardware implements only double, then float will be slower if conversion to/from the native double format isn't free as part of float-load and float-store instructions.
If the hardware implements float only, then emulating double with it will cost even more time. In this case, float will be faster.
And if the hardware implements neither, both have to be implemented in software. In this case, both will be slow, but double will be slightly slower (more load and store operations at the least).
The quote you mention is probably referring to the x86 platform, where the first case was given. But this doesn't hold true in general.
Also beware that x * 3.3 + y for float x, y will trigger promotion to double for both variables. This is not the hardware's fault, and you should avoid it by writing 3.3f to let your compiler make efficient asm that actually keeps numbers as floats, if that's what you want.
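For example, here is a minimal sketch of that difference (the function names are made up for illustration):

#include <cstdio>

// x * 3.3 + y: 3.3 is a double constant, so x and y are converted to
// double, the math is done in double, and the result is narrowed back
// to float on return.
float scale_promoted(float x, float y) {
    return x * 3.3 + y;
}

// x * 3.3f + y: the whole expression stays in float, so the compiler
// can use single-precision instructions throughout.
float scale_float(float x, float y) {
    return x * 3.3f + y;
}

int main() {
    std::printf("%f %f\n", scale_promoted(1.5f, 2.0f), scale_float(1.5f, 2.0f));
    return 0;
}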
Some implementations treat long double as a 64-bit type, but if you need more than 64-bit precision and don't need 128-bit double-double, 80-bit x87 long double is by far the fastest option. – Tomfool
On x86, hardware support for float came with SSE and for double with SSE2, and SSE2 is baseline for x86-64. Modern x86 CPUs have SIMD with the same performance per vector for float or double add/mul/FMA (thus twice the FLOPS for float because of twice the elements per vector). Mysticial has a detailed answer on "How do I achieve the theoretical maximum of 4 FLOPs per cycle?". double division / sqrt is slower than float; see "Floating point division vs floating point multiplication". – Tomfool
You can find a complete answer in this article:
What Every Computer Scientist Should Know About Floating-Point Arithmetic
This is a quote from a previous Stack Overflow thread, about how float and double variables affect memory bandwidth:
If a double requires more storage than a float, then it will take longer to read the data. That's the naive answer. On a modern IA32, it all depends on where the data is coming from. If it's in L1 cache, the load is negligible provided the data comes from a single cache line. If it spans more than one cache line there's a small overhead. If it's from L2, it takes a while longer, if it's in RAM then it's longer still and finally, if it's on disk it's a huge time. So the choice of float or double is less important than the way the data is used. If you want to do a small calculation on lots of sequential data, a small data type is preferable. Doing a lot of computation on a small data set would allow you to use bigger data types without any significant effect. If you're accessing the data very randomly, then the choice of data size is unimportant - data is loaded in pages / cache lines. So even if you only want a byte from RAM, you could get 32 bytes transferred (this is very dependent on the architecture of the system). On top of all of this, the CPU/FPU could be super-scalar (aka pipelined). So, even though a load may take several cycles, the CPU/FPU could be busy doing something else (a multiply for instance) that hides the load time to a degree.
Short answer is: it depends.
A CPU with x87 will crunch floats and doubles equally fast. Vectorized code will run faster with floats, because SSE can crunch 4 floats or 2 doubles in one pass.
Another thing to consider is memory speed. Depending on your algorithm, your CPU could be idling a lot while waiting for the data. Memory intensive code will benefit from using floats, but ALU limited code won't (unless it is vectorized).
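To make the memory-bandwidth point concrete, here is a rough sketch (not a rigorous benchmark); the array sizes and the use of std::accumulate are my own illustration, not from the answer:

#include <cstddef>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    // 50 million elements: ~200 MB as float, ~400 MB as double, so a
    // streaming pass over the double array moves twice as many bytes.
    const std::size_t n = 50000000;

    std::vector<float>  f(n, 1.0f);
    std::vector<double> d(n, 1.0);

    // Accumulate in double either way so only the memory traffic differs.
    double fsum = std::accumulate(f.begin(), f.end(), 0.0);
    double dsum = std::accumulate(d.begin(), d.end(), 0.0);

    std::printf("float array:  %zu bytes\n", n * sizeof(float));
    std::printf("double array: %zu bytes\n", n * sizeof(double));
    std::printf("sums: %f %f\n", fsum, dsum);
    return 0;
}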
I can think of a few basic cases when doubles are faster than floats:
- Your hardware supports double operations but not float operations, so floats will be emulated in software and will therefore be slower.
- You really need the precision of doubles. If you use floats anyway, you will have to pair up two floats to reach similar precision to double, and emulating a true double with pairs of floats costs several float operations per arithmetic operation, so it will be slower than simply using doubles in the first place (a minimal sketch of this float-pair technique follows this list).
- You do not necessarily need doubles, but your numeric algorithm converges faster due to the enhanced precision of doubles. Also, doubles might offer enough precision to make a faster but numerically less stable algorithm usable at all.
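As an aside on that float-pair emulation, here is a minimal sketch of Knuth's TwoSum building block, from which float-pair ("double-float") arithmetic is built; the struct and function names are my own, and it assumes strict IEEE float arithmetic (no -ffast-math):

#include <cstdio>

// Knuth's TwoSum: sum + err equals a + b exactly (given strict IEEE
// float arithmetic). One extended-precision add already costs six
// float operations versus a single double add.
struct FloatPair { float sum; float err; };

FloatPair two_sum(float a, float b) {
    float s   = a + b;
    float bp  = s - a;
    float err = (a - (s - bp)) + (b - bp);
    return { s, err };
}

int main() {
    // 1.0f is lost in a plain float add at this magnitude, but the
    // error term recovers it.
    FloatPair r = two_sum(100000000.0f, 1.0f);
    std::printf("sum = %.1f, recovered error = %.1f\n", r.sum, r.err);
    return 0;
}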
For completeness' sake, I also give some reasons for the opposite case of floats being faster. You can see for yourself which reasons dominate in your case:
Floats are faster than doubles when you don't need double's precision and you are memory-bandwidth bound and your hardware doesn't carry a penalty on floats.
They conserve memory-bandwidth because they occupy half the space per number.
There are also platforms that can process more floats than doubles in parallel.
On Intel, the coprocessor (nowadays integrated) will handle both equally fast, but as some others have noted, doubles result in higher memory bandwidth which can cause bottlenecks. If you're using scalar SSE instructions (default for most compilers on 64-bit), the same applies. So generally, unless you're working on a large set of data, it doesn't matter much.
However, parallel SSE instructions will allow four floats to be handled in one instruction, but only two doubles, so here float can be significantly faster.
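As an illustration of that width difference, here is a minimal sketch using the SSE/SSE2 intrinsics from <immintrin.h>; it assumes an x86 or x86-64 compiler and only shows that one 128-bit packed add covers four float lanes but two double lanes:

#include <immintrin.h>   // SSE/SSE2 intrinsics (x86 / x86-64 only)
#include <cstdio>

int main() {
    alignas(16) float  fa[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    alignas(16) float  fb[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    alignas(16) double da[2] = {1.0, 2.0};
    alignas(16) double db[2] = {10.0, 20.0};

    // One 128-bit register holds 4 floats but only 2 doubles, so a single
    // packed add does twice as many float lanes as double lanes.
    __m128  vf = _mm_add_ps(_mm_load_ps(fa), _mm_load_ps(fb));
    __m128d vd = _mm_add_pd(_mm_load_pd(da), _mm_load_pd(db));

    alignas(16) float  fr[4];
    alignas(16) double dr[2];
    _mm_store_ps(fr, vf);
    _mm_store_pd(dr, vd);

    std::printf("4 float adds:  %g %g %g %g\n", fr[0], fr[1], fr[2], fr[3]);
    std::printf("2 double adds: %g %g\n", dr[0], dr[1]);
    return 0;
}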
In an experiment that adds 3.3 two billion (2,000,000,000) times, the results are (a minimal sketch of such a loop appears at the end of this answer):
Summation time in s: 2.82 summed value: 6.71089e+07 // float
Summation time in s: 2.78585 summed value: 6.6e+09 // double
Summation time in s: 2.76812 summed value: 6.6e+09 // long double
So double is faster and is the default in C and C++. It's more portable and the default across all C and C++ library functions. Also, double has significantly higher precision than float.
Even Stroustrup recommends double over float:
"The exact meaning of single-, double-, and extended-precision is implementation-defined. Choosing the right precision for a problem where the choice matters requires significant understanding of floating-point computation. If you don't have that understanding, get advice, take the time to learn, or use double and hope for the best."
Perhaps the only case where you should use float instead of double is on 64-bit hardware with a modern gcc, because float is smaller: double is 8 bytes and float is 4 bytes.
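For reference, a minimal sketch of a summation benchmark in this spirit, using std::chrono (the timer choice, function names, and output format are mine, not the answerer's); as the comments below point out, a serious benchmark would also add warm-up runs:

#include <chrono>
#include <cstdio>

// Repeatedly add 3.3 (or 3.3f) and time it. The iteration count and the
// constant come from the answer above; everything else is illustrative.
template <typename T>
void time_sum(const char* name) {
    const long long iterations = 2000000000LL;
    const T value = static_cast<T>(3.3);
    T sum = 0;

    auto start = std::chrono::steady_clock::now();
    for (long long i = 0; i < iterations; ++i)
        sum += value;
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double> elapsed = stop - start;
    // Printing the sum keeps the compiler from optimizing the loop away.
    std::printf("%-12s time: %.3f s  summed value: %g\n",
                name, elapsed.count(), static_cast<double>(sum));
}

int main() {
    time_sum<float>("float");
    time_sum<double>("double");
    time_sum<long double>("long double");
    return 0;
}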
Something looks off with the float test, probably because you didn't include any "warm up" in your benchmark (see "Idiomatic way of performance evaluation?"). float should be the same speed as double on a normal C++ implementation on x86 or ARM or whatever, unless you do it wrong and write the float version in a way that has to convert to double and back, because in C++ 3.3 is a double constant, unlike 3.3f. But if that were the case, you'd expect a bigger slowdown. – Tomfool
There are reasons to use double, just that being faster is not one of them. (Unless you misuse C++ and make the compiler convert to double and back by writing things like x * 3.3 + y.) – Tomfool
float is usually faster. double offers greater precision. However, performance may vary in some cases if special processor extensions such as 3DNow! or SSE are used.
There is only one reason 32-bit floats can be slower than 64-bit doubles (or 80-bit values on the 80x87), and that is alignment. Other than that, floats take less memory, which generally means faster access and better cache performance. It also takes fewer cycles to process 32-bit instructions. And even when the (co)processor has no 32-bit instructions, it can perform them on 64-bit registers at the same speed. It is probably possible to create a test case where doubles are faster than floats, and vice versa, but my measurements of real statistics algorithms didn't show a noticeable difference.
80-bit x87 loads and stores (fld / fstp with a ten-byte operand) are the exception: on Skylake, for example, fstp tbyte is 7 uops with a throughput of 1 per 5 cycles, vs. 1 uop and 1 per clock for normal float/double stores. See this answer for more. – Tomfool