Is using double faster than float?
Asked Answered

10

80

Double values store higher precision and are double the size of a float, but are Intel CPUs optimized for floats?

That is, are double operations just as fast or faster than float operations for +, -, *, and /?

Does the answer change for 64-bit architectures?

Saimon answered 6/8, 2010 at 17:23 Comment(2)
It depends what you are doing with them. In theory, memory bandwidth could come into it. Do you have any more information?Soninlaw
FYI a duplicate question here has some good information also.Fayalite
86

There isn't a single "Intel CPU", especially in terms of which operations are optimized relative to others! But on most of them, at the CPU level (specifically within the FPU), the answer to your question:

are double operations just as fast or faster than float operations for +, -, *, and /?

is "yes" -- within the CPU, except for division and sqrt which are somewhat slower for double than for float. (Assuming your compiler uses SSE2 for scalar FP math, like all x86-64 compilers do, and some 32-bit compilers depending on options. Legacy x87 doesn't have different widths in registers, only in memory (it converts on load/store), so historically even sqrt and division were just as slow for double).

For example, Haswell has a divsd throughput of one per 8 to 14 cycles (data-dependent), but a divss (scalar single) throughput of one per 7 cycles. x87 fdiv is 8 to 18 cycle throughput. (Numbers from https://agner.org/optimize/. Latency correlates with throughput for division, but is higher than the throughput numbers.)

The float versions of many library functions like logf(float) and sinf(float) will also be faster than log(double) and sin(double), because they have many fewer bits of precision to get right. They can use polynomial approximations with fewer terms to get full precision for float vs. double.
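As a quick illustration of picking the float variants (a minimal sketch; in C you would call sinf/logf directly, while in C++ the std:: overloads resolve on the argument type):

#include <cmath>
#include <cstdio>

int main() {
    float  sf = std::sin(0.5f);  // resolves to the float overload
    double sd = std::sin(0.5);   // resolves to the double overload
    printf("float  sin(0.5) = %.9g\n",  sf);   // only ~7 digits are meaningful
    printf("double sin(0.5) = %.17g\n", sd);   // ~16 digits are meaningful
    return 0;
}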


However, taking up twice the memory for each number clearly implies heavier load on the cache(s) and more memory bandwidth to fill and spill those cache lines from/to RAM; the time you care about performance of a floating-point operation is when you're doing a lot of such operations, so the memory and cache considerations are crucial.
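A minimal sketch (mine, not from this answer) of the kind of benchmark where that shows up: a streaming sum over an array too large for the caches, where double simply has to move twice as many bytes:

#include <chrono>
#include <cstdio>
#include <vector>

// Time summing n elements of type T; with n this large, the loop is
// limited by memory bandwidth rather than FP-add speed.
template <typename T>
double time_sum(std::size_t n) {
    std::vector<T> v(n, T(1));
    auto t0 = std::chrono::steady_clock::now();
    T sum = 0;
    for (T x : v) sum += x;
    auto t1 = std::chrono::steady_clock::now();
    // Print the sum so the compiler can't discard the loop. Note the float
    // total stalls at 2^24 = 16777216 -- itself a demo of float's precision.
    printf("sum = %g, ", (double)sum);
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    const std::size_t n = 50000000;  // ~200 MB as float, ~400 MB as double
    printf("float:  %.3f s\n", time_sum<float>(n));
    printf("double: %.3f s\n", time_sum<double>(n));
    return 0;
}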

@Richard's answer points out that there are also other ways to perform FP operations (the SSE / SSE2 instructions; good old MMX was integers-only), especially suitable for simple ops on lots of data ("SIMD", single instruction / multiple data), where each vector register can pack 4 single-precision floats or only 2 double-precision ones, so this effect will be even more marked.

In the end, you do have to benchmark, but my prediction is that for reasonable (i.e., large;-) benchmarks, you'll find an advantage in sticking with single precision (assuming of course that you don't need the extra bits of precision!-).

Gallaher answered 6/8, 2010 at 17:33 Comment(7)
This would also depend on the cache block size, correct? If your cache retrieves 64-bit or larger blocks, then a double would be just as efficient (if not faster) than a float, at least as far as memory reads/writes are concerned.Fiddlefaddle
@Razor If you work on exactly as many floats as fit in a cache line, then if you use doubles instead the CPU will have to fetch two cache lines. The caching effect I had in mind when reading Alex' answer however is: your set of floats fits in your nth-level cache but the corresponding set of doubles doesn't. In this case you will experience a big boost in performance if you use floats.Anthropology
@Peter, yeah, that makes sense: say you have a 32-bit cache line; using doubles would mean fetching twice every time.Fiddlefaddle
@Razor, the problem's not really with fetching/storing just one value -- it is, as @Peter's focus correctly indicates, that often you're fetching "several" values to operate on (an array of numbers would be a typical example, and operations on items of such arrays very common in numerical applications). There are counterexamples (e.g., a pointer-connected tree where each node only has one number and a lot of other stuff: then having that number be 4 or 8 bytes will matter pretty little), which is part of why I say that in the end you have to benchmark, but the idea often applies.Gallaher
@Alex Martelli, I see. That makes sense.Fiddlefaddle
I did ten iterations of a loop over std::vector<std::complex<float or double>>, size=10*1000*1000, filled by rand(), computing const auto p2 = x[i] * x[i]; const auto p4 = p2 * p2; const auto p8 = p4 * p4; y[i] = p8;. The float elapsed time was 0.95 seconds; the double elapsed time was 0.24 seconds.Fayalite
double add/sub/mul is as fast as float in modern x86 CPUs, but not div or sqrt. Double has somewhat worse latency and throughput. Floating point division vs floating point multiplicationChuffy
28

If all floating-point calculations are performed within the FPU, then, no, there is no difference between a double calculation and a float calculation because the floating point operations are actually performed with 80 bits of precision in the FPU stack. Entries of the FPU stack are rounded as appropriate to convert the 80-bit floating point format to the double or float floating-point format. Moving sizeof(double) bytes to/from RAM versus sizeof(float) bytes is the only difference in speed.

If, however, you have a vectorizable computation, then you can use the SSE extensions to run four float calculations in the same time as two double calculations. Therefore, clever use of the SSE instructions and the XMM registers can allow higher throughput on calculations that only use floats.
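A concrete sketch of that packing (using SSE2 intrinsics; an illustration of the idea, not code from this answer):

#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdio>

int main() {
    alignas(16) float  fa[4] = {1, 2, 3, 4}, fb[4] = {10, 20, 30, 40}, fr[4];
    alignas(16) double da[2] = {1, 2},       db[2] = {10, 20},         dr[2];

    // One addps does 4 float additions; one addpd does only 2 double additions.
    _mm_store_ps(fr, _mm_add_ps(_mm_load_ps(fa), _mm_load_ps(fb)));
    _mm_store_pd(dr, _mm_add_pd(_mm_load_pd(da), _mm_load_pd(db)));

    printf("4 float adds:  %g %g %g %g\n", fr[0], fr[1], fr[2], fr[3]);
    printf("2 double adds: %g %g\n", dr[0], dr[1]);
    return 0;
}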

Douce answered 6/8, 2010 at 18:0 Comment(0)
13

Another point to consider is whether you are using a GPU (the graphics card). I work on a project that is numerically intensive, yet we do not need the precision that double offers. We use GPU cards to help further speed up the processing. CUDA GPUs need a special package to support double, and the amount of local RAM on a GPU is quite fast but quite scarce. As a result, using float also doubles the amount of data we can store on the GPU.

Yet another point is memory. Floats take half as much RAM as doubles. If you are dealing with VERY large datasets, this can be a really important factor. If using double means you have to cache to disk instead of staying in pure RAM, the difference will be huge.

So for the application I am working with, the difference is quite important.

Athanasian answered 6/8, 2010 at 18:6 Comment(0)
12

I just want to add to the already existing great answers that the __m256? family of single-instruction-multiple-data (SIMD) C++ intrinsic functions operates on either 4 doubles in parallel (e.g. _mm256_add_pd) or 8 floats in parallel (e.g. _mm256_add_ps).

I'm not sure if this can translate to an actual speed up, but it seems possible to process 2x as many floats per instruction when SIMD is used.
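Here is a small self-contained sketch of those intrinsics (compile with -mavx; this only illustrates the register widths, it is not a benchmark):

#include <immintrin.h>  // AVX intrinsics
#include <cstdio>

int main() {
    // One instruction adds 8 floats, or only 4 doubles, per 256-bit register.
    __m256  f = _mm256_add_ps(_mm256_set1_ps(1.5f), _mm256_set1_ps(2.5f));
    __m256d d = _mm256_add_pd(_mm256_set1_pd(1.5),  _mm256_set1_pd(2.5));

    alignas(32) float  fout[8];
    alignas(32) double dout[4];
    _mm256_store_ps(fout, f);
    _mm256_store_pd(dout, d);
    printf("8 float lanes:  %g ... %g\n", fout[0], fout[7]);
    printf("4 double lanes: %g ... %g\n", dout[0], dout[3]);
    return 0;
}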

Dialyser answered 14/10, 2012 at 1:35 Comment(0)
10

In an experiment adding 3.3 two billion (2000000000) times, the results are:

Summation time in s: 2.82 summed value: 6.71089e+07 // float
Summation time in s: 2.78585 summed value: 6.6e+09 // double
Summation time in s: 2.76812 summed value: 6.6e+09 // long double
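
The answer doesn't include its code; a minimal sketch consistent with the printed output (my reconstruction, not the original program) would be:

#include <chrono>
#include <cstdio>

template <typename T>
void run(const char* name) {
    auto t0 = std::chrono::steady_clock::now();
    T sum = 0;
    for (long long i = 0; i < 2000000000LL; ++i)
        sum += (T)3.3;
    auto t1 = std::chrono::steady_clock::now();
    printf("Summation time in s: %g summed value: %g // %s\n",
           std::chrono::duration<double>(t1 - t0).count(), (double)sum, name);
}

int main() {
    run<float>("float");
    run<double>("double");
    run<long double>("long double");
    return 0;
}

Note how the float total stalls near 6.71e+07 instead of reaching 6.6e+09: once the sum hits 2^26, adding 3.3 is less than half an ulp and rounds away, which is exactly the precision loss the quote below warns about.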

So double is faster and the default in C and C++. It's more portable and the default across all C and C++ library functions. Also, double has significantly higher precision than float.

Even Stroustrup recommends double over float:

"The exact meaning of single-, double-, and extended-precision is implementation-defined. Choosing the right precision for a problem where the choice matters requires significant understanding of floating-point computation. If you don't have that understanding, get advice, take the time to learn, or use double and hope for the best."

Perhaps the only case where you should use float instead of double is on 64-bit hardware with a modern gcc, because float is smaller: double is 8 bytes and float is 4 bytes.

Morava answered 18/3, 2012 at 18:20 Comment(4)
+1 for making the effort to do some timings. But Stroustrup doesn't recommend using 'double' because it's faster, but because of the extra precision. Regarding your last comment, if you need that extra precision more than saving memory, then it's quite possible you'd want to use 'double' on 32-bit hardware. And that leads back to the question: Is double faster than float even on 32-bit hardware with a modern FPU that does 64-bit computations?Saimon
A few hundredths of a second difference feels like it's still within the realm of experimental error. Especially if there's other stuff too (like maybe a not-unrolled loop . . .).Avionics
It's quite a stretch to say that Stroustrup is recommending double there when he is actually recommending to RTFM.Ante
What hardware, what compiler + options, what code? If you timed all 3 in the same program, clock-speed ramp-up time explains the first being slower. Clearly you didn't enable auto-vectorization (impossible for a reduction without -ffast-math or whatever, because FP math isn't strictly associative). So this only proves that there's no speed difference when the bottleneck is scalar FP add latency. The bit about 64-bit hardware makes no sense either: float is always half the size of double on any normal hardware. The only difference on 64-bit hardware is that x86-64 has SSE2 as a baseline.Chuffy
7

The only really useful answer is: only you can tell. You need to benchmark your scenarios. Small changes in instruction and memory patterns could have a significant impact.

It will certainly matter if you are using the FPU or SSE-type hardware (the former does all its work with 80-bit extended precision, so double will be closer; the latter is natively 32-bit, i.e. float).

Update: s/MMX/SSE/ as noted in another answer.

Randa answered 6/8, 2010 at 17:27 Comment(0)
5

Alex Martelli's answer is good enough, but I want to mention a wrong but somewhat popular test method that may have misled some people:

#include <cstdio>
#include <ctime>
int main() {
  const auto start_clock = clock();
  float a = 0;
  for (int i = 0; i < 256000000; i++) {
    // bad latency benchmark that includes as much division as other operations
    a += 0.11;  // note the implicit conversions of a to double to match 0.11
    a -= 0.13;  // rather than 0.11f
    a *= 0.17;
    a /= 0.19;
  }
  printf("c++ float duration = %.3f\n", 
    (double)(clock() - start_clock) / CLOCKS_PER_SEC);
  printf("%.3f\n", a);
  return 0;
}

It's wrong! C++ uses double by default for floating-point literals; if you replace += 0.11 with += 0.11f (and likewise for the other constants), float will usually be faster than double on x86 CPUs.
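For clarity, here is the corrected version (the only change is the f suffix on each constant, so a stays a float throughout):

#include <cstdio>
#include <ctime>
int main() {
  const auto start_clock = clock();
  float a = 0;
  for (int i = 0; i < 256000000; i++) {
    a += 0.11f;  // float literals: no conversion of a to double and back
    a -= 0.13f;
    a *= 0.17f;
    a /= 0.19f;
  }
  printf("c++ float duration = %.3f\n",
    (double)(clock() - start_clock) / CLOCKS_PER_SEC);
  printf("%.3f\n", a);
  return 0;
}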

By the way, with the modern SSE instruction set, float and double have the same speed in the CPU core itself, except for division. float, being smaller, may cause fewer cache misses if you have arrays of them.

And if the compiler can auto-vectorize, float vectors work on twice as many elements per instruction as double.

Impudicity answered 1/9, 2021 at 17:12 Comment(0)
1

Previous answers are missing a factor that can cause a big difference (> 4x) between float and double: denormals (see Avoiding denormal values in C++). Since double has a much wider normal range, for a specific problem that involves many small values, there is a much higher probability of falling into the denormal range with float than with double, so float could be much slower than double in this case.
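A minimal sketch of the effect (my example; it assumes an x86 CPU where subnormal multiplies take a microcode assist, and uses the FTZ/DAZ bits from <xmmintrin.h>/<pmmintrin.h> as the usual mitigation):

#include <chrono>
#include <cstdio>
#include <pmmintrin.h>  // also pulls in <xmmintrin.h>

static void time_mul(float value, const char* label) {
    volatile float x = value;  // volatile: force the multiply on every iteration
    float sum = 0.0f;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < 50000000; ++i)
        sum += x * 0.5f;       // subnormal input and result when value is tiny
    auto t1 = std::chrono::steady_clock::now();
    printf("%-12s sum=%g  %.3f s\n", label,
           sum, std::chrono::duration<double>(t1 - t0).count());
}

int main() {
    time_mul(1.0f,   "normal");
    time_mul(1e-39f, "subnormal");  // well below FLT_MIN (~1.18e-38)
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         // flush subnormal results to 0
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); // treat subnormal inputs as 0
    time_mul(1e-39f, "subnorm+FTZ");
    return 0;
}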

Livy answered 30/9, 2022 at 2:36 Comment(0)
0

Floating point is normally an extension to one's general-purpose CPU. The speed will therefore depend on the hardware platform used. If the platform has floating-point support, I would be surprised if there is any difference.

Secrest answered 6/8, 2010 at 17:33 Comment(0)
0

In addition, some real data from a benchmark to give a glimpse:

For Intel 3770k, GCC 9.3.0 -O2 [3]
Run on (8 X 3503 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
BM_FloatCreation               0.281 ns        0.281 ns   1000000000
BM_DoubleCreation              0.284 ns        0.281 ns   1000000000
BM_Vector3FCopy                0.558 ns        0.562 ns   1000000000
BM_Vector3DCopy                 5.61 ns         5.62 ns    100000000
BM_Vector3F_CopyDefault        0.560 ns        0.546 ns   1000000000
BM_Vector3D_CopyDefault         5.57 ns         5.56 ns    112178768
BM_Vector3F_Copy123            0.841 ns        0.817 ns    897430145
BM_Vector3D_Copy123             5.59 ns         5.42 ns    112178768
BM_Vector3F_Add                0.841 ns        0.834 ns    897430145
BM_Vector3D_Add                 5.59 ns         5.46 ns    100000000
BM_Vector3F_Mul                0.842 ns        0.782 ns    897430145
BM_Vector3D_Mul                 5.60 ns         5.56 ns    112178768
BM_Vector3F_Compare            0.840 ns        0.800 ns    897430145
BM_Vector3D_Compare             5.61 ns         5.62 ns    100000000
BM_Vector3F_ARRAY_ADD           3.25 ns         3.29 ns    213673844        
BM_Vector3D_ARRAY_ADD           3.13 ns         3.06 ns    224357536        

where operations on 3 floats (F) or 3 doubles (D) are compared:

  • BM_Vector3XCopy is the pure copy of a (1,2,3)-initialized vector, not repeated before copy,
  • BM_Vector3X_CopyDefault is the copy with default initialization repeated before every copy,
  • BM_Vector3X_Copy123 is the copy with repeated initialization of (1,2,3),

  • Add/Mul each initialize 3 vectors (1,2,3) and add/multiply the first and second into the third,
  • Compare checks for equality of two initialized vectors,

  • ARRAY_ADD sums up vector(1,2,3) + vector(3,4,5) + vector(6,7,8) via std::valarray, which in my case leads to SSE instructions.

Remember that these are isolated tests and the results differ with compiler settings, from machine to machine or architecture to architecture. With caching (issues) and real-world use cases this may be completely different. So the theory can greatly differ from reality. The only way to find out is a practical test, such as with google-benchmark [1], and checking the compiler's output for your particular problem's solution [2].
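For reference, a hedged sketch (mine, not the linked code [3]) of how such a case looks with google-benchmark [1]:

#include <benchmark/benchmark.h>
#include <array>

template <typename T>
static void BM_Vector3_Add(benchmark::State& state) {
    for (auto _ : state) {
        std::array<T, 3> a{1, 2, 3}, b{4, 5, 6}, c{};
        for (int i = 0; i < 3; ++i) c[i] = a[i] + b[i];
        benchmark::DoNotOptimize(c);  // keep the result observable
    }
}
BENCHMARK_TEMPLATE(BM_Vector3_Add, float);
BENCHMARK_TEMPLATE(BM_Vector3_Add, double);
BENCHMARK_MAIN();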

  1. https://github.com/google/benchmark
  2. https://sourceware.org/binutils/docs/binutils/objdump.html -> objdump -S
  3. https://github.com/Jedzia/oglTemplate/blob/dd812b72d846ae888238d6f726d503485b796b68/benchmark/Playground/BM_FloatingPoint.cpp
Overrun answered 21/3, 2020 at 10:33 Comment(14)
Did you choose sizes that make float fit in some level of cache while double doesn't? If you were just bound on memory bandwidth in the same level of cache, you'd expect a simple factor of 2 difference in most. Or are more of those results for a single "vector" of 3 values stored contiguously, not in a SIMD-friendly way, and not amortized over a large array? So what kind of terrible asm did GCC make that led to copy taking a couple cycles for 3 floats but 10x that for 3 doubles?Chuffy
It's a very good observation, Peter. All theoretical explanations here are valid and good to know. My results are a special case of one setup out of many possible solutions. My point isn't how horrible my solution may be, but that in practice there are too many unknowns and you have to test your particular use-case to be sure. I appreciate your analysis. This helps me:) But let's focus on the question asked by the OP.Overrun
Ok, that's fair, demoing the fact that compilers can totally suck for no apparent reason when you change float to double is interesting. You should maybe point out that that's what your answer shows, not any fundamental issue or general case.Chuffy
The guilty one here is me, of course, with my devilish use of "volatile". The compiler has no chance to optimize anything, which was also my goal for this special case. So don't judge GCC too hard:)Overrun
To add some backstory: I was just as curious as the OP. Does using double instead of float make a difference? How I read the results: the first ones are too isolated, and only the last two indicate what to expect in a real-world case -> no difference, in my special case. Thanks to Corona I had the time to go down this rabbit-hole. This kind of investigation can add many hours and you have to decide on your own if it is practical. Let's say for an FPS improvement from 999 to 1177...Overrun
That's definitely getting into "irrelevant" territory, then. You only used volatile on the final result, so GCC could just compile them all to stores of compile-time-constant results. You didn't include the asm, and your code depends on some headers so it's not easy to look at how it compiled on godbolt.org; I guess I could clone your repo and compile it locally if I really wanted to, but IMO it's up to you at this point to demonstrate that your results mean anything.Chuffy
Your BM_DoubleCreation (and float) give us a baseline of an empty loop presumably running at 1 cycle per iteration; a volatile with no initializer still optimizes to zero asm instructions with GCC and clang.Chuffy
what to expect in a real world case - in many real world cases, you expect a factor of 2 from either memory bandwidth and/or being able to compute twice as many elements per SIMD vector. (addps and addpd have equal throughput for 16 bytes of FP data, but the ps version is 4 elements instead of 2.) So doing anything with SIMD-friendly arrays can usually benefit. See deplinenoise.wordpress.com/2015/03/06/… for more about SIMD-friendly data layout, i.e. arrays of x[], y[], z[], not packed xyz groups. stackoverflow.com/tags/sse/infoChuffy
Wonderful addition, Compiler Explorer is much more accessible and can provide a quick overview of simpler problems.Overrun
If you weren't aware of Godbolt, see How to remove "noise" from GCC/clang assembly output? for how to get simple readable optimized asm for a small function.Chuffy
Thanks, i was aware of it and him. Again my point: "Measure it, then you know it. Here are some tools, use it. It may look like this."Overrun
By the way, Peter: you are falling into a tirade of conclusions that may or may not apply to my example. I keep it general, and my point is: measurement is knowledge! And I give an example of how one can measure. That we discovered that something is wrong with the "volatile" is just the beauty of it. The measured values provide information. Better than guessing, isn't it? I don't think that this is irrelevant at all. On the contrary. And that's a personal opinion. To devalue others for this is immature.Overrun
The question of the thread covers a huge area of possible CPUs and even raises the question of what it is like with other architectures, such as ARM, MCUs, etc. My answer to that: Don't ask, measure it yourself.Overrun
Better than guessing, isn't it? - yes, but only if you check what the compiler did so you know what you're measuring. Making conclusions based on microbenchmarks that measured something completely different from what you intended can be worse than realizing that something is unknown. But unfortunately that's all too easy when you need compilers to optimize like normal except for still doing some redundant work.Chuffy
