There isn't a single "Intel CPU", especially in terms of which operations are optimized relative to others, but most of them, at the CPU level (specifically within the FPU), are such that the answer to your question:
"are double operations just as fast or faster than float operations for +, -, *, and /?"
is "yes" within the CPU, except for division and sqrt, which are somewhat slower for double than for float. (This assumes your compiler uses SSE2 for scalar FP math, as all x86-64 compilers do, and as some 32-bit compilers do depending on options. Legacy x87 doesn't have different widths in registers, only in memory (it converts on load/store), so historically sqrt and division on float were just as slow as on double.)
For example, Haswell has a divsd (scalar double) throughput of one per 8 to 14 cycles (data-dependent), but a divss (scalar single) throughput of one per 7 cycles. x87 fdiv has a throughput of one per 8 to 18 cycles. (Numbers from https://agner.org/optimize/. Latency correlates with throughput for division, but is higher than the throughput numbers.)
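Whether the SSE2 assumption above holds for a given build can be checked with the C99 macro FLT_EVAL_METHOD from <float.h>. The following is only a minimal sketch: the helper names describe_eval_method and print_eval_method are made up for illustration, and the mapping from macro values to "SSE2" or "x87" describes what mainstream x86 compilers typically report, not a guarantee.

```c
#include <float.h>
#include <stdio.h>

/* Map a C99 FLT_EVAL_METHOD value to a short description.
   (Hypothetical helper; the x87/SSE2 hints are typical, not guaranteed.) */
const char *describe_eval_method(int method)
{
    switch (method) {
    case 0:
        return "each type evaluated in its own precision (typical of SSE2 scalar math)";
    case 1:
        return "float evaluated as double";
    case 2:
        return "float and double evaluated as long double (typical of legacy x87)";
    default:
        return "indeterminable or implementation-defined";
    }
}

/* Print what the current compiler and flags report. */
void print_eval_method(void)
{
    int method = (int)FLT_EVAL_METHOD;
    printf("FLT_EVAL_METHOD = %d: %s\n", method, describe_eval_method(method));
}
```

Calling print_eval_method() from main in an x86-64 build would normally report 0; a 32-bit build that still uses x87 would normally report 2.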
The float versions of many library functions like logf(float) and sinf(float) will also be faster than log(double) and sin(double), because they have many fewer bits of precision to get right. They can use polynomial approximations with fewer terms to get full precision for float vs. double.
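To make the "fewer terms" point concrete, here is a rough sketch that counts how many Taylor-series terms of sin(x) are needed before the first omitted term falls below a given tolerance. This is an illustration only: real math libraries use range reduction and minimax polynomials rather than a raw Taylor series, and the function name taylor_terms_needed is invented for this example.

```c
#include <float.h>

/* Count the Taylor terms of sin(x) = x - x^3/3! + x^5/5! - ... that must be
   kept so that the first omitted term is at most tol in magnitude at x_max.
   Assumes x_max >= 0 and tol > 0. Illustration only, not how libm works. */
int taylor_terms_needed(double x_max, double tol)
{
    double term = x_max;   /* magnitude of the first term, x */
    int terms = 0;

    while (term > tol) {
        terms++;
        /* Next term magnitude: previous * x^2 / ((2k)(2k + 1)). */
        term *= x_max * x_max / ((2.0 * terms) * (2.0 * terms + 1.0));
    }
    return terms;
}
```

With x_max set to pi/4 (0.7853981633974483), taylor_terms_needed(x_max, FLT_EPSILON) gives 5 terms, while taylor_terms_needed(x_max, DBL_EPSILON) gives 8.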
However, taking up twice the memory for each number clearly implies a heavier load on the cache(s) and more memory bandwidth to fill and spill those cache lines from/to RAM. The time you care about the performance of a floating-point operation is when you're doing a lot of such operations, so the memory and cache considerations are crucial.
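The footprint difference is easy to quantify. The sketch below assumes a 64-byte cache line (common on x86, but an assumption here), arrays that start on a cache-line boundary, and the usual 4-byte float and 8-byte double; the macro and function names are made up for the example.

```c
#include <stddef.h>

/* Assumed cache-line size in bytes; 64 is common on x86 but not universal. */
#define CACHE_LINE_BYTES 64

/* How many elements of a given size fit in one cache line. */
size_t elems_per_line(size_t elem_size)
{
    return CACHE_LINE_BYTES / elem_size;
}

/* How many cache lines an array of n elements spans, assuming the array
   starts on a cache-line boundary. */
size_t lines_touched(size_t n, size_t elem_size)
{
    return (n * elem_size + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}
```

Under those assumptions, a million floats span 62500 cache lines and a million doubles span 125000, so a pass over the double array moves twice as much data through the memory hierarchy.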
@Richard's answer points out that there are also other ways to perform FP operations (the SSE / SSE2 instructions; good old MMX was integer-only). These are especially suitable for simple ops on lots of data ("SIMD", single instruction / multiple data), where each 128-bit vector register can pack 4 single-precision floats but only 2 double-precision ones, so this effect will be even more marked.
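As a small illustration of the packing difference, the following sketch uses SSE/SSE2 intrinsics from <emmintrin.h>. It assumes an x86 target with SSE2 enabled (the default on x86-64); the wrapper names add4_float and add2_double are invented for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones); x86 only */

/* One packed-single add (addps) processes 4 floats at once. */
void add4_float(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);          /* load 4 floats, unaligned OK */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

/* One packed-double add (addpd) processes only 2 doubles. */
void add2_double(const double *a, const double *b, double *out)
{
    __m128d va = _mm_loadu_pd(a);         /* load 2 doubles, unaligned OK */
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_pd(va, vb));
}
```

Each function issues a single vector add, so for the same number of instructions the float version handles twice as many elements.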
In the end, you do have to benchmark, but my prediction is that for reasonable (i.e., large;-) benchmarks, you'll find an advantage to sticking with single precision (assuming, of course, that you don't need the extra bits of precision!-).
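A benchmark along those lines might start from something like the sketch below. It is deliberately crude: it uses clock() for timing, a simple in-place multiply-add as the workload, and invented function names (bench_scale_float, bench_scale_double, compare_float_double). Results will depend heavily on array size, compiler flags, and the machine, so treat it as a starting point rather than a measurement method.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Update n floats in place; returns elapsed CPU time in seconds. */
double bench_scale_float(float *data, size_t n)
{
    clock_t start = clock();
    for (size_t i = 0; i < n; i++)
        data[i] = data[i] * 2.0f + 1.0f;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Same workload on doubles. */
double bench_scale_double(double *data, size_t n)
{
    clock_t start = clock();
    for (size_t i = 0; i < n; i++)
        data[i] = data[i] * 2.0 + 1.0;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Hypothetical driver: runs both versions on n elements and prints timings.
   Returns 0 on success, -1 if allocation fails. */
int compare_float_double(size_t n)
{
    float *f = malloc(n * sizeof *f);
    double *d = malloc(n * sizeof *d);
    if (f == NULL || d == NULL) {
        free(f);
        free(d);
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        f[i] = 1.0f;
        d[i] = 1.0;
    }

    double tf = bench_scale_float(f, n);
    double td = bench_scale_double(d, n);

    /* Print one element of each result so the work is observably used. */
    printf("float:  %.3f s (f[0] = %g)\n", tf, f[0]);
    printf("double: %.3f s (d[0] = %g)\n", td, d[0]);

    free(f);
    free(d);
    return 0;
}
```

Calling compare_float_double with an n large enough that the arrays do not fit in cache (tens of millions of elements, say) is where the memory-bandwidth difference should start to show; with small n both versions may look identical.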