Why is operating on Float64 faster than Float16?

I wonder why operating on Float64 values is faster than operating on Float16:

julia> rnd64 = rand(Float64, 1000);

julia> rnd16 = rand(Float16, 1000);

julia> @benchmark rnd64.^2
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.800 μs … 662.140 μs  ┊ GC (min … max):  0.00% … 99.37%
 Time  (median):     2.180 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   3.457 μs ±  13.176 μs  ┊ GC (mean ± σ):  12.34% ±  3.89%

  ▁██▄▂▂▆▆▄▂▁ ▂▆▄▁                                     ▂▂▂▁   ▂
  ████████████████▇▇▆▆▇▆▅▇██▆▆▅▅▆▄▄▁▁▃▃▁▁▄▁▃▄▁▃▁▄▃▁▁▆▇██████▇ █
  1.8 μs       Histogram: log(frequency) by time      10.6 μs <

 Memory estimate: 8.02 KiB, allocs estimate: 5.

julia> @benchmark rnd16.^2
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.117 μs … 587.133 μs  ┊ GC (min … max): 0.00% … 98.61%
 Time  (median):     5.383 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.716 μs ±   9.987 μs  ┊ GC (mean ± σ):  3.01% ±  1.71%

    ▃▅█▇▅▄▄▆▇▅▄▁             ▁                                ▂
  ▄██████████████▇▆▇▆▆▇▆▇▅█▇████▇█▇▇▆▅▆▄▇▇▆█▇██▇█▇▇▇▆▇▇▆▆▆▆▄▄ █
  5.12 μs      Histogram: log(frequency) by time      7.48 μs <

 Memory estimate: 2.14 KiB, allocs estimate: 5.

You may ask why I expected the opposite: because Float16 values have less floating-point precision:

julia> rnd16[1]
Float16(0.627)

julia> rnd64[1]
0.4375452455597999

Shouldn't calculations with less precision run faster? And if not, why would anyone use Float16 at all? They might as well use Float128!

Bussard asked 6/12, 2022 at 14:6 Comment(4)
There's hardware support for 32 & 64, but I think Float16 is converted before most operations: docs.julialang.org/en/v1/manual/… . On ARM processors (like an M1 mac) there is some support, e.g. @btime $(similar(rnd16)) .= 2 .* $rnd16; is faster than 64. This is quite recent, see e.g. github.com/JuliaLang/julia/issues/40216 – Grearson
@mcabbott, I had somewhat guessed at the conversion possibility. Thank you so much! – Bussard
What CPU do you have? If it's x86, does it have AVX512-FP16 for direct support of fp16 without conversion, scalar and SIMD? (Sapphire Rapids and newer, and probably Alder Lake with unlocked AVX-512, unfortunately not Zen 4.) If not, most x86 CPUs for the last decade have instructions for packed conversion between fp16 and fp32, but that's it; see Half-precision floating-point arithmetic on Intel chips. If your CPU doesn't even have F16C, it would take multiple instructions to convert. – M16
Half precision floats are often used to save memory, not speed. – Chapin
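A quick REPL check of the memory point in that last comment (a minimal sketch; Base.summarysize includes a small array header, so exact byte counts vary a bit across Julia versions):

julia> sizeof(Float16), sizeof(Float32), sizeof(Float64)
(2, 4, 8)

julia> Base.summarysize(rand(Float16, 1000));  # ≈ 2 KiB: 1000 × 2 bytes plus the header

julia> Base.summarysize(rand(Float64, 1000));  # ≈ 8 KiB: 1000 × 8 bytes plus the header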

As you can see, the effect you are expecting is present for Float32:

julia> rnd64 = rand(Float64, 1000);

julia> rnd32 = rand(Float32, 1000);

julia> rnd16 = rand(Float16, 1000);

julia> @btime $rnd64.^2;
  616.495 ns (1 allocation: 7.94 KiB)

julia> @btime $rnd32.^2;
  330.769 ns (1 allocation: 4.06 KiB)  # faster!!

julia> @btime $rnd16.^2;
  2.067 μs (1 allocation: 2.06 KiB)  # slower!!

Float64 and Float32 have hardware support on most platforms, but Float16 does not, and must therefore be implemented in software.
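You can see the software fallback directly by inspecting the generated code (a sketch; square is just an illustrative helper, and the exact assembly depends on your CPU and Julia version). Without native FP16 arithmetic, the Float16 method wraps the multiply in conversions to and from Float32, while the Float64 method is essentially a single multiply instruction:

julia> square(x) = x * x;

julia> @code_native square(Float64(0.5))  # one multiply (e.g. vmulsd on x86-64)

julia> @code_native square(Float16(0.5))  # convert to Float32, multiply, convert back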

Note also that you should use variable interpolation ($) when micro-benchmarking. The difference is significant here, not least in terms of allocations:

julia> @btime $rnd32.^2;
  336.187 ns (1 allocation: 4.06 KiB)

julia> @btime rnd32.^2;
  930.000 ns (5 allocations: 4.14 KiB)
Kamakura answered 6/12, 2022 at 15:9 Comment(1)
x86 since Ivy Bridge has had hardware support for conversion between FP16 and FP32, but VCVTPH2PS YMM, XMM or VCVTPH2PS YMM, mem is still 2 uops on Intel, and converting back with a memory or register destination is 4 or 3 uops on Haswell (which is what the OP's 2013 CPU might be, or it might be Ivy Bridge). The conversion uops also compete for limited back-end ports: port 1 in both directions on Ivy Bridge and Haswell, plus the shuffle port (port 5) except for the memory-source version. It's an AVX instruction; IDK if Julia would use it automatically. – M16

The short answer is that you probably shouldn't use Float16 unless you are using a GPU or an Apple CPU because (as of 2022) other processors don't have hardware support for Float16.

Repugn answered 6/12, 2022 at 14:33 Comment(4)
@JUL: Support didn't exist 9 years ago either. – Ottoman
Not quite true that no other CPUs have support: Alder Lake with unlocked AVX-512 has AVX512-FP16, giving scalar and packed-SIMD support for FP16 (not just BF16). Also Sapphire Rapids Xeon, although that hasn't officially launched yet. See en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512 for a table of extensions by CPU, and Half-precision floating-point arithmetic on Intel chips. But yes, no mainstream x86 CPUs with a launch date before 2023 have officially supported FP16 on the CPU, only the iGPU. – M16
I wouldn't say that you shouldn't use Float16 on other hardware. In a specialized circumstance where you're doing a bunch of number crunching, and don't require numbers bigger than 65504, don't require more than 3 decimal digits of precision, and don't require maximizing CPU speed, but you have massive arrays of these numbers and memory is at a premium, then using Float16 would be a useful optimization (see the sketch after these comments). OTOH, if you don't need a lot of memory but do need speed or accuracy, use Float64. – Freud
Yeah, there are technically places where it can be useful, but there usually is some other form of memory consumption that will be faster at that point. – Repugn
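To make the limits mentioned above concrete, here is what Float16 can actually represent (a quick REPL sketch; the exact printed form may differ slightly across Julia versions):

julia> floatmax(Float16)  # largest finite Float16, i.e. the 65504 ceiling from the comment above
Float16(6.55e4)

julia> eps(Float16)       # spacing at 1.0, i.e. roughly 3 significant decimal digits
Float16(0.000977)

julia> Float16(1/3)       # note the rounding
Float16(0.3333)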
