Why is operating on Float64 faster than Float16?

I wonder why operating on Float64 values is faster than operating on Float16:

julia> rnd64 = rand(Float64, 1000);

julia> rnd16 = rand(Float16, 1000);

julia> @benchmark rnd64.^2
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.800 μs … 662.140 μs  ┊ GC (min … max):  0.00% … 99.37%
 Time  (median):     2.180 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   3.457 μs ±  13.176 μs  ┊ GC (mean ± σ):  12.34% ±  3.89%

  ▁██▄▂▂▆▆▄▂▁ ▂▆▄▁                                     ▂▂▂▁   ▂
  ████████████████▇▇▆▆▇▆▅▇██▆▆▅▅▆▄▄▁▁▃▃▁▁▄▁▃▄▁▃▁▄▃▁▁▆▇██████▇ █
  1.8 μs       Histogram: log(frequency) by time      10.6 μs <

 Memory estimate: 8.02 KiB, allocs estimate: 5.

julia> @benchmark rnd16.^2
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.117 μs … 587.133 μs  ┊ GC (min … max): 0.00% … 98.61%
 Time  (median):     5.383 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.716 μs ±   9.987 μs  ┊ GC (mean ± σ):  3.01% ±  1.71%

    ▃▅█▇▅▄▄▆▇▅▄▁             ▁                                ▂
  ▄██████████████▇▆▇▆▆▇▆▇▅█▇████▇█▇▇▆▅▆▄▇▇▆█▇██▇█▇▇▇▆▇▇▆▆▆▆▄▄ █
  5.12 μs      Histogram: log(frequency) by time      7.48 μs <

 Memory estimate: 2.14 KiB, allocs estimate: 5.

You may ask why I expected the opposite: because Float16 values have less floating-point precision:

julia> rnd16[1]
Float16(0.627)

julia> rnd64[1]
0.4375452455597999

Shouldn't calculations with less precision run faster? And if not, why would anyone use Float16 at all? They might as well use Float128!

Bussard asked 6/12, 2022 at 14:6 Comment(4)
There's hardware support for 32 & 64, but I think Float16 is converted before most operations: docs.julialang.org/en/v1/manual/… . On ARM processors (like an M1 mac) there is some support, e.g. @btime $(similar(rnd16)) .= 2 .* $rnd16; is faster than 64. This is quite recent, see e.g. github.com/JuliaLang/julia/issues/40216 – Grearson
@mcabbott, I had somewhat guessed at the conversion possibility. Thank you so much! – Bussard
What CPU do you have? If it's x86, does it have AVX512-FP16 for direct support of fp16 without conversion, scalar and SIMD? (Sapphire Rapids and newer, and probably Alder Lake with unlocked AVX-512, unfortunately not Zen 4.) If not, most x86 CPUs for the last decade have instructions for packed conversion between fp16 and fp32, but that's it; see Half-precision floating-point arithmetic on Intel chips. If your CPU doesn't even have F16C, it would take multiple instructions to convert. – M16
Half precision floats are often used to save memory, not speed. – Chapin
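A quick REPL check of the memory point in that last comment (a minimal sketch; Base.summarysize includes a small array header, so exact byte counts vary a bit across Julia versions):

julia> sizeof(Float16), sizeof(Float32), sizeof(Float64)
(2, 4, 8)

julia> Base.summarysize(rand(Float16, 1000));  # ≈ 2 KiB: 1000 × 2 bytes plus the header

julia> Base.summarysize(rand(Float64, 1000));  # ≈ 8 KiB: 1000 × 8 bytes plus the header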

As you can see, the effect you are expecting is present for Float32:

julia> rnd64 = rand(Float64, 1000);

julia> rnd32 = rand(Float32, 1000);

julia> rnd16 = rand(Float16, 1000);

julia> @btime $rnd64.^2;
  616.495 ns (1 allocation: 7.94 KiB)

julia> @btime $rnd32.^2;
  330.769 ns (1 allocation: 4.06 KiB)  # faster!!

julia> @btime $rnd16.^2;
  2.067 μs (1 allocation: 2.06 KiB)  # slower!!

Float64 and Float32 have hardware support on most platforms, but Float16 does not, and must therefore be implemented in software.
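You can see the software fallback directly by inspecting the generated code (a sketch; square is just an illustrative helper, and the exact assembly depends on your CPU and Julia version). Without native FP16 arithmetic, the Float16 method wraps the multiply in conversions to and from Float32, while the Float64 method is essentially a single multiply instruction:

julia> square(x) = x * x;

julia> @code_native square(Float64(0.5))  # one multiply (e.g. vmulsd on x86-64)

julia> @code_native square(Float16(0.5))  # convert to Float32, multiply, convert back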

Note also that you should use variable interpolation ($) when micro-benchmarking. The difference is significant here, not least in terms of allocations:

julia> @btime $rnd32.^2;
  336.187 ns (1 allocation: 4.06 KiB)

julia> @btime rnd32.^2;
  930.000 ns (5 allocations: 4.14 KiB)
Kamakura answered 6/12, 2022 at 15:9 Comment(1)
x86 since Ivy Bridge has had hardware support for conversion between FP16 and FP32, but VCVTPH2PS YMM, XMM or VCVTPH2PS YMM, mem is still 2 uops on Intel, and converting back with a memory or register destination is 4 or 3 uops on Haswell (which is what the OP's 2013 CPU might be, or it might be Ivy Bridge). The conversion uops also compete for limited back-end ports: port 1 in both directions on Ivy Bridge and Haswell, plus the shuffle port (port 5) except for the memory-source version. It's an AVX instruction; IDK if Julia would use it automatically. – M16

The short answer is that you probably shouldn't use Float16 unless you are using a GPU or an Apple CPU because (as of 2022) other processors don't have hardware support for Float16.

Repugn answered 6/12, 2022 at 14:33 Comment(4)
@JUL: Support didn't exist 9 years ago either. – Ottoman
Not quite true that no other CPUs have support: Alder Lake with unlocked AVX-512 has AVX512-FP16, giving scalar and packed-SIMD support for FP16 (not just BF16). Also Sapphire Rapids Xeon, although that hasn't officially launched yet. See en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512 for a table of extensions by CPU, and Half-precision floating-point arithmetic on Intel chips. But yes, no mainstream x86 CPUs with a launch date before 2023 have officially supported FP16 on the CPU, only the iGPU. – M16
I wouldn't say that you shouldn't use Float16 on other hardware. In a specialized circumstance where you're doing a bunch of number crunching, and don't require numbers bigger than 65504, don't require more than 3 decimal digits of precision, and don't require maximizing CPU speed, but you have massive arrays of these numbers and memory is at a premium, then using Float16 would be a useful optimization (see the sketch after these comments). OTOH, if you don't need a lot of memory but do need speed or accuracy, use Float64. – Freud
Yeah, there are technically places where it can be useful, but there usually is some other form of memory consumption that will be faster at that point. – Repugn
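To make the limits mentioned above concrete, here is what Float16 can actually represent (a quick REPL sketch; the exact printed form may differ slightly across Julia versions):

julia> floatmax(Float16)  # largest finite Float16, i.e. the 65504 ceiling from the comment above
Float16(6.55e4)

julia> eps(Float16)       # spacing at 1.0, i.e. roughly 3 significant decimal digits
Float16(0.000977)

julia> Float16(1/3)       # note the rounding
Float16(0.3333)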
