Half-precision floating-point arithmetic on Intel chips

Is it possible to perform half-precision floating-point arithmetic on Intel chips?

I know how to load/store/convert half-precision floating-point numbers [1] but I do not know how to add/multiply them without converting to single-precision floating-point numbers.

[1] https://software.intel.com/en-us/articles/performance-benefits-of-half-precision-floats

Junie answered 24/4, 2018 at 7:19 Comment(0)

related: https://scicomp.stackexchange.com/questions/35187/is-half-precision-supported-by-modern-architecture - has some info about BFloat16 in Cooper Lake and Sapphire Rapids, and some non-Intel info.

Sapphire Rapids will have both BF16 and FP16, with FP16 using the same IEEE 754 binary16 format as the F16C conversion instructions, not brain-float. And AVX512-FP16 has support for most math operations, unlike BF16 which just has conversion from single precision and a dot product that accumulates pairs into single precision.

This also applies to Alder Lake, on systems with the E cores disabled and AVX-512 specifically enabled in the BIOS (which apparently isn't officially supported as of now; only some mobo vendors have options for this.)
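
For example, with a compiler that provides the AVX512-FP16 intrinsics (recent GCC or Clang with -mavx512fp16), doing the math directly on half-precision vectors looks something like the sketch below. This is just an illustration (the function name and the assumption that n is a multiple of 32 are mine), not tuned code:

    // Minimal sketch: native FP16 FMA with AVX512-FP16 (Sapphire Rapids).
    // Build with e.g. -mavx512fp16; assumes n % 32 == 0 for brevity.
    #include <immintrin.h>
    #include <stddef.h>

    // c[i] += a[i] * b[i], 32 half-precision elements per vector, no conversion to float.
    void fma_fp16(_Float16 *c, const _Float16 *a, const _Float16 *b, size_t n)
    {
        for (size_t i = 0; i < n; i += 32) {
            __m512h va = _mm512_loadu_ph(a + i);
            __m512h vb = _mm512_loadu_ph(b + i);
            __m512h vc = _mm512_loadu_ph(c + i);
            vc = _mm512_fmadd_ph(va, vb, vc);   // VFMADD...PH: real FP16 math, no widening
            _mm512_storeu_ph(c + i, vc);
        }
    }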

(The rest of the answer isn't updated for Sapphire Rapids / Alder Lake having FP16 / BF16.)


With the on-chip GPU

Is it possible to perform half-precision floating-point arithmetic on Intel chips?

Yes, apparently the on-chip GPU in Skylake and later has hardware support for FP16 and FP64, as well as FP32. With new enough drivers you can use it via OpenCL.

On earlier chips you get about the same throughput for FP16 vs. FP32 (probably just converting on the fly for nearly free), but on SKL / KBL chips you get about double the throughput of FP32 for GPGPU Mandelbrot (note the log-scale on the Mpix/s axis of the chart in that link).

The gain in FP64 (double) performance was huge, too, on Skylake iGPU.


With AVX / AVX-512 instructions

But on the IA cores (Intel-Architecture) no; even with AVX512 there's no hardware support for anything but converting them to single-precision. This saves memory bandwidth and can certainly give you a speedup if your code bottlenecks on memory. But it doesn't gain in peak FLOPS for code that's not bottlenecked on memory.

You could of course implement software floating point, possibly even in SIMD registers, so technically the answer is still "yes" to the question you asked, but it won't be faster than using the F16C VCVTPH2PS / VCVTPS2PH instructions + packed-single vmulps / vfmadd132ps HW support.

Use HW-supported SIMD conversion to/from float / __m256 in x86 code to trade extra ALU conversion work for reduced memory bandwidth and cache footprint. But if cache-blocking (e.g. for well-tuned dense matmul) or very high computational intensity means you're not memory bottlenecked, then just use float and save on ALU operations.
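
For example, a rough sketch of that convert-on-load pattern with F16C + FMA intrinsics (the function name and the assumption that n is a multiple of 8 are mine, just for illustration):

    // Minimal sketch: keep arrays in memory as binary16 to halve bandwidth and cache
    // footprint, but do the actual math in single precision via VCVTPH2PS / VCVTPS2PH.
    // Build with e.g. -mf16c -mfma; assumes n % 8 == 0 for brevity.
    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    // dst[i] += a[i] * b[i], with a, b, dst stored as FP16 (uint16_t bit patterns).
    void fma_halves(uint16_t *dst, const uint16_t *a, const uint16_t *b, size_t n)
    {
        for (size_t i = 0; i < n; i += 8) {
            __m256 va = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i*)(a + i)));
            __m256 vb = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i*)(b + i)));
            __m256 vd = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i*)(dst + i)));
            vd = _mm256_fmadd_ps(va, vb, vd);                     // math done in FP32
            _mm_storeu_si128((__m128i*)(dst + i),
                             _mm256_cvtps_ph(vd, _MM_FROUND_TO_NEAREST_INT));
        }
    }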


Upcoming: bfloat16 (Brain Float) and AVX512 BF16

A new 16-bit FP format with the same exponent range as IEEE binary32 has been developed for neural network use-cases. Compared to IEEE binary16 like x86 F16C conversion instructions use, it has much less significand precision, but apparently neural network code cares more about dynamic range from a large exponent range. This allows bfloat hardware not to even bother supporting subnormals.
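
Since bfloat16 is literally the top 16 bits of an IEEE binary32 bit pattern, you can illustrate the format (and do slow scalar conversions) with plain integer code, no special hardware needed. A rough sketch, using truncation instead of the round-to-nearest-even that real HW uses:

    // Illustration of the BF16 format: sign + 8-bit exponent + 7-bit significand,
    // i.e. the high half of a binary32. Real HW (and AVX512 BF16) rounds to
    // nearest-even instead of truncating like this.
    #include <stdint.h>
    #include <string.h>

    static uint16_t float_to_bf16_truncate(float f)
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);     // safe type-pun
        return (uint16_t)(bits >> 16);      // drop the low 16 significand bits
    }

    static float bf16_to_float(uint16_t bf)
    {
        uint32_t bits = (uint32_t)bf << 16; // low significand bits become zero
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }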

Some upcoming Intel x86 CPU cores will have HW support for this format. The main use-case is still dedicated neural network accelerators (Nervana) and GPGPU type devices, but HW-supported conversion at least is very useful.

https://en.wikichip.org/wiki/brain_floating-point_format has more details, specifically that Cooper Lake Xeon and Core X CPUs are expected to support AVX512 BF16.

I haven't seen it mentioned for Ice Lake (Sunny Cove microarch). That could go either way; I wouldn't care to guess.

Intel® Architecture Instruction Set Extensions and Future Features Programming Reference revision -036 in April 2019 added details about BF16, including that it's slated for "Future, Cooper Lake". Once it's released, the documentation for the instructions will move to the main vol.2 ISA ref manual (and the pdf->HTML scrape at https://www.felixcloutier.com/x86/index.html).

https://github.com/HJLebbink/asm-dude/wiki has instructions from vol.2 and the future-extensions manual, so you can already find it there.

There are only 3 instructions: two forms of conversion from float to BF16, and a BF16 multiply + pairwise-accumulate into float. (The first horizontal step of a dot-product.) So AVX512 BF16 does finally provide true computation for 16-bit floating point, but only in this very limited form that converts the result to float.

All three also ignore MXCSR, always using the default rounding mode and DAZ/FTZ, and not setting any exception flags.

  • VCVTNEPS2BF16 [xyz]mm1{k1}{z}, [xyz]mm2/m512/m32bcst
    ConVerT (No Exceptions) Packed Single 2(to) BF16.
    __m256bh _mm512_cvtneps_pbh (__m512); plus masked versions.

This one-source conversion to BF16 apparently can suppress memory faults when masking with a memory source operand, because the same mask can apply to the 32-bit source elements as to the 16-bit destination elements. The other two instructions (below) don't support memory fault-suppression, presumably because their masking is per destination element and there are a different number of source elements.

  • VCVTNE2PS2BF16 [xyz]mm1{k1}{z}, [xyz]mm2, [xyz]mm3/m512/m32bcst
    ConVerT (No Exceptions) 2 registers of Packed Single 2(to) BF16.
    __m512bh _mm512_cvtne2ps_pbh (__m512, __m512);

  • VDPBF16PS [xyz]mm1{k1}{z}, [xyz]mm2, [xyz]mm3/m512/m32bcst
    Dot Product of BF16 Pairs Accumulated into Packed Single Precision
    __m512 _mm512_dpbf16_ps(__m512, __m512bh, __m512bh); (Notice that even the unmasked version has a 3rd input for the destination accumulator, like an FMA).

      # the key part of the Operation section:
      t ← src2.dword[ i ]  (or  src2.dword[0] for a broadcast memory source)
      srcdest.fp32[ i ] += make_fp32(src1.bfloat16[2*i+1]) * make_fp32(t.bfloat16[1])
      srcdest.fp32[ i ] += make_fp32(src1.bfloat16[2*i+0]) * make_fp32(t.bfloat16[0])
    

So we still don't get native 16-bit FP math that you can use for arbitrary things while keeping your data in 16-bit format for 32 elements per vector. Only FMA into 32-bit accumulators.
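
To show what that looks like in practice, here's a rough sketch of a dot product that converts float inputs to BF16 on the fly and accumulates in FP32 (the function name and the assumption that n is a multiple of 32 are mine; needs AVX512F + AVX512BF16):

    // Minimal sketch: pack pairs of float vectors into BF16 and accumulate
    // their products into single-precision with VDPBF16PS.
    // Build with e.g. -mavx512bf16; assumes n % 32 == 0 for brevity.
    #include <immintrin.h>
    #include <stddef.h>

    float dot_bf16(const float *a, const float *b, size_t n)
    {
        __m512 acc = _mm512_setzero_ps();
        for (size_t i = 0; i < n; i += 32) {
            // Each VCVTNE2PS2BF16 packs 32 floats (two __m512) into 32 BF16 elements.
            __m512bh va = _mm512_cvtne2ps_pbh(_mm512_loadu_ps(a + i + 16),
                                              _mm512_loadu_ps(a + i));
            __m512bh vb = _mm512_cvtne2ps_pbh(_mm512_loadu_ps(b + i + 16),
                                              _mm512_loadu_ps(b + i));
            acc = _mm512_dpbf16_ps(acc, va, vb);  // pairwise BF16 products into FP32 lanes
        }
        return _mm512_reduce_add_ps(acc);         // horizontal sum of the 16 accumulators
    }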


BTW, there are other real-number formats that aren't based on the IEEE-754 structure of fixed-width fields for sign/exponent/significand. One that's gaining popularity is Posit. https://en.wikipedia.org/wiki/Unum_(number_format), Beating Floating Point at its Own Game: Posit Arithmetic, and https://posithub.org/about

Instead of spending the whole significand coding space on NaNs, they use it for tapered / gradual overflow, supporting larger range. (And removing NaN simplifies the HW). IEEE floats only support gradual underflow (with subnormals), with hard overflow to +-Inf. (Which is usually an error/problem in real numerical simulations, not much different from NaN.)

The Posit encoding is sort of a variable width exponent, leaving more precision near 1.0. The goal is to allow using 32-bit or 16-bit precision in more cases (instead of 64 or 32) while still getting useful results for scientific computing / HPC, such as climate modeling. Double the work per SIMD vector, and half the memory bandwidth.

There have been some paper designs for Posit FPU hardware, but it's still early days, and I think only FPGA implementations have really been built. Some Intel CPUs will come with onboard FPGAs (or maybe that's already a thing).

As of mid-2019 I haven't read about any Posit execution units as part of a commercial CPU design, and google didn't find anything.

Livery answered 24/4, 2018 at 9:18 Comment(14)
Zooming into the Mandelbrot set with half precision is not going to go very deep. Using perturbation, the limitation moves from the significand to the exponent. The minimum exponent of half precision is 2^-14, so you could zoom to about 10^-5 at twice the speed of single precision, which can zoom to about 10^-38 with perturbation. Double goes to 10^-324, and x87 long double down to 10^−4951. That's the only case I know of where x87 is still useful. Double-double and quad precision do not help because they don't change the exponent precision.Foretooth
@Zboson: GPU mandelbrot is presumably not about zooming or being useful, but rather just a well-known and simple problem with very high computational intensity / low memory bandwidth. (And a data dependency chain which could limit ILP). That page had some other benchmarks, too, but I like Mandelbrot.Livery
Peter, just in case you know, is there a performance benefit in loading/storing half floats to/from AVX units, while still processing in full float precision, assuming large matrix multiplication as the most common example? To a first-order approximation this seems beneficial, as it essentially halves cache use and memory bandwidth. If you feel that it's worth a full answer in itself, not a short update, I'd be happy to post a separate Q.Stylize
@kkm: With proper cache-blocking (aka loop tiling), dense matmul isn't memory bound. It's ALU bound, and spending uops on f16 conversion would take cycles on the FMA ports. (And / or front-end bandwidth would be a problem, too, if you can't use a memory-source operand for FMA). In a badly optimized matmul that loads input data into L2 or L1d cache more than once, f16 might be an improvement. But with O(n^3) ALU work over O(n^2) data, it's generally possible to keep memory bandwidth down to O(n^2).Livery
Thank you! I admit most of this stuff flies way over my head (this is why I am using MKL for BLAS :) ), but the takeaway is that the answer to my question is no.Stylize
@PeterCordes: Interesting. The Anandtech article, and the Intel document, suggest that BF16 only has conversion instructions and dot products.Valerie
@wim: Thanks, fixed. I hadn't looked up the manual yet and was being overly optimistic. :/Livery
I'm not familiar with neural network algorithms, but probably it is possible to express the majority of such computations in terms of VDPBF16PS, without needing too many adds or muls. Nice addition.Valerie
@bobcat: 1. that paragraph is talking about the iGPU. 2. Sandybridge doesn't even support the F16C extension for AVX conversion to/from packed float. 3. software conversion of the inputs => matmul => software conversion of the result would probably not be as bad as 1000x slower, so badly optimized software (for that lack-of-HW-support case) is apparently making things extra slow.Livery
@bobcat: added big headers to make that cheeky interpretation of "Intel chips" more obvious.Livery
It's my understanding that Intel's iGPUs are pretty weak. I wonder if using them for gpgpu is worthwhile. Apparently, MKL doesn't support themLakes
@bobcat: They're relatively weak, but IIRC better than the total FP throughput of the IA cores, on a quad-core desktop. I only mention them because the question is about hardware support for half-precision float "on Intel chips". Whether or not it's worth going out of your way to actually use the iGPU is a totally separate question that I wasn't trying to answer, not even implicitly by choosing to mention it.Livery
"even with AVX512 there's no hardware support for anything but converting them to single-precision" -- Strange... en.wikipedia.org/wiki/AVX-512 lists a bunch of FP16 arithmetic instructions.Lakes
@bobcat: That was true when this answer was new. Remember that AVX-512 isn't a monolithic thing. Instead of new numbers like AVX-512-2, they name new extensions like AVX-512VBMI, or AVX-512FP16. Sapphire Rapids hasn't released yet, so sometime I should get around to updating this answer for it, with that very new extension that adds native FP16 support, thanks for the reminder. And the AVX-512BF16 support that was still "upcoming" / on paper when I wrote this has now released in Cooper Lake. If you want to edit this answer, feel free to throw in a mention of that upcoming extension.Livery

If you are using all cores, I would think that in many cases you are still memory-bandwidth bound, and half-precision floating point would be a win.

Cheloid answered 26/6, 2019 at 10:27 Comment(1)
Yes, that's correct. Maybe that wasn't 100% clear from my answer, I'll reword it.Livery
