Why does bfloat16 have so many exponent bits?

It's clear why a 16-bit floating-point format has started seeing use for machine learning; it reduces the cost of storage and computation, and neural networks turn out to be surprisingly insensitive to numeric precision.

What I find particularly surprising is that practitioners abandoned the already-defined half-precision format in favor of one that allocates only 7 bits to the significand, but 8 bits to the exponent – fully as many as 32-bit FP. (Wikipedia compares the bfloat16 "brain float" layout against IEEE binary16 and some 24-bit formats.)
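
For concreteness, here is the back-of-the-envelope arithmetic behind those layouts (field widths as listed on that Wikipedia page), using the usual formulas: largest finite value (2 - 2^-p) * 2^bias and spacing 2^-p just above 1.0, for p fraction bits:

    # Back-of-the-envelope comparison of the three layouts mentioned above.
    # Field widths are (sign bits, exponent bits, fraction bits).
    formats = {
        "IEEE binary16": (1, 5, 10),
        "bfloat16":      (1, 8, 7),
        "IEEE binary32": (1, 8, 23),
    }

    for name, (sign, exp, frac) in formats.items():
        bias = 2 ** (exp - 1) - 1
        max_finite = (2 - 2.0 ** -frac) * 2.0 ** bias   # largest finite value
        epsilon = 2.0 ** -frac                          # spacing just above 1.0
        print(f"{name:13s}  max ~ {max_finite:9.3g}   epsilon = {epsilon:.2e}")

Run, this prints a max of about 6.6e4 for binary16 versus about 3.4e38 for both bfloat16 and binary32; the price bfloat16 pays is three fewer significand bits than binary16.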

Why so many exponent bits? So far, I have only found https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus

Based on our years of experience training and deploying a wide variety of neural networks across Google’s products and services, we knew when we designed Cloud TPUs that neural networks are far more sensitive to the size of the exponent than that of the mantissa. To ensure identical behavior for underflows, overflows, and NaNs, bfloat16 has the same exponent size as FP32. However, bfloat16 handles denormals differently from FP32: it flushes them to zero. Unlike FP16, which typically requires special handling via techniques such as loss scaling [Mic 17], BF16 comes close to being a drop-in replacement for FP32 when training and running deep neural networks.

I haven't run neural network experiments on anything like Google scale, but in those I have run, a weight or activation with absolute value much greater than 1.0 means the network has gone into the weeds and is going to spiral off to infinity, and the computer would be doing you a favor if it promptly crashed with an error message. I have never seen or heard of any case that needs a dynamic range anything like the 1e38 of single-precision floating point.

So what am I missing?

Are there cases where neural networks really need huge dynamic range? If so, how, why?

Is there some reason why it is considered very beneficial for bfloat16 to use the same exponent as single precision, even though the significand is much smaller?

Or is it the case that the real goal was to shrink the significand to the absolute minimum that would still do the job, in order to minimize the chip area and energy cost of the multipliers, the most expensive part of an FPU; that this turned out to be around 7 bits; that the total size should be a power of two for alignment reasons and would not quite fit in 8 bits; and that going up to 16 bits left surplus bits that might as well be used for something, the most elegant choice being to keep the 8-bit exponent?

Wideawake asked 2/6, 2022 at 10:33 Comment(15)
I suspect it is for code simplification: one byte exponent, one byte significand/sign.Kislev
@chux-ReinstateMonica I can understand that perspective, but it does not sound quite right to me. The implementation is in hardware, and it seems to me that even were it in software, you would want it the other way around; software FP wants to put the sign with the exponent, not the significand?Wideawake
I think the 7-bit significand is encoded with an implied most significant bit. In software processing, code can then replace the sign bit with a true 1 as the most significant bit and continue. Although the implementation you use may use FP hardware making this irrelevant, the history of the format may have its roots in a software FP implementation.Kislev
@chux-ReinstateMonica Ah! Yes, okay, I see what you mean, that is a good point – if the implementation were in software. But in that case, you would arrange the format with the sign bit next to the significand, which I don't think is the case here. And even the choice to use bfloat16 at all, represents such an extreme optimization for hardware cost, that it seems unlikely for a slight simplification of a prototype software implementation, to have been a major influence.Wideawake
32-bit sample.Kislev
@chux-ReinstateMonica Good example! See, it has the sign bit next to the significand. Compare that to bfloat16 (shown in the page I linked), which has the sign bit as the MSB, not next to the significand.Wideawake
@rwallace On further review, I think the extra exponent bits are simply there to handle more range – as you say, "surprisingly insensitive to numeric precision", and the link has "neural networks are far more sensitive to the size of the exponent than that of the mantissa". Good luck.Kislev
I assume this is mostly to simplify support on hardware which does not support FP16 natively. You just need to copy the upper 16 bits.Worried
@Worried Hmm, maybe you're right! Though if you do it that way, you always round toward zero (instead of to nearest even), which creates systematic bias. That can matter for some applications. I don't know whether it matters for neural networks.Wideawake
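
To make those two conversions concrete, here is a minimal sketch using NumPy to hold the bit patterns (the helper names are mine, and NaN handling is omitted): truncation simply keeps the upper 16 bits of the float32 pattern, while round-to-nearest-even adds a rounding increment first.

    import numpy as np

    def f32_bits(x):
        """Bit pattern of a float32 value, as a Python int."""
        return int(np.array([x], dtype=np.float32).view(np.uint32)[0])

    def bf16_truncate(x):
        """float32 -> bfloat16 by keeping the upper 16 bits (rounds toward zero)."""
        return f32_bits(x) >> 16

    def bf16_round_nearest(x):
        """float32 -> bfloat16 with round-to-nearest-even (NaN handling omitted)."""
        bits = f32_bits(x)
        lsb = (bits >> 16) & 1              # ties go to the value with an even last bit
        return (bits + 0x7FFF + lsb) >> 16

    def bf16_to_f32(b):
        """bfloat16 -> float32 is exact: put the 16 bits back in the upper half."""
        return float(np.array([b << 16], dtype=np.uint32).view(np.float32)[0])

    x = 1 + 2**-8 + 2**-10   # lies between two bfloat16 values, nearer the upper one
    print(bf16_to_f32(bf16_truncate(x)))       # 1.0        (biased toward zero)
    print(bf16_to_f32(bf16_round_nearest(x)))  # 1.0078125  (the nearest bfloat16)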
You write “I have never seen or heard of any case that needs a dynamic range anything like the 1e38” but the text you quote says “Based on our years of experience training and deploying a wide variety of neural networks across Google’s products and services, we knew when we designed Cloud TPUs that neural networks are far more sensitive to the size of the exponent than that of the mantissa,” so clearly those authors have had experiences you have not. Your lack of seeing such a case does not conflict with those authors having seen such cases. That speaks to your last two paragraphs…Fachan
… The answer to “Is there some reason why it is considered very beneficial for bfloat16 to use the same exponent as single precision, even though the significand is much smaller?” is yes, the reason is neural networks are sensitive to the size of the exponent. The answer to “Or is it the case that the real goal was to shrink…” is no, that was not a stated goal. It also answers “Are there cases where neural networks really need huge dynamic range?” with yes…Fachan
… That leaves just “If so, how, why?” unanswered. I do not have examples that would answer it; I am just showing the posted question can be reduced to this. You should probably edit the question to remove most of the text and change it to simply asking what are examples of neural networks that are sensitive to the exponent range to this degree.Fachan
@EricPostpischil: A big dynamic range may help to avoid overflow in temporaries. IIRC, neural network stuff often uses logarithms, so there are probably operations that multiply some numbers together as inputs for that. If you have code that's known to work (and not overflow anything to +Inf) with float, then keeping the same exponent range usually means you're ok with bfloat. Maybe? Not posting as an answer because this is conjecture based on vague memory of something I read a while ago.Cloudland
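
A small illustration of that conjecture (the value 12 here is an arbitrary pick, not from the thread): exp() of a modest activation already overflows binary16's largest finite value of 65504, but sits far below the ~3.4e38 ceiling that bfloat16 shares with float32.

    import numpy as np

    x = 12.0                         # a modest logit / activation value
    print(np.exp(np.float16(x)))     # inf: exp(12) ~ 1.6e5 exceeds binary16's max
                                     # (NumPy may also emit an overflow warning)
    print(np.exp(np.float32(x)))     # 162754.79: well inside the float32 range, and
                                     # therefore inside bfloat16's, whose exponent
                                     # field is the same size
    print(np.finfo(np.float16).max)  # 65504.0
    print(np.finfo(np.float32).max)  # 3.4028235e+38

This would fit the quoted claim that bfloat16 is closer to a drop-in replacement for FP32 than FP16 is: anything that stays finite in float32 also stays finite in bfloat16.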
Did you ever get a satisfactory answer to this? In light of the patent battle google is now fighting against the alleged bf16 inventor, this question is hitting the spotlightVallie
@Vallie Nope! I didn't know about the patent battle.Wideawake

Collecting some of the discussion from the comments:

  • This greatly simplifies implementation on a system without hardware support for bfloat16: the implementation can convert to and from IEEE single precision simply by dropping the low-order 16 bits in one direction and zero-filling them in the other.
  • Your quote from Shibo Wang and Pankaj Kanwar states that the inventors prioritized dynamic range over precision. The format preserves the dynamic range of a single-precision float and cuts storage in half, at the cost of precision.
  • Some implementations might be able to get good performance by representing the mantissas and exponents as 8-bit quantities (including the implicit leading 1 of the mantissa); a field-extraction sketch follows this list.
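
For illustration, a minimal sketch of that last point, assuming the layout from the question (1 sign bit, 8 exponent bits, 7 stored fraction bits); the function name is mine, not from any library:

    def bf16_fields(b):
        """Split a bfloat16 bit pattern (given as an int) into its fields."""
        sign = (b >> 15) & 0x1
        exponent = (b >> 7) & 0xFF        # 8 bits, bias 127, same as float32
        fraction = b & 0x7F               # 7 stored bits
        # 8-bit significand with the implicit leading 1 made explicit; the all-zero
        # exponent codes (zero and the denormals the quoted post says are flushed)
        # have no implicit 1.
        significand = fraction if exponent == 0 else fraction | 0x80
        return sign, exponent, significand

    print(bf16_fields(0x3F81))   # (0, 127, 129), i.e. +(129/128) * 2**0 = 1.0078125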
Inkerman answered 2/6, 2022 at 10:33 Comment(0)
