Performance penalty: denormalized numbers versus branch mis-predictions

For those who have already measured or have deep knowledge about this kind of consideration, assume that you have to perform the following floating-point operation (just picking one for the example):

float calc(float y, float z)
{ return sqrt(y * y + z * z) / 100; }

Where y and z could be denormal numbers. Let's assume two possible situations in which y alone, z alone, or maybe both, in a totally random manner, can be denormal numbers:

  • 50% of the time
  • <1% of the time

And now assume I want to avoid the performance penalty of dealing with denormal numbers and I just want to treat them as 0, so I change that piece of code to:

float calc(float y, float z)
{
   bool yzero = y < 1e-37;
   bool zzero = z < 1e-37;
   bool all_zero = yzero and zzero;
   bool some_zero = yzero != zzero;

   if (all_zero)
      return 0.0f;

   float ret;

   if (!some_zero) ret = sqrt(y * y + z * z);
   else if (yzero) ret = z;
   else if (zzero) ret = y;

   return ret / 100;
}

Which will be worse: the performance penalty for branch misprediction (for the 50% or <1% cases), or the performance penalty for working with denormal numbers?

To properly interpret which operations can be normal or denormal in the previous piece of code, I would also like to get some one-line (but totally optional) answers to the following closely related questions:

float x = 0f; // Will x be just 0 or maybe some number like 1e-40;
float y = 0.; // I assume the conversion is just thin-air here and the compiler will see just a 0.
0; // Is "exact zero" a normal or a denormal number?
float z = x / 1; // Will this "no-op" (x == 0) cause z be something like 1e-40 and thus denormal?
float zz = x / c; // What about a "no-op" operating against any compile-time constant?
bool yzero = y < 1e-37; // Do comparisons have any performance penalty when y is denormal, or not?
Ambrosio answered 1/4, 2020 at 11:32 Comment(9)
The only real answer is to measure.Sinclair
On what CPU? IIRC, AMD CPUs have no penalty for subnormal inputs/results, while modern Intel CPUs (Sandybridge-family) handle some but not all FP operations on subnormal operands without needing a microcode assist (over 100 cycles, vs. ~10 to 20 for a branch miss). See Agner Fog's microarch PDF for some info; he mentions this in general without a fully detailed breakdown. I don't think uops.info tests for normal vs. subnormal unfortunately.Occupancy
Your example function will give inaccurate results well before y or z is subnormal (as soon as either variable squared is zero). Besides that, your question needs much more context (e.g., what platform? Are you concerned about throughput or latency?)Insomuch
I don't know the details for any non-x86 microarchitectures, like ARM cortex-a76 or any RISC-V to pick a couple random examples that might also be relevant. Mispredict penalties vary wildly as well, across simple in-order pipelines vs. deep OoO exec CPUs like modern x86. True mispredict penalty also depends on surrounding code.Occupancy
@Insomuch It was kind of a theoretical question, just for getting more knowledge, but if results can vary so widely, x86 microarchitectures will be my main use case.Ambrosio
@PeterCordes So the efficiency penalties for branch misprediction and denormal computations are more or less of the same order of magnitude in "number of cycles", so the difference highly depends on the actual source-code context?Ambrosio
instead of ret = sqrt(y * y + z * z); you can use ret = std::hypot(y, z); which avoids underflow and overflowDormitory
@Dormitory I guarantee hypot is a lot slower, as it has to perform an exact full-precision fma, and adding a lib function call will significantly slow down the surrounding code as it will force stack saving and restoring of the comm registers. Whatever penalty there is from subnormals will surely be less than the penalty of hypotTrunkfish
@JackG what do you mean by comm registers? Regarding FPU/SIMD registers, modern Linux doesn't do lazy restore anymore and always stores them on every context switch, because modern apps use them a lot even in non-float non-simd contexts for things like memcpy, memcmp, strchr, strlen... Even if you don't use them directly, the standard functions under the hood might use them indirectly. So your point is moot. Anyway, std::hypot definitely helps avoid denormalized values, which may be much slower than std::hypot itselfDormitory

There's HW support for this for free in many ISAs, including x86; see below re: FTZ / DAZ. Most compilers set those flags during startup when you compile with -ffast-math or equivalent.

Also note that your code fails to avoid the penalty (on HW where there is any) in some cases: y * y or z * z can be subnormal for small but normalized y or z (good catch, @chtz). The exponent of y*y is twice the exponent of y, more negative or more positive. With 23 explicit mantissa bits in a float, there are about 12 exponent values whose squares are subnormal rather than underflowing all the way to 0.

Squaring a subnormal always underflows to 0; subnormal input may be less likely to have a penalty than subnormal output for a multiply, but I don't know. Having a subnormal penalty or not can vary by operation within one microarchitecture, like add/sub vs. multiply vs. divide.
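
To make that concrete, here is a tiny standalone sketch (the values 1e-20f and 1e-40f are just illustrative picks, not anything from the question):

#include <cmath>
#include <cstdio>

int main()
{
   // 1e-20f is a perfectly normal float and is NOT flagged by the question's
   // y < 1e-37 check, yet its square (~1e-40f) lands in the subnormal range.
   float y = 1e-20f;
   float yy = y * y;
   std::printf("yy = %g, subnormal? %d\n", yy, std::fpclassify(yy) == FP_SUBNORMAL);

   // Squaring a subnormal always underflows all the way to 0.
   float s = 1e-40f;                  // subnormal literal
   std::printf("s*s = %g\n", s * s);  // prints 0
}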

Also, any negative y or z gets treated as 0, which is probably a bug unless your inputs are known non-negative.

if results can vary so widely, x86 microarchitectures will be my main use case

Yes, penalties (or lack thereof) vary greatly.

Historically (P6-family) Intel used to always take a very slow microcode assist for subnormal results and subnormal inputs, including for compares. Modern Intel CPUs (Sandybridge-family) handle some but not all FP operations on subnormal operands without needing a microcode assist. (perf event fp_assists.any)

The microcode assist is like an exception and flushes the out-of-order pipeline, and takes over 160 cycles on SnB-family, vs. ~10 to 20 for a branch miss. And branch misses have "fast recovery" on modern CPUs. True branch-miss penalty depends on surrounding code; e.g. if the branch condition is really late to be ready it can result in discarding a lot of later independent work. But a microcode assist is still probably worse if you expect it to happen frequently.

Note that you can check for a subnormal using integer ops: just check the exponent field for all zero (and the mantissa for non-zero: the all-zero encoding for 0.0 is technically a special case of a subnormal). So you could manually flush to zero with integer SIMD operations like andps/pcmpeqd/andps.
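
As a rough sketch of that bit-level check with SSE2 intrinsics (the function name is made up for illustration, and the last step uses andnot rather than a plain and to clear the flagged lanes):

#include <emmintrin.h>  // SSE2 integer ops on XMM registers

// Zero every lane whose exponent field is all-zero (subnormals and +/-0.0),
// using only bitwise and integer-compare operations.
static inline __m128 flush_subnormals_ps(__m128 v)
{
   const __m128i exp_mask = _mm_set1_epi32(0x7F800000);  // binary32 exponent bits
   __m128i bits = _mm_castps_si128(v);
   __m128i expf = _mm_and_si128(bits, exp_mask);         // isolate the exponent field
   __m128i is_tiny = _mm_cmpeq_epi32(expf, _mm_setzero_si128()); // all-ones where exp == 0
   return _mm_andnot_ps(_mm_castsi128_ps(is_tiny), v);   // clear those lanes
}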

Agner Fog's microarch PDF has some info; he mentions this in general without a fully detailed breakdown for each uarch. I don't think https://uops.info/ tests for normal vs. subnormal unfortunately.

Knight's Landing (KNL) only has subnormal penalties for division, not add / mul. Like GPUs, it took an approach that favoured throughput over latency, with enough pipeline stages in its FPU to handle subnormals in the hardware equivalent of branchlessly, even though this might mean higher latency for every FP operation.

AMD Bulldozer / Piledriver have a ~175 cycle penalty for results that are "subnormal or underflow", unless FTZ is set. Agner doesn't mention subnormal inputs. Steamroller/Excavator don't have any penalties.

AMD Ryzen (from Agner Fog's microarch pdf)

Floating point operations that give a subnormal result take a few clock cycles extra. The same is the case when a multiplication or division underflows to zero. This is far less than the high penalty on the Bulldozer and Piledriver. There is no penalty when flush-to-zero mode and denormals-are-zero mode are both on.

By contrast, Intel Sandybridge-family (at least Skylake) doesn't have penalties for results that underflow all the way to 0.0.

Intel Silvermont (Atom) from Agner Fog's microarch pdf

Operations that have subnormal numbers as input or output or generate underflow take approximately 160 clock cycles unless the flush-to-zero mode and denormals-are-zero mode are both used.

This would include compares.


I don't know the details for any non-x86 microarchitectures, like ARM cortex-a76 or any RISC-V to pick a couple random examples that might also be relevant. Mispredict penalties vary wildly as well, across simple in-order pipelines vs. deep OoO exec CPUs like modern x86. True mispredict penalty also depends on surrounding code.


And now assume I want to avoid the performance penalty of dealing with denormal numbers and I just want to treat them as 0

Then you should set your FPU to do that for you for free, removing all possibility of penalties from subnormals.

Some / most(?) modern FPUs (including x86 SSE but not legacy x87) let you treat subnormals (aka denormals) as zero for free, so this problem only occurs if you want this behaviour for some functions but not all within the same thread, with switching too fine-grained to be worth changing the FP control register to FTZ and back.

It could also be relevant if you wanted to write fully portable code that's not terrible anywhere, even if that meant ignoring HW support and thus being slower than it could be.

Some x86 CPUs even rename MXCSR, so changing the rounding mode or FTZ/DAZ might not have to drain the out-of-order back-end. It's still not cheap, and you'd want to avoid doing it every few FP instructions.

ARM also supports a similar feature: subnormal IEEE 754 floating point numbers support on iOS ARM devices (iPhone 4) - but apparently the default setting for ARM VFP / NEON is to treat subnormals as zero, favouring performance over strict IEEE compliance.

See also flush-to-zero behavior in floating-point arithmetic about cross-platform availability of this.


On x86 the specific mechanism is that you set the DAZ and FTZ bits in the MXCSR register (SSE FP math control register; also has bits for FP rounding mode, FP exception masks, and sticky FP masked-exception status bits). https://software.intel.com/en-us/articles/x87-and-sse-floating-point-assists-in-ia-32-flush-to-zero-ftz-and-denormals-are-zero-daz shows the layout and also discusses some performance effects on older Intel CPUs. Lots of good background / introduction.

Compiling with -ffast-math will link in some extra startup code that sets FTZ/DAZ before calling main. IIRC, threads inherit the MXCSR settings from the main thread on most OSes.

  • DAZ = Denormals Are Zero: treats input subnormals as zero. This affects compares (whether or not they would have experienced a slowdown), making it impossible to even tell the difference between 0 and a subnormal other than using integer stuff on the bit-pattern.
  • FTZ = Flush To Zero: subnormal outputs from calculations are just underflowed to zero, i.e. gradual underflow is disabled. (Note that multiplying two small normal numbers can underflow. I think add/sub of normal numbers whose mantissas cancel out except for the low few bits could produce a subnormal as well.)

Usually you simply set both or neither. If you're processing input data from another thread or process, or compile-time constants, you could still have subnormal inputs even if all results you produce are normalized or 0.
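
If you'd rather set the bits yourself than rely on the -ffast-math startup code, a minimal sketch using the standard SSE intrinsic headers might look like this (x86-specific, per-thread; the function names are just for illustration):

#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE, _mm_getcsr, _mm_setcsr
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE

// Set FTZ and DAZ for the current thread. MXCSR is per-thread state, so every
// thread that wants this behaviour has to set it (or inherit it at creation).
void enable_ftz_daz()
{
   _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // FTZ: subnormal results -> 0
   _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // DAZ: subnormal inputs read as 0
}

// If you only want it around one hot region, save and restore the whole MXCSR.
// As noted above, toggling it too often is still not cheap.
void ftz_daz_scoped_example()
{
   unsigned old_csr = _mm_getcsr();
   _mm_setcsr(old_csr | 0x8040);   // bit 15 = FTZ, bit 6 = DAZ
   // ... FP-heavy work that doesn't care about gradual underflow ...
   _mm_setcsr(old_csr);            // restore the previous mode
}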


Specific random questions:

float x = 0f; // Will x be just 0 or maybe some number like 1e-40;

This is a syntax error. Presumably you mean 0.f or 0.0f

0.0f is exactly representable (with the bit-pattern 0x00000000) as an IEEE binary32 float, so that's definitely what you will get on any platform that uses IEEE FP. You won't randomly get subnormals that you didn't write.

float z = x / 1; // Will this "no-op" (x == 0) cause z be something like 1e-40 and thus denormal?

No, IEEE754 doesn't allow 0.0 / 1.0 to give anything other than 0.0.

Again, subnormals don't appear out of thin air. Rounding "error" only happens when the exact result can't be represented as a float or double. The max allowed error for the IEEE "basic" operations (* / + - and sqrt) is 0.5 ulp, i.e. the exact result must be correctly rounded to the nearest representable FP value, right down to the last digit of the mantissa.

 bool yzero = y < 1e-37; // Do comparisons have any performance penalty when y is denormal, or not?

Maybe, maybe not. There's no penalty on recent AMD or Intel, but it is slow on Core 2, for example.

Note that 1e-37 has type double and will cause promotion of y to double. You might hope that this would actually avoid subnormal penalties vs. using 1e-37f. Subnormal float->int has no penalty on Core 2, but unfortunately cvtss2sd does still have the large penalty on Core 2. (GCC/clang don't optimize away the conversion even with -ffast-math, although I think they could, because 1e-37 is representable as a float and every subnormal float can be exactly represented as a normalized double. So the promotion to double is always exact and can't change the result.)
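
As an aside, if the goal is simply "is this below the normal range", one alternative to the magic 1e-37 double constant is to compare the magnitude against the smallest normalized float, which stays entirely in single precision (the helper name is mine, just for illustration):

#include <cmath>
#include <limits>

// True for zero and subnormals; no float->double promotion is involved.
// std::numeric_limits<float>::min() is the smallest *normalized* float
// (about 1.17549435e-38f), i.e. exactly the normal/subnormal boundary.
inline bool below_normal_range(float y)
{
   return std::fabs(y) < std::numeric_limits<float>::min();
}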

On Intel Skylake, comparing two subnormals with vcmplt_oqpd doesn't result in any slowdown, and not with ucomisd into integer FLAGS either. But on Core 2, both are slow.

Comparison, if done like subtraction, does have to shift the inputs to line up their binary place-values, and the implied leading digit of the mantissa is a 0 instead of 1, so subnormals are a special case. So hardware might choose not to handle that on the fast path and instead take a microcode assist. Older x86 hardware might handle this more slowly.

It could be done differently if you built a special compare ALU separate from the normal add/sub unit. Float bit-patterns can be compared as sign/magnitude integers (with a special case for NaN) because the IEEE exponent bias is chosen to make that work. (i.e. nextafter is just integer ++ or -- on the bit pattern). But this apparently isn't what hardware does.
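
To illustrate the bit-pattern point, here is a toy sketch restricted to finite, non-NaN floats with a clear sign bit (the name and restrictions are mine, purely for illustration):

#include <cstdint>
#include <cstring>

// For +0.0f and positive finite floats, the raw bit patterns order the same
// way as the values, so "next representable value up" is just an integer ++.
// This steps smoothly through the subnormal range and across the
// subnormal/normal boundary.
float next_up_nonnegative(float x)
{
   std::uint32_t bits;
   std::memcpy(&bits, &x, sizeof bits);   // inspect the bit pattern without UB
   ++bits;
   std::memcpy(&x, &bits, sizeof x);
   return x;
}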


FP conversion to integer is fast even on Core 2, though. cvt[t]ps2dq or the pd equivalent converts packed float/double to int32 with truncation or the current rounding mode. So, for example, this recently proposed LLVM optimization is safe on Skylake and Core 2, according to my testing.

Also on Skylake, squaring a subnormal (producing a 0) has no penalty. But it does have a huge penalty on Conroe (P6-family).

But multiplying normal numbers to produce a subnormal result has a penalty even on Skylake (~150x slower).

Occupancy answered 1/4, 2020 at 12:24 Comment(6)
So, theoretically speaking, and after reading the two principal facts there (100 cycles for a denormal versus 20-30 on average for a misprediction), plus the fact that comparing a denormal is itself a denormal op, the first version will always be faster unless both operands are denormals, in which case the first version will have 5 denormal operations while the second version has only 2. Also, the last three branches (if(!some_zero)...) are usually conditional moves, so I have no penalty there. Am I right here?Ambrosio
Ok, I forgot the sqrt, which is also a factor here, and avoiding them is only a win when one of the operands is denormal.Ambrosio
@Peregring-lk: if (!some_zero) ret = sqrt(y * y + z * z); can only be branchless if you actually compute that result! The whole point of this is to avoid doing those FP operations at all in case there are input subnormals. A compiler would likely transform your boolean-setting and if() operations into simpler branching, like at most 3 total, or maybe branchlessly choosing between y and z (e.g. legacy x87+P6 fcmov) then branch on them both non-zero. Note that true legacy x87 didn't have FP conditional moves. Branchless SSE math can be done with compare-into-mask and ANDPS/ORPS...Occupancy
@Peregring-lk: See also my last edit: maybe you missed that some CPUs have subnormal penalties for some operations (mul) but not others (add or compare). Possibly your simple model could work for earlier P6-family CPUs if any operation on a subnormal input always has a penalty. You seem to have raised my mispredict penalty cost from 10-20 to 20-30. It can be effectively cheaper in code that's not front-end bottlenecked, if the branch condition is ready nice and early... It's not simple to model on an OoO exec CPU. Performance isn't 1-dimensional, so you can't just add costs to get a total.Occupancy
Minor additions/comments: float x = 0f; is illegal, you need to write 0.f or 0e0f or something. And: bool yzero = y < 1e-37; this will likely convert y to double before comparing, this should probably better be y < 1e-37f.Insomuch
@chtz: I hoped float->double conversion might side-step the subnormal penalty by cheaply producing normalized doubles. It does not on Core 2, so yes, y < 1e-37f would be better, but still basically useless. Squaring a subnormal to produce 0.0f is no worse than doing a compare on some CPUs. Although perhaps not AMD, given what Agner says where even underflow to 0 is expensive. Anyway updated, and tested cvtps2pd on my SKL and Core 2. Fast on SKL, slow on Core 2 with subnormals.Occupancy
