A NOP is a separate instruction that has to decode and go through the pipeline on its own. It's better to pad instructions with prefixes to achieve the desired alignment rather than insert NOPs, as discussed in What methods can be used to efficiently extend instruction length on modern x86? (but only in ways that don't cause major stalls on CPUs which can't handle large numbers of prefixes).
Perhaps Intel considered it worth the effort for toolchains to do it this way for this case, since the padding would actually land inside inner loops, not just be a NOP outside an inner loop. (And tacking prefixes onto one previous instruction is relatively simple.)
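To make the difference concrete, here is an illustrative sketch (not from the original discussion) of the two ways to add one byte of padding around mov eax, ebx. The byte values are the standard x86 encodings; a redundant segment-override prefix lengthens the existing instruction, while a NOP adds a whole new one for the front-end to decode:

```cpp
// Two ways to add one byte of padding around mov eax, ebx (encoding 89 D8).
// A NOP is a separate instruction; a redundant DS segment-override prefix
// (3E) lengthens the mov itself, so it stays a single instruction.
static const unsigned char mov_plain[]  = { 0x89, 0xD8 };        // mov eax, ebx      : 1 insn, 2 bytes
static const unsigned char nop_padded[] = { 0x90, 0x89, 0xD8 };  // nop; mov eax, ebx : 2 insns, 3 bytes
static const unsigned char pfx_padded[] = { 0x3E, 0x89, 0xD8 };  // ds: mov eax, ebx  : 1 insn, 3 bytes
```

Both padded forms occupy the same three bytes in the instruction stream; the difference is only in how many instructions the decoders see.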
I now have a data point: the result of benchmarking /QIntel-jcc-erratum on an AMD FX 8300 is bad. The slowdown is about an order of magnitude for a specific benchmark, whereas the benefit on Intel Skylake for the same benchmark is about 20 percent. This aligns with Peter's comments:
I checked Agner Fog's microarch guide, and AMD Zen has no problem with any number of prefixes on a single instruction, like mainstream Intel since Core2. AMD Bulldozer-family has a "very large" penalty for decoding instructions with more than 3 prefixes, like 14-15 cycles for 4-7 prefixes.
It's somewhat valid to consider Bulldozer-family obsolete enough to not care much about it, although there are still some APU desktops and laptops around for sure, but they'd certainly show large regressions in loops where the compiler put 4 or more prefixes on one instruction inside a hot inner loop (including existing prefixes like REX or 66h). Much worse than the 3% for MITE legacy decode on SKL.
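To see why hot code can reach 4+ prefixes quickly: the erratum padding is added on top of whatever prefixes the instruction already carries (REX, 66h operand size, segment overrides, etc.). A minimal illustrative helper (my own sketch, not from the post) that counts the leading prefix bytes of one 64-bit-mode x86 encoding:

```cpp
#include <cstddef>

// Count legacy prefixes plus REX at the start of one x86-64 instruction
// encoding. Illustrative helper, not from the original post.
static std::size_t count_prefixes(const unsigned char* insn, std::size_t len)
{
    std::size_t n = 0;
    while (n < len) {
        unsigned char b = insn[n];
        bool legacy = (b == 0x66 || b == 0x67 ||              // operand/address size
                       b == 0x2E || b == 0x36 || b == 0x3E ||
                       b == 0x26 || b == 0x64 || b == 0x65 || // segment overrides
                       b == 0xF0 || b == 0xF2 || b == 0xF3);  // lock / rep
        bool rex = (b >= 0x40 && b <= 0x4F);                  // REX (64-bit mode only)
        if (!legacy && !rex)
            break;
        n++;
    }
    return n;
}
```

For example, a mov eax, ebx padded with three 3E prefixes on top of an existing 66h already counts 4 prefixes, which on Bulldozer-family falls into the 14-15 cycle decode-penalty range quoted above.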
Though Bulldozer-family is indeed obsolete-ish, I don't think I can afford this much of an impact, and I'm also wary of other CPUs that may choke on extra prefixes the same way. So my conclusion is not to use /QIntel-jcc-erratum for generally-targeted software, unless it is enabled only in specific translation units with dynamic dispatch into them, which is too much trouble most of the time.
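For completeness, the translation-unit dispatch idea could look roughly like the sketch below. All names here are hypothetical: copy_jcc_safe's TU would be compiled with /QIntel-jcc-erratum and copy_plain's without, and cpu_is_intel() is a stub standing in for a real CPUID vendor check:

```cpp
#include <cstddef>
#include <cstring>

// Stub: a real implementation would query CPUID for the "GenuineIntel"
// vendor string. Hardcoded false here so the sketch is self-contained.
static bool cpu_is_intel() { return false; }

// Hypothetical hot routine, built twice in separate translation units:
// one TU compiled with /QIntel-jcc-erratum, one without.
static void copy_plain(char* dst, const char* src, std::size_t n)    { std::memcpy(dst, src, n); }
static void copy_jcc_safe(char* dst, const char* src, std::size_t n) { std::memcpy(dst, src, n); }

using copy_fn = void (*)(char*, const char*, std::size_t);

// Select the variant once (e.g. at startup) and call through the pointer.
static copy_fn select_copy()
{
    return cpu_is_intel() ? copy_jcc_safe : copy_plain;
}
```

The indirection cost is one predicted indirect call, but maintaining two builds of every hot routine is exactly the "too much trouble" part.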
One thing that is probably safe to do on MSVC is to stop using the /Os flag. It was discovered that /Os at least:
- Avoids jump tables in favor of conditional jumps
- Avoids loop start padding
Try the following example (https://godbolt.org/z/jvezPd9jM):
void loop(int i, char a[], char b[])
{
    char* stop = a + i;
    while (a != stop) {
        *b++ = *a++;
    }
}
void jump_table(int i, char a[], char b[])
{
    switch (i)
    {
    case 7: a[6] = b[6];
    case 6: a[5] = b[5];
    case 5: a[4] = b[4];
    case 4: a[3] = b[3];
    case 3: a[2] = b[2];
    case 2: a[1] = b[1];
    case 1: a[0] = b[0];
    case 0: break;
    default: __assume(false);
    }
}
This makes the JCC performance issue occur more often: avoiding jump tables produces a series of JCC instructions, and skipping loop alignment means even small loops of less than 16 bytes sometimes touch the boundary.
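As a reference for "touching the boundary": the erratum affects a jump (or macro-fused cmp/jcc pair) whose encoding crosses a 32-byte boundary or ends exactly on one. An illustrative check (my own sketch, not from the post), given an instruction's start address and length:

```cpp
#include <cstdint>

// True if an instruction at [start, start+len) is affected by the JCC
// erratum condition: it crosses a 32-byte boundary or ends exactly on one.
// Illustrative helper, not from the original post.
static bool jcc_erratum_affected(std::uint64_t start, unsigned len)
{
    std::uint64_t end = start + len;               // one past the last byte
    bool crosses = (start / 32) != ((end - 1) / 32);
    bool ends_on = (end % 32) == 0;
    return crosses || ends_on;
}
```

This is why both the JCC-heavy code from suppressed jump tables and the unaligned small loops are exposed: the more branch bytes sit near a 32-byte line, the more likely one of them satisfies this condition.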
xchg ax,ax is a two-byte NOP: godbolt.org/z/399nc5Msq. I can't find a GCC option for this; apparently it does not exist and GCC does not mitigate this issue. – Dyslexia

That would be a GAS option (one you could get GCC to use with -Wa,-...). GCC doesn't know instruction sizes, it only prints text; that's why it needs GAS to support stuff like .p2align 4,,10 (align by 16 if that will take fewer than 10 bytes of padding) to implement the alignment heuristics it wants to use. (Often followed by .p2align 3 to unconditionally align by 8.) – Areca

/Os optimizes for smaller code size. Among other things, it suppresses jump-table generation and loop alignment, so it invites this issue more often. Apparently there are no documented options in MSVC for finer control over each size/speed tradeoff. The good part is that removing /Os is not expected to be harmful on AMD. – Dyslexia

Interesting that /Os tends to do worse. (A single indirect jmp is better than a sequence of cmp/jcc for avoiding this, though, and for reducing the necessary front-end throughput in case you do hit it. And modern CPUs have good indirect branch prediction.) – Areca