Intel JCC Erratum - what is the effect of prefixes used for mitigation?

Intel recommends using instruction prefixes to mitigate the performance consequences of the JCC erratum.

MSVC, when invoked with /QIntel-jcc-erratum, follows the recommendation and pads instructions with redundant prefixes, like this:

3E 3E 3E 3E 3E 3E 3E 3E 3E 48 8B C8   mov rcx,rax ; with redundant 3E prefixes

Reportedly, MSVC falls back to NOPs when prefixes cannot be used.

Clang has the -mbranches-within-32B-boundaries option for this, and it prefers NOPs, multi-byte if needed (https://godbolt.org/z/399nc5Msq; notice the xchg ax, ax).

What are the consequences of 3E prefixes, specifically:

  • Why does Intel recommend this, and not multi-byte NOPs?
  • What are the consequences for unaffected CPUs?
  • Reportedly, a program runs faster with /QIntel-jcc-erratum on AMD, what could be possible explanations?
Dyslexia answered 3/12, 2021 at 15:27 Comment(12)
A NOP is a separate instruction that has to be decoded and go through the pipeline separately; see What methods can be used to efficiently extend instruction length on modern x86? You should always pad instructions with prefixes to achieve the desired alignment, not insert NOPs. Intel probably considered it worth the effort since this padding would actually be inside inner loops, not just a NOP outside an inner loop. - Areca
But note that some CPUs don't efficiently decode more than 3 prefixes on one instruction, so that might be why this strategy for JCC-erratum mitigation isn't on by default. You'd want to distribute the padding over multiple previous instructions to avoid bottlenecks on Silvermont-family such as Gracemont (e.g. Alder Lake E-cores which have suddenly made that family a lot more mainstream-relevant). I forget about AMD decode limits. - Areca
Clang prefers NOPs; here xchg ax,ax is a two-byte NOP: godbolt.org/z/399nc5Msq. I can't find a GCC option; apparently it does not exist and GCC does not mitigate this issue. - Dyslexia
IIRC, the GNU toolchain does mitigation in the assembler, so look for an as option (that you could get GCC to use with -Wa,-...). GCC doesn't know instruction sizes, it only prints text. That's why it needs GAS to support stuff like .p2align 4,,10 to align by 16 if that will take fewer than 10 bytes of padding, to implement the alignment heuristics it wants to use. (Often followed by .p2align 3 to unconditionally align by 8.) - Areca
Found a blog post where they mention the size impact as 3% and the perf impact as negligible: devblogs.microsoft.com/cppblog/jcc-erratum-mitigation-in-msvc - Dyslexia
That blog is saying that on affected CPUs (I think only Intel Skylake-family), using the compiler option makes the performance about the same as before the microcode update without the compiler option. It's not saying anything about its impact on other CPUs, like Silvermont/Goldmont. (I checked Agner Fog's microarch guide, and AMD Zen has no problem with any number of prefixes on a single instruction, like mainstream Intel since Core2. AMD Bulldozer-family has a "very large" penalty for decoding instructions with more than 3 prefixes, like 14-15 cycles for 4-7 prefixes.) - Areca
They say: "We also measured that the performance impact of /QIntel-jcc-erratum switch on processors that are not affected by the erratum is negligible. However, as codebases vary greatly, we advise developers to evaluate the impact of /QIntel-jcc-erratum in the context of their applications and workloads." However, the set of unaffected processors they tested is not specified and may be very limited. Also: "The content of this blog was provided by Gautham Beeraka from Intel Corporation." - Dyslexia
Oh, I missed that part. I wouldn't be surprised if they only measured on other mainstream Intel CPUs, though, not AMD at all. (Given that "we" is Intel.) It's somewhat valid to consider Bulldozer-family obsolete enough to not care much about it, although there are still some APU desktops and laptops around for sure, but they'd certainly show large regressions in loops where the compiler put 4 or more prefixes on one instruction inside a hot inner loop (including existing prefixes like REX or 66h). Much worse than the 3% for MITE legacy decode on SKL. - Areca
Agner says Goldmont / Goldmont can handle instructions with up to 4 prefixes, so that's a bit more capable than Silvermont. So perhaps there's been more improvement in Tremont, or at least in Alder Lake E-cores (Gracemont), if they tested those microarchitectures at all. - Areca
@PeterCordes, I've expanded my answer to add another related MSVC option. /Os optimizes for smaller code size. Among other things, it suppresses jump table generation and loop alignment, so it invites this issue to happen more often. Apparently there are no documented options in MSVC for finer control over each size/speed tradeoff. The good part is that removing /Os is not expected to be harmful on AMD. - Dyslexia
Aligning the tops of loops (by 8 or 16) isn't super closely related to whether the conditional branch at the bottom touches a 32B boundary. IDK what the distribution of loop sizes is in typical programs. If you've seen in practice that /Os tends to do worse, that's interesting. (A single indirect jmp is better than a sequence of cmp/jcc for avoiding this, though, and for reducing the necessary front-end throughput in case you do hit it. And modern CPUs have good indirect branch prediction.) - Areca
I've noticed this for some amount of very small loops, beyond 16b, which are not prevalent, but mostly affected by the issue. - Dyslexia

A NOP is a separate instruction that has to be decoded and go through the pipeline separately. It's generally better to pad instructions with prefixes to achieve the desired alignment than to insert NOPs, as discussed in What methods can be used to efficiently extend instruction length on modern x86? (but only in ways that don't cause major stalls on CPUs which can't handle large numbers of prefixes).

Perhaps Intel considered it worth the effort for toolchains to do it this way for this case since this would actually be inside inner loops, not just a NOP outside an inner loop. (And tacking on prefixes to one previous instruction is relatively simple.)


I now have a data point: the result of benchmarking /QIntel-jcc-erratum on an AMD FX 8300 is bad.

For one specific benchmark the slowdown is about an order of magnitude, whereas the benefit on Intel Skylake for the same benchmark is about 20 percent. This agrees with Peter's comments:

I checked Agner Fog's microarch guide, and AMD Zen has no problem with any number of prefixes on a single instruction, like mainstream Intel since Core2. AMD Bulldozer-family has a "very large" penalty for decoding instructions with more than 3 prefixes, like 14-15 cycles for 4-7 prefixes

It's somewhat valid to consider Bulldozer-family obsolete enough to not care much about it, although there are still some APU desktops and laptops around for sure, but they'd certainly show large regressions in loops where the compiler put 4 or more prefixes on one instruction inside a hot inner loop (including existing prefixes like REX or 66h). Much worse than the 3% for MITE legacy decode on SKL.
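
To make the "4 or more prefixes" condition concrete, here is a minimal sketch that counts the prefix bytes on the MSVC-padded mov shown in the question. The helper names (is_legacy_prefix, is_rex, count_prefixes) are made up for illustration, and counting REX as one more prefix is an assumption taken from the comment above, not something I have verified against the decoders.

```cpp
#include <cstddef>

// True for the x86 legacy prefix bytes: LOCK, REPNE/REP, the six segment
// overrides (including 3E), operand-size (66) and address-size (67).
constexpr bool is_legacy_prefix(unsigned char b) {
    return b == 0xF0 || b == 0xF2 || b == 0xF3 ||
           b == 0x2E || b == 0x36 || b == 0x3E || b == 0x26 ||
           b == 0x64 || b == 0x65 || b == 0x66 || b == 0x67;
}

// In 64-bit mode, 40h..4Fh is a REX prefix; it must come last, directly
// before the opcode.
constexpr bool is_rex(unsigned char b) {
    return (b & 0xF0) == 0x40;
}

// Counts leading legacy prefixes plus an optional REX prefix.
// Assumes 64-bit mode and that p points at the start of one instruction.
constexpr std::size_t count_prefixes(const unsigned char* p, std::size_t n) {
    return n == 0               ? 0
         : is_legacy_prefix(*p) ? 1 + count_prefixes(p + 1, n - 1)
         : is_rex(*p)           ? 1
                                : 0;
}

// The padded instruction from the question: nine 3E prefixes, REX.W (48),
// then the opcode and ModRM bytes of mov rcx,rax.
constexpr unsigned char padded_mov[] = {
    0x3E, 0x3E, 0x3E, 0x3E, 0x3E, 0x3E, 0x3E, 0x3E, 0x3E,
    0x48, 0x8B, 0xC8
};

static_assert(count_prefixes(padded_mov, sizeof padded_mov) == 10,
              "nine 3E prefixes plus one REX prefix");
```

Under that counting, the example instruction carries 10 prefixes, well past the limit of 3 quoted for Bulldozer-family.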

Though Bulldozer-family is indeed obsolete-ish, I don't think I can afford this much of an impact. I'm also afraid of other CPUs that may choke on extra prefixes in the same way. So my conclusion is not to use /QIntel-jcc-erratum for generally-targeted software, unless it is enabled only in specific translation units that are reached via dynamic dispatch, which is too much trouble most of the time.
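
For what the dynamic dispatch could look like, here is a rough sketch only. The policy (skip the prefix-padded build on AMD family 15h, i.e. Bulldozer-family) is my own assumption for illustration and does not cover other CPUs with prefix-decoding limits such as Silvermont. The names display_family, avoid_prefix_padding, kernel_fn and select_kernel are hypothetical.

```cpp
#include <cstdint>

// Display family from CPUID leaf 1 EAX: the base family is bits 11:8; the
// extended family (bits 27:20) is added only when the base family is 0xF.
constexpr unsigned display_family(std::uint32_t eax) {
    return ((eax >> 8) & 0xFu) == 0xFu
               ? 0xFu + ((eax >> 20) & 0xFFu)
               : (eax >> 8) & 0xFu;
}

// Hypothetical policy (an assumption, not a vendor recommendation):
// avoid the prefix-padded build on AMD family 15h (Bulldozer-family).
constexpr bool avoid_prefix_padding(bool vendor_is_amd, std::uint32_t eax) {
    return vendor_is_amd && display_family(eax) == 0x15u;
}

using kernel_fn = void (*)(int, char*, char*);

// Hypothetical selector: 'padded' and 'plain' are the same source compiled
// in two translation units, with and without /QIntel-jcc-erratum.
constexpr kernel_fn select_kernel(bool vendor_is_amd, std::uint32_t eax,
                                  kernel_fn padded, kernel_fn plain) {
    return avoid_prefix_padding(vendor_is_amd, eax) ? plain : padded;
}
```

With MSVC, the EAX value and the vendor string would presumably come from the __cpuid intrinsic in <intrin.h>, queried once at startup, with the selected pointer cached for later calls.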


One thing that is probably safe to do with MSVC is to stop using the /Os flag. I found that /Os at least:

  • Avoids jump tables in favor of conditional jumps
  • Avoids loop start padding

Try the following example (https://godbolt.org/z/jvezPd9jM):

void loop(int i, char a[], char b[])
{
    char* stop = a + i;
    while (a != stop){
        *b++ = *a++;
    }
}

void jump_table(int i, char a[], char b[])
{
    switch (i)
    {
        case 7: a[6] = b[6]; // every case falls through to the next on purpose
        case 6: a[5] = b[5];
        case 5: a[4] = b[4];
        case 4: a[3] = b[3];
        case 3: a[2] = b[2];
        case 2: a[1] = b[1];
        case 1: a[0] = b[0];
        case 0: break;
        default: __assume(false); // MSVC-specific hint: no other value occurs
    }
}

Both effects of /Os make code run into the JCC performance issue more often: avoiding jump tables produces a series of JCCs, and avoiding alignment means that small loops of less than 16 bytes also sometimes touch the boundary.
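
As a sketch of what "touching the boundary" means here, assuming the condition as I understand it from Intel's guidance (a jump is affected when it crosses a 32-byte boundary or ends on one), the check can be written as below. The function name touches_32b_boundary is made up for illustration; for a macro-fused cmp+jcc pair, I believe the start and length of the whole pair are what matter.

```cpp
#include <cstdint>

// True when an instruction of 'length' bytes starting at address 'start'
// either crosses a 32-byte boundary or ends exactly on one. Both cases
// mean that the byte following the instruction lies in a later 32-byte
// block than the first byte. Assumes length > 0.
constexpr bool touches_32b_boundary(std::uint64_t start, std::uint64_t length) {
    return (start / 32) != ((start + length) / 32);
}

// A 2-byte jcc at offset 30 occupies bytes 30..31 and ends on the boundary.
static_assert(touches_32b_boundary(30, 2), "ends on a 32-byte boundary");
// At offset 31 it occupies bytes 31..32 and crosses the boundary.
static_assert(touches_32b_boundary(31, 2), "crosses a 32-byte boundary");
// At offset 28 it occupies bytes 28..29 and is unaffected.
static_assert(!touches_32b_boundary(28, 2), "stays inside one block");
```

This only classifies a single instruction placement; it says nothing about how often a compiler produces such placements with or without /Os.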

Dyslexia answered 12/12, 2021 at 16:21 Comment(0)
