A NOP is a separate instruction that has to decode and go through the pipeline on its own. It's better to pad instructions with prefixes to achieve the desired alignment rather than insert NOPs, as discussed in What methods can be used to efficiently extend instruction length on modern x86? (but only in ways that don't cause major stalls on CPUs which can't handle large numbers of prefixes).
Perhaps Intel considered it worth the effort for toolchains to do it this way for this case, since the padding would actually land inside inner loops, not just be a NOP outside an inner loop. (And tacking prefixes onto one previous instruction is relatively simple.)
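To make the difference concrete, here is an illustrative sketch (not from the original discussion) of the two ways to add one byte of padding around mov eax, ebx. The byte values are the standard x86 encodings; a redundant segment-override prefix lengthens the existing instruction, while a NOP adds a whole new one for the front-end to decode:

```cpp
// Two ways to add one byte of padding around mov eax, ebx (encoding 89 D8).
// A NOP is a separate instruction; a redundant DS segment-override prefix
// (3E) lengthens the mov itself, so it stays a single instruction.
static const unsigned char mov_plain[]  = { 0x89, 0xD8 };        // mov eax, ebx      : 1 insn, 2 bytes
static const unsigned char nop_padded[] = { 0x90, 0x89, 0xD8 };  // nop; mov eax, ebx : 2 insns, 3 bytes
static const unsigned char pfx_padded[] = { 0x3E, 0x89, 0xD8 };  // ds: mov eax, ebx  : 1 insn, 3 bytes
```

Both padded forms occupy the same three bytes in the instruction stream; the difference is only in how many instructions the decoders see.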
I now have a data point: the result of benchmarking /QIntel-jcc-erratum on an AMD FX 8300 is bad. The slowdown is about an order of magnitude for a specific benchmark, whereas the benefit on Intel Skylake for the same benchmark is about 20 percent. This aligns with Peter's comments:
I checked Agner Fog's microarch guide, and AMD Zen has no problem with any number of prefixes on a single instruction, like mainstream Intel since Core2. AMD Bulldozer-family has a "very large" penalty for decoding instructions with more than 3 prefixes, like 14-15 cycles for 4-7 prefixes.
It's somewhat valid to consider Bulldozer-family obsolete enough to not care much about it, although there are still some APU desktops and laptops around for sure, but they'd certainly show large regressions in loops where the compiler put 4 or more prefixes on one instruction inside a hot inner loop (including existing prefixes like REX or 66h). Much worse than the 3% for MITE legacy decode on SKL.
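To see why hot code can reach 4+ prefixes quickly: the erratum padding is added on top of whatever prefixes the instruction already carries (REX, 66h operand size, segment overrides, etc.). A minimal illustrative helper (my own sketch, not from the post) that counts the leading prefix bytes of one 64-bit-mode x86 encoding:

```cpp
#include <cstddef>

// Count legacy prefixes plus REX at the start of one x86-64 instruction
// encoding. Illustrative helper, not from the original post.
static std::size_t count_prefixes(const unsigned char* insn, std::size_t len)
{
    std::size_t n = 0;
    while (n < len) {
        unsigned char b = insn[n];
        bool legacy = (b == 0x66 || b == 0x67 ||              // operand/address size
                       b == 0x2E || b == 0x36 || b == 0x3E ||
                       b == 0x26 || b == 0x64 || b == 0x65 || // segment overrides
                       b == 0xF0 || b == 0xF2 || b == 0xF3);  // lock / rep
        bool rex = (b >= 0x40 && b <= 0x4F);                  // REX (64-bit mode only)
        if (!legacy && !rex)
            break;
        n++;
    }
    return n;
}
```

For example, a mov eax, ebx padded with three 3E prefixes on top of an existing 66h already counts 4 prefixes, which on Bulldozer-family falls into the 14-15 cycle decode-penalty range quoted above.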
Though Bulldozer-family is indeed obsolete-ish, I don't think I can afford this much of an impact, and I'm also wary of other CPUs that may choke on extra prefixes the same way. So my conclusion is not to use /QIntel-jcc-erratum for generally-targeted software, unless it is enabled only in specific translation units with dynamic dispatch into them, which is too much trouble most of the time.
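For completeness, the translation-unit dispatch idea could look roughly like the sketch below. All names here are hypothetical: copy_jcc_safe's TU would be compiled with /QIntel-jcc-erratum and copy_plain's without, and cpu_is_intel() is a stub standing in for a real CPUID vendor check:

```cpp
#include <cstddef>
#include <cstring>

// Stub: a real implementation would query CPUID for the "GenuineIntel"
// vendor string. Hardcoded false here so the sketch is self-contained.
static bool cpu_is_intel() { return false; }

// Hypothetical hot routine, built twice in separate translation units:
// one TU compiled with /QIntel-jcc-erratum, one without.
static void copy_plain(char* dst, const char* src, std::size_t n)    { std::memcpy(dst, src, n); }
static void copy_jcc_safe(char* dst, const char* src, std::size_t n) { std::memcpy(dst, src, n); }

using copy_fn = void (*)(char*, const char*, std::size_t);

// Select the variant once (e.g. at startup) and call through the pointer.
static copy_fn select_copy()
{
    return cpu_is_intel() ? copy_jcc_safe : copy_plain;
}
```

The indirection cost is one predicted indirect call, but maintaining two builds of every hot routine is exactly the "too much trouble" part.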
One thing that is probably safe to do on MSVC is to stop using the /Os flag. It was discovered that /Os at least:
- Avoids jump tables in favor of conditional jumps
- Avoids loop start padding
Try the following example (https://godbolt.org/z/jvezPd9jM):
void loop(int i, char a[], char b[])
{
    char* stop = a + i;
    while (a != stop) {
        *b++ = *a++;
    }
}
void jump_table(int i, char a[], char b[])
{
    switch (i)
    {
    case 7: a[6] = b[6];
    case 6: a[5] = b[5];
    case 5: a[4] = b[4];
    case 4: a[3] = b[3];
    case 3: a[2] = b[2];
    case 2: a[1] = b[1];
    case 1: a[0] = b[0];
    case 0: break;
    default: __assume(false);
    }
}
This makes the JCC performance issue occur more often: avoiding jump tables produces a series of JCC instructions, and skipping loop alignment means even small loops of less than 16 bytes sometimes touch the boundary.
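As a reference for "touching the boundary": the erratum affects a jump (or macro-fused cmp/jcc pair) whose encoding crosses a 32-byte boundary or ends exactly on one. An illustrative check (my own sketch, not from the post), given an instruction's start address and length:

```cpp
#include <cstdint>

// True if an instruction at [start, start+len) is affected by the JCC
// erratum condition: it crosses a 32-byte boundary or ends exactly on one.
// Illustrative helper, not from the original post.
static bool jcc_erratum_affected(std::uint64_t start, unsigned len)
{
    std::uint64_t end = start + len;               // one past the last byte
    bool crosses = (start / 32) != ((end - 1) / 32);
    bool ends_on = (end % 32) == 0;
    return crosses || ends_on;
}
```

This is why both the JCC-heavy code from suppressed jump tables and the unaligned small loops are exposed: the more branch bytes sit near a 32-byte line, the more likely one of them satisfies this condition.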
xchg ax,ax is a two-byte NOP: godbolt.org/z/399nc5Msq. I can't find a GCC option for this; apparently it does not exist and GCC does not mitigate this issue. – Dyslexia

That would be a GAS option (one you could get GCC to use with -Wa,-...). GCC doesn't know instruction sizes, it only prints text; that's why it needs GAS to support stuff like .p2align 4,,10 (align by 16 if that will take fewer than 10 bytes of padding) to implement the alignment heuristics it wants to use. (Often followed by .p2align 3 to unconditionally align by 8.) – Areca

/Os optimizes for smaller code size. Among other things, it suppresses jump-table generation and loop alignment, so it invites this issue more often. Apparently there are no documented options in MSVC for finer control over each size/speed tradeoff. The good part is that removing /Os is not expected to be harmful on AMD. – Dyslexia

Interesting that /Os tends to do worse. (A single indirect jmp is better than a sequence of cmp/jcc for avoiding this, though, and for reducing the necessary front-end throughput in case you do hit it. And modern CPUs have good indirect branch prediction.) – Areca