How does mtune actually work?
There's this related question: GCC: how is march different from mtune?

However, the existing answers don't go much further than the GCC manual itself. At most, we get:

If you use -mtune, then the compiler will generate code that works on any of them, but will favour instruction sequences that run fastest on the specific CPU you indicated.

and

The -mtune=Y option tunes the generated code to run faster on Y than on other CPUs it might run on.

But exactly how does GCC favor one specific architecture when building, while still producing a build capable of running on other (usually older) architectures, albeit more slowly?

I only know of one mechanism (but I'm no computer scientist) that would be capable of this, and that's a CPU dispatcher. However, it doesn't seem (to me) that mtune generates a dispatcher behind the scenes; some other mechanism is probably in effect.

I feel that way for two reasons:

  1. Searching "gcc mtune cpu dispatcher" doesn't find anything relevant; and
  2. If it were based on a dispatcher, I think it could be smarter (even if via some option other than mtune) and test cpuid at runtime to detect supported instructions, instead of relying on a named architecture provided at build time.

So how does it work really?

Parricide answered 12/6, 2017 at 1:42 Comment(1)
@yugr it is definitely not a dupe. The question you linked, as well as the question the OP himself linked deal with understanding march vs mtune. While those questions show what mtune promises, this question specifically asks what the compiler can do to fulfill those promises.Juetta

-mtune doesn't create a dispatcher, and it doesn't need one: we are already telling the compiler which architecture we are targeting.

From the GCC docs:

-mtune=cpu-type

        Tune to cpu-type everything applicable about the generated code, except for the ABI and the
        set of available instructions.

This means that GCC won't use instructions available only on cpu-type¹, but it will generate code that runs optimally on cpu-type.

To understand this last statement, it is necessary to understand the difference between architecture and micro-architecture.
The architecture implies an ISA (Instruction Set Architecture), and that is not influenced by -mtune.
The micro-architecture is how the architecture is implemented in hardware. For the same instruction set (read: architecture), a code sequence may run optimally on one CPU (read: micro-architecture) but not on another due to internal details of the implementation. This can go as far as a code sequence being optimal on only one micro-architecture.

When generating machine code, GCC often has a degree of freedom in choosing how to order instructions and which variants to use.
It uses heuristics to generate a sequence of instructions that runs fast on the most common CPUs; sometimes it will sacrifice a 100% optimal solution for CPU x if that solution would penalise CPUs y, z, and w.

When we use -mtune=x we are fine-tuning the output of GCC for CPU x, thereby producing code that is 100% optimal (from GCC's perspective) on that CPU.

As a concrete example consider how this code is compiled:

float bar(float a[4], float b[4])
{
    for (int i = 0; i < 4; i++)
    {
        a[i] += b[i];
    }

    float r=0;

    for (int i = 0; i < 4; i++)
    {
        r += a[i];
    }

    return r;
} 

The a[i] += b[i]; loop is vectorised (if the vectors don't overlap) differently when targeting Skylake than when targeting Core2:

Skylake

    movups  xmm0, XMMWORD PTR [rsi]
    movups  xmm2, XMMWORD PTR [rdi]
    addps   xmm0, xmm2
    movups  XMMWORD PTR [rdi], xmm0
    movss   xmm0, DWORD PTR [rdi] 

Core2

    pxor    xmm0, xmm0
    pxor    xmm1, xmm1
    movlps  xmm0, QWORD PTR [rdi]
    movlps  xmm1, QWORD PTR [rsi]
    movhps  xmm1, QWORD PTR [rsi+8]
    movhps  xmm0, QWORD PTR [rdi+8]
    addps   xmm0, xmm1
    movlps  QWORD PTR [rdi], xmm0
    movhps  QWORD PTR [rdi+8], xmm0
    movss   xmm0, DWORD PTR [rdi]

The main difference is how an xmm register is loaded: on Core2 it is loaded with two loads, movlps and movhps, instead of a single movups.
The two-load approach is better on the Core2 micro-architecture; if you take a look at Agner Fog's instruction tables you'll see that movups is decoded into 4 uops and has a latency of 2 cycles, while each movXps is 1 uop and 1 cycle of latency.
This is probably because unaligned 128-bit accesses were split into two 64-bit accesses at the time.
On Skylake the opposite is true: movups performs better than two movXps.

So the compiler has to pick one.
In general, GCC picks the first variant because Core2 is an old micro-architecture, but we can override this with -mtune.
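The "if the vectors don't overlap" caveat above is worth spelling out: with plain pointers, GCC must either prove non-overlap or guard the vectorised path with a runtime aliasing check. A variant using restrict (a sketch; the name bar_restrict is mine, and the exact assembly still depends on the GCC version and the -mtune value) lets the compiler assume the arrays never alias:

```c
/* Same computation as bar(), but restrict promises the compiler that
   a and b never overlap, so the vectorised path needs no runtime check. */
float bar_restrict(float *restrict a, const float *restrict b)
{
    for (int i = 0; i < 4; i++)
        a[i] += b[i];          /* candidate for SSE vectorisation */

    float r = 0;
    for (int i = 0; i < 4; i++)
        r += a[i];             /* horizontal sum of the updated array */

    return r;
}
```

With {1, 2, 3, 4} and {1, 1, 1, 1} as inputs, a becomes {2, 3, 4, 5} and the function returns 14.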


¹ The instruction set is selected with other switches, such as -march.

Encephalography answered 12/6, 2017 at 13:49 Comment(7)
This shows how important experienced programmers are on this site. The explanation is spot on and your example is worth a thousand words. I don't usually leave +1 comments, but this truly deserves a "Great job!". Thank you!Juetta
Fantastic. So basically, mtune will not put instructions which are exclusive to the given architecture anywhere in the code... and exactly the same machine code will be executed no matter what cpu it's being run on... And the trick is to choose, among the available instructions (the "lowest common denominator", which are generic unless march is also used), which sequence of instructions are best implemented in terms of hardware in the CPU we specify?Parricide
@Marc.2377, it's not about instruction exclusivity, you can have 2 micro-architectures support the same ISA, but have them optimized differently, so for example a simple scalar addition would best be achieved with an add instruction on one, but a lea on the other (ignoring side effects for a second). The compiler would therefore pick the actual instruction based on the optimization target requested by -mtune. P.S. - great answer indeed!Razzia
@Parricide - more or less, but mtune doesn't actually preclude the use of a dispatcher. They are more or less orthogonal. As Margaret explains, mtune=X means "use the machine model for X to make optimization decisions", but still create code that runs based on the march argument. You can kind of imagine that mtune and march always have some value: even if they aren't specified on the command line, they take a default. So some compilers (uncommonly) and libraries (commonly) like to use dispatch-based code, and dispatch can still occur if you specify mtune.Leund
Keep in mind also that mtune can indirectly affect the CPUs that your code runs on, depending on march. For example, you can have a -march value of Y (explicit or implicit), which means that the compiler is allowed to generate code that runs on Y and later architecture Z (and so on) but may not run on earlier arch X. Note "may" - it may be the case that the code runs just fine on X because the compiler never wanted to use any instruction that is present on Y but not X. Specifying -mtune=blah, however, can change that and suddenly it doesn't run on X.Leund
gcc should use movsd instead of pxor + movlps to load the 64-bit low half and zero the upper half. Silly compiler :( Nice choice of example, though. Unaligned loads becoming cheap in more recent CPUs (and free when the data happen to be aligned) is an interesting thing. But Core2 doesn't just split 128-bit accesses. movaps is 1 uop. It's just that unaligned loads didn't have as much hardware support, so they always used multiple uops and couldn't be efficient in case the data did happen to be aligned at runtime. With more load-port hardware, they can be 1 uop in NHM and later.Orlena
@PeterCordes, great point, I added aligned_float and restrict which greatly cleans up the assembly and shows both solutions for core2 godbolt.org/z/DvvAg_Barrera
