How does mtune actually work?
There's this related question: GCC: how is march different from mtune?

However, the existing answers don't go much further than the GCC manual itself. At most, we get:

If you use -mtune, then the compiler will generate code that works on any of them, but will favour instruction sequences that run fastest on the specific CPU you indicated.

and

The -mtune=Y option tunes the generated code to run faster on Y than on other CPUs it might run on.

But exactly how does GCC favor one specific architecture when building, while still producing a build capable of running on other (usually older) architectures, albeit more slowly?

I only know of one mechanism (but I'm no computer scientist) that would be capable of this, and that's a CPU dispatcher. However, it doesn't seem (to me) that mtune generates a dispatcher behind the scenes; some other mechanism is probably in effect.

I feel that way for two reasons:

  1. Searching "gcc mtune cpu dispatcher" doesn't find anything relevant; and
  2. If it were based on a dispatcher, I think it could be smarter (even if via some option other than mtune) and test cpuid at runtime to detect supported instructions, instead of relying on a named architecture provided at build time.

So how does it work really?

Parricide answered 12/6, 2017 at 1:42 Comment(1)
@yugr it is definitely not a dupe. The question you linked, as well as the question the OP himself linked deal with understanding march vs mtune. While those questions show what mtune promises, this question specifically asks what the compiler can do to fulfill those promises.Juetta

-mtune doesn't create a dispatcher, and it doesn't need one: we are already telling the compiler which architecture we are targeting.

From the GCC docs:

-mtune=cpu-type

        Tune to cpu-type everything applicable about the generated code, except for the ABI and the
        set of available instructions.

This means that GCC won't use instructions available only on cpu-type¹, but it will generate code that runs optimally on cpu-type.

To understand this last statement, it is necessary to understand the difference between architecture and micro-architecture.
The architecture implies an ISA (Instruction Set Architecture), and that is not influenced by -mtune.
The micro-architecture is how the architecture is implemented in hardware. For the same instruction set (read: architecture), a code sequence may run optimally on one CPU (read: micro-architecture) but not on another due to internal details of the implementation. This can go as far as a code sequence being optimal on only one micro-architecture.

When generating machine code, GCC often has a degree of freedom in choosing how to order instructions and which variants to use.
It uses heuristics to generate a sequence of instructions that runs fast on the most common CPUs; sometimes it will sacrifice a 100% optimal solution for CPU x if that solution would penalise CPUs y, z, and w.

When we use -mtune=x we are fine-tuning the output of GCC for CPU x, thereby producing code that is 100% optimal (from GCC's perspective) on that CPU.

As a concrete example consider how this code is compiled:

float bar(float a[4], float b[4])
{
    for (int i = 0; i < 4; i++)
    {
        a[i] += b[i];
    }

    float r=0;

    for (int i = 0; i < 4; i++)
    {
        r += a[i];
    }

    return r;
} 

The a[i] += b[i]; loop is vectorised (if the vectors don't overlap) differently when targeting Skylake than when targeting Core2:

Skylake

    movups  xmm0, XMMWORD PTR [rsi]
    movups  xmm2, XMMWORD PTR [rdi]
    addps   xmm0, xmm2
    movups  XMMWORD PTR [rdi], xmm0
    movss   xmm0, DWORD PTR [rdi] 

Core2

    pxor    xmm0, xmm0
    pxor    xmm1, xmm1
    movlps  xmm0, QWORD PTR [rdi]
    movlps  xmm1, QWORD PTR [rsi]
    movhps  xmm1, QWORD PTR [rsi+8]
    movhps  xmm0, QWORD PTR [rdi+8]
    addps   xmm0, xmm1
    movlps  QWORD PTR [rdi], xmm0
    movhps  QWORD PTR [rdi+8], xmm0
    movss   xmm0, DWORD PTR [rdi]

The main difference is how an xmm register is loaded: on Core2 it is loaded with two loads, movlps and movhps, instead of a single movups.
The two-load approach is better on the Core2 micro-architecture; if you take a look at Agner Fog's instruction tables you'll see that movups is decoded into 4 uops and has a latency of 2 cycles, while each movXps is 1 uop and 1 cycle of latency.
This is probably because unaligned 128-bit accesses were split into two 64-bit accesses at the time.
On Skylake the opposite is true: movups performs better than two movXps.

So the compiler has to pick one.
In general, GCC picks the first variant because Core2 is an old micro-architecture, but we can override this with -mtune.
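The "if the vectors don't overlap" caveat above is worth spelling out: with plain pointers, GCC must either prove non-overlap or guard the vectorised path with a runtime aliasing check. A variant using restrict (a sketch; the name bar_restrict is mine, and the exact assembly still depends on the GCC version and the -mtune value) lets the compiler assume the arrays never alias:

```c
/* Same computation as bar(), but restrict promises the compiler that
   a and b never overlap, so the vectorised path needs no runtime check. */
float bar_restrict(float *restrict a, const float *restrict b)
{
    for (int i = 0; i < 4; i++)
        a[i] += b[i];          /* candidate for SSE vectorisation */

    float r = 0;
    for (int i = 0; i < 4; i++)
        r += a[i];             /* horizontal sum of the updated array */

    return r;
}
```

With {1, 2, 3, 4} and {1, 1, 1, 1} as inputs, a becomes {2, 3, 4, 5} and the function returns 14.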


¹ The instruction set is selected with other switches, such as -march.

Encephalography answered 12/6, 2017 at 13:49 Comment(7)
This shows how important experienced programmers are on this site. The explanation is spot on and your example is worth a thousand words. I don't usually leave +1 comments, but this truly deserves a "Great job!". Thank you!Juetta
Fantastic. So basically, mtune will not put instructions which are exclusive to the given architecture anywhere in the code... and exactly the same machine code will be executed no matter what cpu it's being run on... And the trick is to choose, among the available instructions (the "lowest common denominator", which are generic unless march is also used), which sequence of instructions are best implemented in terms of hardware in the CPU we specify?Parricide
@Marc.2377, it's not about instruction exclusivity, you can have 2 micro-architectures support the same ISA, but have them optimized differently, so for example a simple scalar addition would best be achieved with an add instruction on one, but a lea on the other (ignoring side effects for a second). The compiler would therefore pick the actual instruction based on the optimization target requested by -mtune. P.S. - great answer indeed!Razzia
@Parricide - more or less, but mtune doesn't actually preclude the use of a dispatcher. They are more or less orthogonal. As Margaret explains, mtune=X means "use the machine model for X to make optimization decisions", but still create code that runs based on the march argument. You can kind of imagine that mtune and march always have some value: even if they aren't specified on the command line, they take a default. So some compilers (uncommonly) and libraries (commonly) like to use dispatch-based code, and dispatch can still occur if you specify mtune.Leund
Keep in mind also that mtune can indirectly affect the CPUs that your code runs on, depending on march. For example, you can have a -march value of Y (explicit or implicit), which means that the compiler is allowed to generate code that runs on Y and later architecture Z (and so on) but may not run on earlier arch X. Note "may" - it may be the case that the code runs just fine on X because the compiler never wanted to use any instruction that is present on Y but not X. Specifying -mtune=blah, however, can change that and suddenly it doesn't run on X.Leund
gcc should use movsd instead of pxor + movlps to load the 64-bit low half and zero the upper half. Silly compiler :( Nice choice of example, though. Unaligned loads becoming cheap in more recent CPUs (and free when the data happen to be aligned) is an interesting thing. But Core2 doesn't just split 128-bit accesses. movaps is 1 uop. It's just that unaligned loads didn't have as much hardware support, so they always used multiple uops and couldn't be efficient in case the data did happen to be aligned at runtime. With more load-port hardware, they can be 1 uop in NHM and later.Orlena
@PeterCordes, great point, I added aligned_float and restrict which greatly cleans up the assembly and shows both solutions for core2 godbolt.org/z/DvvAg_Barrera
