-mtune doesn't create a dispatcher, it doesn't need one: we are already telling the compiler what architecture we are targeting.
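For contrast, here is a minimal sketch of what a hand-written dispatcher could look like. It is only an illustration: sum_generic, sum_avx2, pick_sum and sum are hypothetical names, and sum_avx2 is a stand-in that simply forwards to the generic version. The only real API used is the GCC/Clang builtin pair __builtin_cpu_init / __builtin_cpu_supports, guarded so the code still builds elsewhere.

```c
#include <stddef.h>

/* Hypothetical baseline implementation. */
static float sum_generic(const float *v, size_t n)
{
    float r = 0.0f;
    for (size_t i = 0; i < n; i++)
        r += v[i];
    return r;
}

/* Hypothetical stand-in for a variant that would be compiled with AVX2
 * enabled; here it just forwards to the generic code. */
static float sum_avx2(const float *v, size_t n)
{
    return sum_generic(v, n);
}

typedef float (*sum_fn)(const float *, size_t);

/* The dispatcher: inspect the CPU at run time and choose a variant. */
static sum_fn pick_sum(void)
{
    int has_avx2 = 0;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    has_avx2 = __builtin_cpu_supports("avx2");
#endif
    return has_avx2 ? sum_avx2 : sum_generic;
}

/* Public entry point: resolves the variant on first use.
 * Not thread-safe; this is a sketch, not production code. */
float sum(const float *v, size_t n)
{
    static sum_fn impl;
    if (!impl)
        impl = pick_sum();
    return impl(v, n);
}
```

With -mtune none of this is needed: there is a single version of each function, and it only uses instructions that every CPU of the selected architecture can execute.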
From the GCC docs:
-mtune=cpu-type
Tune to cpu-type everything applicable about the generated code, except for the ABI and the
set of available instructions.
This means that GCC won't use instructions available only on cpu-type [1], but it will generate code that runs optimally on cpu-type.
To understand this last statement, it is necessary to understand the difference between architecture and micro-architecture.
The architecture implies an ISA (Instruction Set Architecture), and that's not influenced by -mtune.
The micro-architecture is how the architecture is implemented in hardware.
For the same instruction set (read: architecture), a code sequence may run optimally on one CPU (read: micro-architecture) but not on another, due to the internal details of the implementation.
This can go as far as having a code sequence being optimal only on one micro-architecture.
When generating machine code, GCC often has some freedom in choosing how to order the instructions and which variant to use.
It uses a heuristic to generate a sequence of instructions that runs fast on the most common CPUs; sometimes it will sacrifice a 100% optimal solution for CPU x if that solution would penalise CPUs y, z and w.
When we use -mtune=x we are fine-tuning the output of GCC for CPU x, thereby producing code that is 100% optimal (from GCC's perspective) on that CPU.
As a concrete example, consider how this code is compiled:
float bar(float a[4], float b[4])
{
    for (int i = 0; i < 4; i++)
    {
        a[i] += b[i];
    }
    float r = 0;
    for (int i = 0; i < 4; i++)
    {
        r += a[i];
    }
    return r;
}
The a[i] += b[i]; is vectorised (if the vectors don't overlap) differently when targeting a Skylake or a Core2:
Skylake
movups xmm0, XMMWORD PTR [rsi]
movups xmm2, XMMWORD PTR [rdi]
addps xmm0, xmm2
movups XMMWORD PTR [rdi], xmm0
movss xmm0, DWORD PTR [rdi]
Core2
pxor xmm0, xmm0
pxor xmm1, xmm1
movlps xmm0, QWORD PTR [rdi]
movlps xmm1, QWORD PTR [rsi]
movhps xmm1, QWORD PTR [rsi+8]
movhps xmm0, QWORD PTR [rdi+8]
addps xmm0, xmm1
movlps QWORD PTR [rdi], xmm0
movhps QWORD PTR [rdi+8], xmm0
movss xmm0, DWORD PTR [rdi]
The main difference is how an xmm register is loaded: on a Core2 it is loaded with two loads, using movlps and movhps, instead of a single movups.
The two-load approach is better on the Core2 micro-architecture: if you take a look at Agner Fog's instruction tables, you'll see that movups is decoded into 4 uops and has a latency of 2 cycles, while each movXps is 1 uop with 1 cycle of latency.
This is probably due to the fact that 128-bit accesses were split into two 64-bit accesses at the time.
On Skylake the opposite is true: movups performs better than two movXps.
So we have to pick one.
In general, GCC picks the first variant (the single movups) because Core2 is an old micro-architecture, but we can override this with -mtune.
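To make the two load strategies concrete, here is a minimal sketch that spells them out by hand with SSE intrinsics. The function names add4_single_load and add4_split_load are hypothetical, and the scalar fallback exists only so the code builds without SSE. This is not what GCC does internally, and the compiler remains free to emit different instructions for these intrinsics. The point is only that both variants compute the same result and differ in how the registers are filled.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* a[i] += b[i] for 4 floats, one unaligned 128-bit load per operand
 * (the movups-style variant). */
void add4_single_load(float *a, const float *b)
{
#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(a, _mm_add_ps(va, vb));
#else
    for (size_t i = 0; i < 4; i++)
        a[i] += b[i];
#endif
}

/* Same operation, but each operand is loaded and stored as two 64-bit
 * halves (the movlps/movhps-style variant). */
void add4_split_load(float *a, const float *b)
{
#if defined(__SSE__)
    __m128 va = _mm_setzero_ps();
    __m128 vb = _mm_setzero_ps();
    va = _mm_loadl_pi(va, (const __m64 *)a);        /* low half:  a[0..1] */
    va = _mm_loadh_pi(va, (const __m64 *)(a + 2));  /* high half: a[2..3] */
    vb = _mm_loadl_pi(vb, (const __m64 *)b);
    vb = _mm_loadh_pi(vb, (const __m64 *)(b + 2));
    va = _mm_add_ps(va, vb);
    _mm_storel_pi((__m64 *)a, va);
    _mm_storeh_pi((__m64 *)(a + 2), va);
#else
    for (size_t i = 0; i < 4; i++)
        a[i] += b[i];
#endif
}
```

Both functions leave the same values in a; which one is faster depends on the micro-architecture, and that is exactly the choice -mtune lets GCC make for us.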
[1] The instruction set is selected with other switches.
march vs mtune. While those questions show what mtune promises, this question specifically asks what the compiler can do to fulfill those promises. – Juetta