Assembly: why is "lea eax, [eax + eax*const]; shl eax, const;" combined faster than "imul eax, eax, const" according to gcc -O2?

I'm using Godbolt to get the assembly of the following program:

#include <stdio.h>
volatile int a = 5;
volatile int res = 0;
int main() {
    res = a * 36;
    return 1;
}

If I use -Os optimization, the generated code is natural:

mov     eax, DWORD PTR a[rip]
imul    eax, eax, 36
mov     DWORD PTR res[rip], eax

But if I use -O2, the generated code is this:

mov     eax, DWORD PTR a[rip]
lea     eax, [rax+rax*8]
sal     eax, 2
mov     DWORD PTR res[rip], eax

So instead of multiplying 5*36, it does 5 -> 5+5*8=45 -> 45*4 = 180. I assume this is because 1 imul is slower than 1 lea + 1 shift left.
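To spell out the arithmetic of that transformation, here's a quick Python sketch of the identity gcc is exploiting (just the math, not what gcc emits):

a = 5
t = a + a * 8   # what lea eax, [rax+rax*8] computes: a*9 = 45
r = t << 2      # what sal eax, 2 computes: 45*4 = 180
assert r == a * 36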

But the lea instruction needs to calculate rax+rax*8, which involves 1 addition + 1 multiplication. So why is this still faster than a single imul? Is it because the address calculation inside lea is free?

Edit 1: also, how does [rax + rax*8] get translated into machine code? Does it get compiled down to additional internal operations (something like a shift rbx = rax << 3 followed by an add rax += rbx), or something else?

Edit 2: Surprising results below. I made a loop, generated code using -O2, then copied the file and replaced the segment above with the code from -Os. So the two assembly files are identical everywhere except for the instructions we're benchmarking. Running on Windows, the commands are

gcc mul.c -O2 -S -masm=intel -o mulo2.s 
gcc mulo2.s -o mulo2
// replace line of code in mulo2.s, save as muls.s
gcc muls.s -o muls
cmd /v:on /c "echo !time! & START "TestAgente" /W mulo2 & echo !time!"
cmd /v:on /c "echo !time! & START "TestAgente" /W muls & echo !time!"

#include <stdio.h>

volatile int a = 5;
volatile int res = 0;

int main() {
    size_t LOOP = 1000 * 1000 * 1000;
    LOOP = LOOP * 10;
    size_t i = 0;
    while (i < LOOP) {
      i++;
      res = a * 36;
    }

    return 0;
}

; mulo2.s
    .file   "mul.c"
    .intel_syntax noprefix
    .text
    .def    __main; .scl    2;  .type   32; .endef
    .section    .text.startup,"x"
    .p2align 4
    .globl  main
    .def    main;   .scl    2;  .type   32; .endef
    .seh_proc   main
main:
    sub rsp, 40
    .seh_stackalloc 40
    .seh_endprologue
    call    __main
    movabs  rdx, 10000000000
    .p2align 4,,10
    .p2align 3
.L2:
    mov eax, DWORD PTR a[rip]
    lea eax, [rax+rax*8] ; replace these 2 lines with
    sal eax, 2           ; imul eax, eax, 36
    mov DWORD PTR res[rip], eax
    sub rdx, 1
    jne .L2
    xor eax, eax
    add rsp, 40
    ret
    .seh_endproc
    .globl  res
    .bss
    .align 4
res:
    .space 4
    .globl  a
    .data
    .align 4
a:
    .long   5
    .ident  "GCC: (GNU) 9.3.0"

Surprisingly, the result is that the -Os version is consistently faster than the -O2 one (4.1s vs 5s on average, Intel i7-8750H CPU, each .exe file run several times). So in this case, the compiler's optimization backfired. Could someone provide a new explanation given this benchmark?

Edit 3: To measure the effect of instruction cache lines, here's a Python script that generates different addresses for the main loop by inserting nop instructions right before it. It's written for Windows; for Linux it only needs small modifications.

#cd "D:\Learning\temp"
import os
import time
import datetime as dt

f = open("mulo2.s","r")
lines = [line for line in f]
f.close()

def addNop(cnt, outputname):
    f = open(outputname, "w")
    for i in range(17):
        f.write(lines[i])
    for i in range(cnt):
        f.write("\tnop\n")
    for i in range(17, len(lines)):
        f.write(lines[i])
    f.close()

if os.path.isdir("nop_files")==False:
    os.mkdir("nop_files")
MAXN = 100
for t in range(MAXN+1):
    sourceFile = "nop_files\\mulo2_" + str(t) + ".s" # change \\ to / on Linux
    exeFile = "nop_files\\mulo2_" + str(t)
    if os.path.isfile(sourceFile)==False:
        addNop(t, sourceFile)
        os.system("gcc " + sourceFile + " -o " + exeFile)
    runtime = os.popen("timecmd " + exeFile).read() # use time
    print(str(t) + " nop: " + str(runtime))

Result:

0 nop: command took 0:0:4.96 (4.96s total)
1 nop: command took 0:0:4.94 (4.94s total)
2 nop: command took 0:0:4.90 (4.90s total)
3 nop: command took 0:0:4.90 (4.90s total)
4 nop: command took 0:0:5.26 (5.26s total)
5 nop: command took 0:0:4.94 (4.94s total)
6 nop: command took 0:0:4.92 (4.92s total)
7 nop: command took 0:0:4.98 (4.98s total)
8 nop: command took 0:0:5.02 (5.02s total)
9 nop: command took 0:0:4.97 (4.97s total)
10 nop: command took 0:0:5.12 (5.12s total)
11 nop: command took 0:0:5.01 (5.01s total)
12 nop: command took 0:0:5.01 (5.01s total)
13 nop: command took 0:0:5.07 (5.07s total)
14 nop: command took 0:0:5.08 (5.08s total)
15 nop: command took 0:0:5.07 (5.07s total)
16 nop: command took 0:0:5.09 (5.09s total)
17 nop: command took 0:0:7.96 (7.96s total) # slow 17
18 nop: command took 0:0:7.93 (7.93s total)
19 nop: command took 0:0:7.88 (7.88s total)
20 nop: command took 0:0:7.88 (7.88s total)
21 nop: command took 0:0:7.94 (7.94s total)
22 nop: command took 0:0:7.90 (7.90s total)
23 nop: command took 0:0:7.92 (7.92s total)
24 nop: command took 0:0:7.99 (7.99s total)
25 nop: command took 0:0:7.89 (7.89s total)
26 nop: command took 0:0:7.88 (7.88s total)
27 nop: command took 0:0:7.88 (7.88s total)
28 nop: command took 0:0:7.84 (7.84s total)
29 nop: command took 0:0:7.84 (7.84s total)
30 nop: command took 0:0:7.88 (7.88s total)
31 nop: command took 0:0:7.91 (7.91s total)
32 nop: command took 0:0:7.89 (7.89s total)
33 nop: command took 0:0:7.88 (7.88s total)
34 nop: command took 0:0:7.94 (7.94s total)
35 nop: command took 0:0:7.81 (7.81s total)
36 nop: command took 0:0:7.89 (7.89s total)
37 nop: command took 0:0:7.90 (7.90s total)
38 nop: command took 0:0:7.92 (7.92s total)
39 nop: command took 0:0:7.83 (7.83s total)
40 nop: command took 0:0:4.95 (4.95s total) # fast 40
41 nop: command took 0:0:4.91 (4.91s total)
42 nop: command took 0:0:4.97 (4.97s total)
43 nop: command took 0:0:4.97 (4.97s total)
44 nop: command took 0:0:4.97 (4.97s total)
45 nop: command took 0:0:5.11 (5.11s total)
46 nop: command took 0:0:5.13 (5.13s total)
47 nop: command took 0:0:5.01 (5.01s total)
48 nop: command took 0:0:5.01 (5.01s total)
49 nop: command took 0:0:4.97 (4.97s total)
50 nop: command took 0:0:5.03 (5.03s total)
51 nop: command took 0:0:5.32 (5.32s total)
52 nop: command took 0:0:4.95 (4.95s total)
53 nop: command took 0:0:4.97 (4.97s total)
54 nop: command took 0:0:4.94 (4.94s total)
55 nop: command took 0:0:4.99 (4.99s total)
56 nop: command took 0:0:4.99 (4.99s total)
57 nop: command took 0:0:5.04 (5.04s total)
58 nop: command took 0:0:4.97 (4.97s total)
59 nop: command took 0:0:4.97 (4.97s total)
60 nop: command took 0:0:4.95 (4.95s total)
61 nop: command took 0:0:4.99 (4.99s total)
62 nop: command took 0:0:4.94 (4.94s total)
63 nop: command took 0:0:4.94 (4.94s total)
64 nop: command took 0:0:4.92 (4.92s total)
65 nop: command took 0:0:4.91 (4.91s total)
66 nop: command took 0:0:4.98 (4.98s total)
67 nop: command took 0:0:4.93 (4.93s total)
68 nop: command took 0:0:4.95 (4.95s total)
69 nop: command took 0:0:4.92 (4.92s total)
70 nop: command took 0:0:4.93 (4.93s total)
71 nop: command took 0:0:4.97 (4.97s total)
72 nop: command took 0:0:4.93 (4.93s total)
73 nop: command took 0:0:4.94 (4.94s total)
74 nop: command took 0:0:4.96 (4.96s total)
75 nop: command took 0:0:4.91 (4.91s total)
76 nop: command took 0:0:4.92 (4.92s total)
77 nop: command took 0:0:4.91 (4.91s total)
78 nop: command took 0:0:5.03 (5.03s total)
79 nop: command took 0:0:4.96 (4.96s total)
80 nop: command took 0:0:5.20 (5.20s total)
81 nop: command took 0:0:7.93 (7.93s total) # slow 81
82 nop: command took 0:0:7.88 (7.88s total)
83 nop: command took 0:0:7.85 (7.85s total)
84 nop: command took 0:0:7.91 (7.91s total)
85 nop: command took 0:0:7.93 (7.93s total)
86 nop: command took 0:0:8.06 (8.06s total)
87 nop: command took 0:0:8.03 (8.03s total)
88 nop: command took 0:0:7.85 (7.85s total)
89 nop: command took 0:0:7.88 (7.88s total)
90 nop: command took 0:0:7.91 (7.91s total)
91 nop: command took 0:0:7.86 (7.86s total)
92 nop: command took 0:0:7.99 (7.99s total)
93 nop: command took 0:0:7.86 (7.86s total)
94 nop: command took 0:0:7.91 (7.91s total)
95 nop: command took 0:0:8.12 (8.12s total)
96 nop: command took 0:0:7.88 (7.88s total)
97 nop: command took 0:0:7.81 (7.81s total)
98 nop: command took 0:0:7.88 (7.88s total)
99 nop: command took 0:0:7.85 (7.85s total)
100 nop: command took 0:0:7.90 (7.90s total)
101 nop: command took 0:0:7.93 (7.93s total)
102 nop: command took 0:0:7.85 (7.85s total)
103 nop: command took 0:0:7.88 (7.88s total)
104 nop: command took 0:0:5.00 (5.00s total) # fast 104
105 nop: command took 0:0:5.03 (5.03s total)
106 nop: command took 0:0:4.97 (4.97s total)
107 nop: command took 0:0:5.06 (5.06s total)
108 nop: command took 0:0:5.01 (5.01s total)
109 nop: command took 0:0:5.00 (5.00s total)
110 nop: command took 0:0:4.95 (4.95s total)
111 nop: command took 0:0:4.91 (4.91s total)
112 nop: command took 0:0:4.94 (4.94s total)
113 nop: command took 0:0:4.93 (4.93s total)
114 nop: command took 0:0:4.92 (4.92s total)
115 nop: command took 0:0:4.92 (4.92s total)
116 nop: command took 0:0:4.92 (4.92s total)
117 nop: command took 0:0:5.13 (5.13s total)
118 nop: command took 0:0:4.94 (4.94s total)
119 nop: command took 0:0:4.97 (4.97s total)
120 nop: command took 0:0:5.14 (5.14s total)
121 nop: command took 0:0:4.94 (4.94s total)
122 nop: command took 0:0:5.17 (5.17s total)
123 nop: command took 0:0:4.95 (4.95s total)
124 nop: command took 0:0:4.97 (4.97s total)
125 nop: command took 0:0:4.99 (4.99s total)
126 nop: command took 0:0:5.20 (5.20s total)
127 nop: command took 0:0:5.23 (5.23s total)
128 nop: command took 0:0:5.19 (5.19s total)
129 nop: command took 0:0:5.21 (5.21s total)
130 nop: command took 0:0:5.33 (5.33s total)
131 nop: command took 0:0:4.92 (4.92s total)
132 nop: command took 0:0:5.02 (5.02s total)
133 nop: command took 0:0:4.90 (4.90s total)
134 nop: command took 0:0:4.93 (4.93s total)
135 nop: command took 0:0:4.99 (4.99s total)
136 nop: command took 0:0:5.08 (5.08s total)
137 nop: command took 0:0:5.02 (5.02s total)
138 nop: command took 0:0:5.15 (5.15s total)
139 nop: command took 0:0:5.07 (5.07s total)
140 nop: command took 0:0:5.03 (5.03s total)
141 nop: command took 0:0:4.94 (4.94s total)
142 nop: command took 0:0:4.92 (4.92s total)
143 nop: command took 0:0:4.96 (4.96s total)
144 nop: command took 0:0:4.92 (4.92s total)
145 nop: command took 0:0:7.86 (7.86s total) # slow 145
146 nop: command took 0:0:7.87 (7.87s total)
147 nop: command took 0:0:7.83 (7.83s total)
148 nop: command took 0:0:7.83 (7.83s total)
149 nop: command took 0:0:7.84 (7.84s total)
150 nop: command took 0:0:7.87 (7.87s total)
151 nop: command took 0:0:7.84 (7.84s total)
152 nop: command took 0:0:7.88 (7.88s total)
153 nop: command took 0:0:7.87 (7.87s total)
154 nop: command took 0:0:7.83 (7.83s total)
155 nop: command took 0:0:7.85 (7.85s total)
156 nop: command took 0:0:7.91 (7.91s total)
157 nop: command took 0:0:8.18 (8.18s total)
158 nop: command took 0:0:7.94 (7.94s total)
159 nop: command took 0:0:7.92 (7.92s total)
160 nop: command took 0:0:7.92 (7.92s total)
161 nop: command took 0:0:7.97 (7.97s total)
162 nop: command took 0:0:8.12 (8.12s total)
163 nop: command took 0:0:7.89 (7.89s total)
164 nop: command took 0:0:7.92 (7.92s total)
165 nop: command took 0:0:7.88 (7.88s total)
166 nop: command took 0:0:7.80 (7.80s total)
167 nop: command took 0:0:7.82 (7.82s total)
168 nop: command took 0:0:4.97 (4.97s total) # fast
169 nop: command took 0:0:4.97 (4.97s total)
170 nop: command took 0:0:4.95 (4.95s total)
171 nop: command took 0:0:5.00 (5.00s total)
172 nop: command took 0:0:4.95 (4.95s total)
173 nop: command took 0:0:4.93 (4.93s total)
174 nop: command took 0:0:4.91 (4.91s total)
175 nop: command took 0:0:4.92 (4.92s total)

Points where the program switches from fast to slow (and back) are: 17S-40F-81S-104F-145S-168F. The distance from a slow point to the next fast point is 23 nops, and from a fast point to the next slow point is 41 nops. Checking objdump, the main loop occupies 24 bytes; since 64 - 24 = 40, a loop placed at the start of a cache line (address mod 64 == 0) can be shifted by up to 40 bytes and stay within one line, while shifting it by 41 bytes makes it cross the cache-line boundary, causing the slowdown. So in the default code (no nop added), the main loop already sits entirely inside one cache line.
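To sanity-check that arithmetic, here's a tiny Python sketch. The 24-byte loop size and 64-byte line size come from the discussion above; the loop's base offset of 24 within its cache line is inferred from the 17/40/81/104 switch points, so treat it as an assumption:

LINE = 64   # cache-line size in bytes
LOOP = 24   # main-loop size in bytes, from objdump
BASE = 24   # loop start offset within its line at 0 nops (inferred, an assumption)

def crosses(n_nops):
    off = (BASE + n_nops) % LINE   # each inserted nop is 1 byte
    return off + LOOP > LINE       # True if the loop spans two cache lines

assert not crosses(16) and crosses(17)   # slow starts at 17 nops
assert crosses(39) and not crosses(40)   # fast again at 40
assert not crosses(80) and crosses(81)   # slow again at 81

This reproduces the measured 23/41 pattern exactly.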

So we know that the -O2 version being slower is not because of instruction address alignment. At first, the only culprit left seemed to be instruction decoding speed, but a different culprit was found; see @Jérôme Richard's answer.

Edit 4: Skylake decodes up to 16 bytes per cycle. However, the -Os and -O2 loops are 21 and 24 bytes respectively, so both require 2 cycles to fetch. So where does the speed difference come from?
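(The fetch-cycle count, spelled out in Python:)

import math
for name, size in (("-Os", 21), ("-O2", 24)):
    print(name, math.ceil(size / 16), "fetch cycles")   # both print 2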

Conclusion: while the compiler is theoretically correct (lea + sal are two very cheap instructions, and the address calculation inside lea is free since it uses dedicated hardware), in practice the single, more expensive imul can be faster due to extremely complex details of the CPU microarchitecture, including instruction decoding speed, micro-operation (uop) counts, and execution-port scheduling.

Fremont answered 11/12, 2021 at 15:59 Comment(11)
Multiplying by 8 is just shifting to the left by three bits.Fabe
but that's still 4 instructions total: lea, 2 shifts, 1 add. Is mul really that slow?Fremont
Btw did you try to benchmark this over billions of main() calls? (or renaming main() as f() for instance) just in case...Customary
I don't know how to benchmark assembly code directly :\ so I haven't tested the 2 versionsFremont
Rename 'main' as 'f' (inline function or just loop over that) and in the new main() call f() a billion times. Now generate one exec with Os and another one with O2; a not-so-accurate but easy test is (on Linux): time firstone, time secondoneCustomary
@Déjàvu right, what a brain lag moment. I will benchmark that. However, doing that might add some stack push/pop, which might make the raw difference between the computation instructions less noticeableFremont
I think a multiplier is much more complex than an adder in circuitry. The scale factor in lea is one of 1, 2, 4, 8, so I guess it's hard-wired. Also, lea does not set the FLAGS register whereas imul does.Fabe
(btw the loop overhead might bias the result since it will also be optimized, maybe make a loop with the code copy-pasted 10 times, and check the assembly ; as for your comment, I modified mine above to recommend an inline func or just a loop in main)Customary
[rax + rax*8] is translated into machine code as a "complex memory address", ie exactly how it's written, not split into additional instructions. Related: x64 instruction encoding and the ModRM bytePavilion
I've just run the benchmark and the -O2 version is actually slower than -Os. Can someone confirm?Fremont
Skylake decodes 16 bytes per cycle - Skylake has a uop cache (DSB), exactly because legacy decode has even worse throughput problems than the 4 uops/clock issue/rename bottleneck. (New in Sandybridge-family, see realworldtech.com/sandy-bridge and Agner Fog's microarch guide - agner.org/optimize. ) SnB-family also has a loop buffer (LSD), but unfortunately Skylake-family has that disabled to work around CPU bugs. Still, the uop cache is able to fetch up to 6 uops / clock from a line that caches uops from a 32-byte chunk of x86 machine code.Shushubert

You can see the cost of instructions on most mainstream architectures here and there. Based on that, and assuming for example an Intel Skylake processor, you can see that one 32-bit imul instruction can be computed per cycle, but with a latency of 3 cycles. In the optimized code, 2 lea instructions (which are very cheap) can be executed per cycle, with a 1-cycle latency. The same applies to the sal instruction (2 per cycle, 1 cycle of latency).

This means that the optimized version can be executed with only 2 cycles of latency, while the first one takes 3 cycles of latency (not counting the load/store instructions, which are the same). Moreover, the second version can be pipelined better, since the two instructions can be executed for two different input values in parallel thanks to superscalar out-of-order execution. Note that two loads can be executed in parallel too, although only one store can be executed per cycle. This means that the execution is bounded by the throughput of the store instructions: overall, only 1 value can be computed per cycle. AFAIK, recent Intel Icelake processors can do two stores in parallel, like the newer AMD Ryzen processors. So the -O2 version is expected to be as fast as, or possibly faster than, the -Os one on the chosen target (Intel Skylake), and significantly faster on very recent x86-64 processors.
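In numbers, comparing the dependency-chain length through each version (latency values taken from the paragraph above, a sketch):

lat = {"imul": 3, "lea": 1, "sal": 1}            # latencies in cycles (Skylake)
print("imul chain:   ", lat["imul"])             # -Os: 3 cycles
print("lea+sal chain:", lat["lea"] + lat["sal"]) # -O2: 2 cycles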

Note that the lea instruction is very fast because the multiply-add is done on a dedicated CPU unit (hard-wired shifters) and it only supports a few specific constants for the multiplication (the supported scale factors are 1, 2, 4 and 8, which means that one lea can multiply an integer by the constants 2, 3, 4, 5, 8 and 9). This is why lea is faster than imul/mul.
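A quick way to enumerate that claim: one lea computes either index*s or base + index*s with s in {1, 2, 4, 8}, so (using the same register as base and index) the reachable multipliers are:

scales = [1, 2, 4, 8]                        # the only factors the encoding allows
factors = sorted(set(scales) | {1 + s for s in scales})
print(factors)   # [1, 2, 3, 4, 5, 8, 9]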


UPDATE (v2):

I can reproduce the slower execution with -O2 using GCC 11.2 (on Linux with an i5-9600KF processor).

The main source of slowdown is the higher number of micro-operations (uops) executed in the -O2 version, most likely combined with the saturation of some execution ports due to bad micro-operation scheduling.

Here is the assembly of the loop with -Os:

    1049:   8b 15 d9 2f 00 00       mov    edx,DWORD PTR [rip+0x2fd9]        # 4028 <a>
    104f:   6b d2 24                imul   edx,edx,0x24
    1052:   89 15 d8 2f 00 00       mov    DWORD PTR [rip+0x2fd8],edx        # 4030 <res>
    1058:   48 ff c8                dec    rax
    105b:   75 ec                   jne    1049 <main+0x9>

Here is the assembly of the loop with -O2:

    1050:   8b 05 d2 2f 00 00       mov    eax,DWORD PTR [rip+0x2fd2]        # 4028 <a>
    1056:   8d 04 c0                lea    eax,[rax+rax*8]
    1059:   c1 e0 02                shl    eax,0x2
    105c:   89 05 ce 2f 00 00       mov    DWORD PTR [rip+0x2fce],eax        # 4030 <res>
    1062:   48 83 ea 01             sub    rdx,0x1
    1066:   75 e8                   jne    1050 <main+0x10>

Modern x86-64 processors decode (variable-sized) instructions and then translate them into (simpler, fixed-sized) micro-operations, which are finally executed (often in parallel) on several execution ports. More information about the specific Skylake architecture can be found here. Skylake can macro-fuse some instruction pairs into a single micro-operation: here, the dec+jne and sub+jne pairs are each fused into one uop. This means that the -Os version executes 4 uops/iteration while the -O2 version executes 5 uops/iteration.
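Counting them out per iteration (one item per fused-domain uop; macro-fused pairs count once, as described above):

os_uops = ["mov (load a)", "imul", "mov (store res)", "dec+jne (fused)"]
o2_uops = ["mov (load a)", "lea", "shl", "mov (store res)", "sub+jne (fused)"]
print(len(os_uops), len(o2_uops))   # 4 vs 5 fused-domain uops per iteration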

The uops are stored in a uop cache called the Decoded Stream Buffer (DSB) so that the processor does not need to decode/translate the instructions of a (small) loop again. Cached uops to be executed are sent to a queue called the Instruction Decode Queue (IDQ). Up to 6 uops/cycle can be sent from the DSB to the IDQ. For the -Os version, only 4 uops of the DSB are sent to the IDQ every cycle (likely because the loop is bounded by the store port, which is saturated). For the -O2 version, 5 uops of the DSB are sent to the IDQ per cycle, but only 4 cycles out of 5 (on average)! This means that 1 cycle of latency is added every 4 cycles, resulting in a 25% slower execution. The cause of this effect is unclear and appears to be related to the uop scheduling.
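In other words, per 5-cycle window the front end delivers 5 uops on only 4 of the cycles; a quick check of that arithmetic:

uops_per_window = 5 * 4                  # 5 uops delivered on 4 of every 5 cycles
iters_per_window = uops_per_window / 5   # 5 fused uops per -O2 iteration -> 4 iterations
print(5 / iters_per_window)              # 1.25 cycles/iteration, i.e. 25% slower than 1.00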

Uops are then sent to the Resource Allocation Table (RAT) and issued to the Reservation Station (RS). The RS dispatches the uops to the ports that execute them. Then, the uops are retired (i.e. committed). The number of uops indirectly transmitted from the DSB to the RS is the same for both versions, and the same number of uops is retired. However, 1 extra ghost uop is dispatched by the RS every cycle (and executed by the ports) in both versions. This is probably a uop used to compute the address of the store (since the store port does not have its own dedicated AGU).

Here are per-iteration statistics gathered from hardware counters (using perf):

version | instruction | issued-uops | executed-uops | retired-uops | cycles
"-Os"   |      5      |      4      |        5      |       4      |  1.00
"-O2"   |      6      |      5      |        6      |       5      |  1.25

Here are the overall port-utilization statistics:

 port  |   type      |  "-Os"  |   "-O2"
-----------------------------------------
    0  | ALU/BR      |     0%  |    60%
    1  | ALU/MUL/LEA |   100%  |    38%
    2  | LOAD/AGU    |    65%  |    60%
    3  | LOAD/AGU    |    73%  |    60%
    4  | STORE       |   100%  |    80%
    5  | ALU/LEA     |     0%  |    42%
    6  | ALU/BR      |   100%  |   100%
    7  | AGU         |    62%  |    40%
-----------------------------------------
 total |             |   500%  |   480%

Port 6 is the only port that is fully saturated in the -O2 version, which is unexpected, and this certainly explains why an additional cycle is needed every 5 cycles. Note that only the uops associated with the shl and sub+jne instructions use ports 0 and 6 (simultaneously, and no other ports).

Note that the total of 480% is a scheduling artifact due to the stalling cycle. Indeed, 6*4=24 uops should be executed every 5 cycles (24/5*100=480). Note also that the store port is not needed 1 cycle out of 5 (4 iterations are executed every 5 cycles on average, and so 4 store uops), hence its 80% usage.



Cymric answered 11/12, 2021 at 16:31 Comment(11)
I've just updated the question with a sample code (execute both instructions 10^10 times), and for some reason the -O2 version is SLOWER. Can you run the benchmark to see what's wrong?Fremont
Ok, I can reproduce the problem although the generated code is not totally equivalent. I clarified the point about the store instruction: because the execution is bounded by the stores, you should not see a significant performance difference with -O2. That being said, I did not expect this to be slower. I think this is due to the decoding of the instructions. So the answer will be a bit more complex because of that ;)Schnorrer
So there's 1 thing left to do: can you try adding some instructions to the -O2 version so that the main loop is contained in a single cache line? Then benchmark that again. Also, which software do you use to see the address of an instruction?Fremont
Yeah. objdump can be used to read the assembly code of a program with the (virtual) addresses. For now, my current running environment (Windows) does not allow me to track what is going on precisely (so I trusted Godbolt and described what could happen so far). I will switch soon to Linux to be more precise/sure about this (hoping the problem can be fully reproduced). Asking GCC to align loops in the -O2 case did not fix the performance issue, so it is likely due to the decoder throughput. Results are very consistent so far with the generated binary code. I'll keep you informed.Schnorrer
I have just added a script to generate all possible alignments of the instruction addresses. It shows that in the default case, the main loop is inside a single cache line, contrary to what you commented. Can you update the answer for future readers? Anyway, I guess the only possible answer left is CPU instruction decoding speedFremont
Note that x86 addressing modes encode the scale factor as a 2-bit shift count. So it's not just "hardwired multiply", it's assemble-time conversion to a shift count, which of course is quite cheap. (A barrel shifter that only has to support 4 different shift counts is even simpler than the full barrel shifter needed to support instructions like shl efficiently.) So it's very significant that the allowed scale factors are powers of 2. (And yes, using [same + same*scale] you can get 2^n+1 scaling if you don't add to another reg.)Shushubert
Related: How to multiply a register by 37 using only 2 consecutive leal instructions in x86? which compares latency vs. imul and points out that recent compilers tend to favour latency over front-end throughput here. (But @HuyLe's loop is loading from one global, storing to another, so it's only measuring throughput, not latency. If it was storing back into the same variable, latency of store-forwarding plus ALU would be the only factor, hiding any front-end throughput issues.)Shushubert
@HuyLe: For GCC, using volatile stops it from folding the load into a memory operand for imul, like imul eax, a[rip], 36 which is only 1 uop for the front-end, micro-fused, on Skylake. (Except Intel can't micro-fuse an instruction with a RIP-relative addressing mode and an immediate, so it would only save front-end bandwidth if you loaded the address into another register outside the loop, for imul eax, [rcx], 36 or something.)Shushubert
I have edited the answer based on a much deeper analysis. It turns out the issue is much more complex to understand/explain than I thought. I can confirm that the problem does not come from the alignment, but I also found out that it does not come from the decoding part either! Hardware counters indicate that it is either a (very strange) artifact of the front-end or simply bad port scheduling. I did not find any hardware counter to discriminate between the two. However, the second one seems very likely given the gathered statistics.Schnorrer
This means that 1 cycle of latency is added every 4 cycle resulting - We don't call that latency, it's not part of a dependency chain, or hurting throughput by lengthening a dep chain. It's missed throughput. Otherwise yeah, much better explanation. And yes, 5-uop loops don't always run perfectly, even if all five uops come from the same line of the uop cache. (If not, it's far worse, 1 iter / 2 clocks). Is performance reduced when executing loops whose uop count is not a multiple of processor width? (5 uops happens to be the same with/without LSD)Shushubert
I wouldn't be surprised if we're getting some resource conflicts with shl "stealing" cycles on port 6, which is needed for predicted-taken branches. You can check for that with uops_dispatched_port.port_6 and seeing if it's saturated, (counts for that = cycles), even though the loop isn't running at peak front-end throughput of 4 uops / clock. Or try changing it to lea reg, [reg*4] which can only run on p15, rather than p06.Shushubert

tl;dr: Because LEA doesn't do full-fledged multiplication.

While @JeromeRichard's answer is correct, the underlying kernel of truth is hidden in its last sentence: with LEA, you can only scale by a few specific constants, each a power of two. Thus, instead of needing a large dedicated circuit for multiplication, it only needs a small sub-circuit for shifting one of its operands by a fixed amount.
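For instance, taking the actual encoding from the objdump above, 8d 04 c0 = lea eax,[rax+rax*8]: the third byte is the SIB byte, which stores the scale not as a multiplier but as a 2-bit shift count. A sketch of that decode:

sib = 0xC0                       # SIB byte of "8d 04 c0" (lea eax,[rax+rax*8])
scale_bits = (sib >> 6) & 0b11   # 2-bit scale field: 0b11
index = (sib >> 3) & 0b111       # 0b000 -> rax
base = sib & 0b111               # 0b000 -> rax
print(1 << scale_bits)           # 8: the hardware only ever shifts the index by 0..3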

Manna answered 11/12, 2021 at 16:50 Comment(6)
Could you benchmark the code I provided in edit 2? It shows the -Os version actually running fasterFremont
@HuyLe: I think you need to separate your second edit into its own question, because you're asking something else. Link the new question to this one. Also, please present complete examples, i.e. two assembly programs or two C programs; it's difficult to understand exactly what you ran.Manna
But the second edit contains the same instructions. I'm just benchmarking them 10^10 times instead of once?Fremont
@HuyLe: It's a different question. One question is about two assembly operators in general - even if the motivation is a given program; another question is about a specific program's runtime. And again, I would need a proper MRE.Manna
The assembly code was generated with -O2. You can replace the lines "lea eax ...", with "imul eax..." to get the -Os code. Basically the program is the same everywhere, except for those 2 lines. Use "gcc mul.s -o mul" to get a runnable programFremont
I think it's an MRE though, since the benchmark program compiles without modification, and the "critical part" is only 2 linesFremont
