How to multiply a register by 37 using only 2 consecutive leal instructions in x86?
Asked Answered
A

1

8

Say %edi contains x and I want to end up with 37*x using only 2 consecutive leal instructions, how would I go about this?

For example to get 45x you would do

leal (%edi, %edi, 8), %edi   
leal (%edi, %edi, 4), %eax (to be returned)

I cannot for the life of me figure out what numbers to put in place of the 8 and 4 so that the result (%eax) will be 37x

Alleen answered 29/9, 2017 at 1:44 Comment(2)
Did you try asking a compiler? godbolt.org/g/nMbujJ. Hint: you have to change more than the scale factors. The 2nd LEA uses the original input + the first LEA result.Bish
Also, given your choice of registers, this look like you're using 64-bit code for the System V ABI. There's no benefit to using an address-size override prefix to get 32-bit addressing modes in 64-bit mode. It's always safe to let lea truncate a 64-bit addressing mode to 32-bit. stackoverflow.com/questions/34377711/…Bish
B
14

At -O3, gcc will emit (Godbolt compiler explorer):

int mul37(int a)  { return a*37; }

    leal    (%rdi,%rdi,8), %eax      # eax = a * 9
    leal    (%rdi,%rax,4), %eax      # eax = a + 4*(a*9)
    ret

That's using 37 = 9*4 + 1, not destroying the original a value with the first lea so it can use both in the 2nd.

You're in good company in not spotting this one, though: recent clang (3.8 and newer) will normally use 2 lea instructions instead of an imul (e.g. for *15), but it misses this one and uses:

    imull   $37, %edi, %eax
    ret

It does do *21 with the same pattern as gcc uses, as 5*4 + 1. (clang3.6 and earlier always used imul unless there was a single-instruction alternative shl or lea)

ICC and MSVC also use imul, but they don't seem to like using 2 lea instructions, so the imul is "on purpose" there.

See the godbolt link for a variety of multipliers with gcc7.2 vs. clang5.0. It's interesting to try gcc -m32 -mtune=pentium or even pentium3 to see how many more instructions gcc was wiling to use back then. Although P2/P3 has 4-cycle latency for imul r, r, i, so that's kinda crazy. Pentium has 9 cycle imul and no OOO to hide the latency, so it makes sense to try hard to avoid it.

mtune=silvermont should probably only be willing to replace 32-bit imul with a single instruction, because it has 3-cycle latency / 1c throughput multiply, but decode is often the bottleneck (according to Agner Fog, http://agner.org/optimize/). You could even consider imul $64, %edi, %eax (or other powers of 2) instead of mov/shl, because imul-immediate is a copy-and-multiply.


Ironically, gcc misses the * 45 case, and uses imul, while clang uses 2 leas. Guess it's time to file some missed-optimization bug reports. If 2 LEAs are better than 1 IMUL, they should be used wherever possible.

Older clang (3.7 and older) uses imul unless a single lea will do the trick. I haven't looked up the changelog to see if they did benchmarks to decide to favour latency over throughput.


Related: Using LEA on values that aren't addresses / pointers? canonical answer about why LEA uses memory-operand syntax and machine encoding, even though it's a shift+add instruction (and runs on an ALU, not AGU, in most modern microarchitectures.)

Bish answered 29/9, 2017 at 3:8 Comment(2)
"clang will normally use 2 lea instructions instead of an imul" I've made the opposite observation. From what I've seen, Clang tends to prefer emitting a single IMUL over any alternative sequence of instructions. I thought about reporting it as an optimization defect a while back, actually, but then I decided it just doesn't matter. The performance impact is virtually immeasurable. :-)Kimmy
@CodyGray: that changed in clang3.7 or something. I guess they decided to favour latency over throughput. But yes, older clang did prefer imul unless it could use a single lea.Bish

© 2022 - 2024 — McMap. All rights reserved.