Using LEA on values that aren't addresses / pointers?

A

4

15

I was trying to understand how address-computation instructions work, especially leaq. Then I got confused when I saw examples using leaq to do arithmetic. For example, the following C code,

long m12(long x) {
return x*12;
}

In assembly,

leaq (%rdi, %rdi, 2), %rax
salq $2, %rax

If my understanding is right, leaq should move whatever address (%rdi, %rdi, 2) evaluates to, which should be 2*%rdi+%rdi, into %rax. What confuses me is this: since the value x is stored in %rdi, which is just a memory address, why does multiplying %rdi by 3 and then left-shifting this memory address by 2 equal x times 12? When we multiply %rdi by 3, don't we jump to another memory address that doesn't hold the value x?

Avent answered 6/10, 2017 at 1:36 Comment(2)

@Johan, I closed stackoverflow.com/questions/13517083/… as a duplicate of this, since this has answers that go into more detail to clear up the newbie confusion over using LEA with non-pointers. – Griselgriselda 6/10, 2017 at 21:32

Related: What's the purpose of the LEA instruction? is mostly asking about lea vs. mov, which is approaching the same question from the opposite direction. The answers there all talk about using it for addresses/pointers, or else just say "it's a stupid name for a shift-and-add instruction", both of which only tell half the story. – Griselgriselda 6/10, 2017 at 21:41

L

18

leaq doesn't have to operate on memory addresses. It computes an address but doesn't actually read from the result, so until a mov or the like tries to use it, it's just an esoteric way to add one number to 1, 2, 4 or 8 times another number (or the same number in this case). It's frequently "abused"^† for mathematical purposes, as you see. 2*%rdi+%rdi is just 3 * %rdi, so it's computing x * 3 without involving the multiplier unit on the CPU.

Similarly, left shifting, for integers, doubles the value for every bit shifted (every zero added to the right), thanks to the way binary numbers work (the same way in decimal numbers, adding zeroes on the right multiplies by 10).

So this is abusing the leaq instruction to accomplish multiplication by 3, then shifting the result to achieve a further multiplication by 4, for a final result of multiplying by 12 without ever actually using a multiply instruction (which it presumably believes would run more slowly, and for all I know it could be right; second-guessing the compiler is usually a losing game).

^†: To be clear, it's not abuse in the sense of misuse, just using it in a way that doesn't clearly align with the implied purpose you'd expect from its name. It's 100% okay to use it this way.

Lilytrotter answered 6/10, 2017 at 1:45 Comment(5)

So, if we pass in x as 1. Assume register is 4 bit, %rdi will be 0001 or 0x1 ? (If we ignore type long) – Avent 6/10, 2017 at 2:13

I'd argue that's not an abuse of LEA, copy-and-add is one of the intended purposes of exposing the CPU's address-generation ability through the lea instruction. See my answer. – Griselgriselda 6/10, 2017 at 2:43

@ZhiyuanRuan yes, types like int/short/long/... are in common x86-64 ABIs passed by value, the value itself is in register when calling some function in ABI conforming way. No memory address is involved in your original assembly from compiler. – Echoechoic 6/10, 2017 at 4:33

@PeterCordes: "abuse" is mostly related to the terminology used to describe the instruction (load effective address); it's designed for address generation, but registers are registers, and the math is the same either way. I'm not saying it's bad to use lea, just not what the instruction's name would lead you to believe was the purpose. – Lilytrotter 6/10, 2017 at 10:32

That's where I disagree. I think of it as being designed to expose the address-generation functionality of the hardware for use for arbitrary purposes. That's how compilers think of it, and so should humans. The naming is just connected to the fact that it uses addressing-mode syntax and machine-encoding, not the "intended" purpose. (I don't really know what Intel had in mind, as I said in my answer, but I think explaining it to beginners this way makes it sound normal to use LEA, because it is normal. That's why I dislike the term "abuse", but that's a fair justification for using it.) – Griselgriselda 6/10, 2017 at 10:40

G

41

lea (see Intel's instruction-set manual entry) is a shift-and-add instruction that uses memory-operand syntax and machine encoding. This explains the name, but it's not the only thing it's good for. It never actually accesses memory, so it's like using & in C.

See for example How to multiply a register by 37 using only 2 consecutive leal instructions in x86?

In C, it's like uintptr_t foo = (uintptr_t) &arr[idx]. Note the & to give you arr + idx (scaling for the object size of arr since this is C not asm). In C, this would be abuse of the language syntax and types, but in x86 assembly pointers and integers are the same thing. Everything is just bytes, and it's up to the program to put instructions in the right order to get useful results.

Effective address is a technical term in x86: it means the "offset" part of a seg:off logical address, especially when a base_reg + index*scale + displacement calculation was needed. e.g. the rax + (rcx<<2) in a %gs:(%rax,%rcx,4) addressing mode. (But EA still applies to %rdi for stosb, or the absolute displacement for movabs load/store, or other cases without a ModRM addr mode). Its use in this context doesn't mean it must be a valid / useful memory address, it's telling you that the calculation doesn't involve the segment base so it's not calculating a linear address. (Adding the seg base would make it unusable for actual address math in a non-flat memory model.)

The original designer / architect of 8086's instruction set (Stephen Morse) might or might not have had pointer math in mind as the main use-case, but modern compilers think of it as just another option for doing arithmetic on pointers / integers, and so should humans.

(Note that 16-bit addressing modes don't include shifts, just [BP|BX] + [SI|DI] + disp8/disp16, so LEA wasn't as useful for non-pointer math before 386. See this Q&A for more about 32/64-bit addressing modes, although that answer uses Intel syntax like [rax + rdi*4] instead of the AT&T syntax used in this question. x86 machine code is the same regardless of what syntax you use to create it.)

Maybe the 8086 architects did simply want to expose the address-calculation hardware for arbitrary uses because they could do it without using a lot of extra transistors. The decoder already has to be able to decode addressing modes, and other parts of the CPU have to be able to do address calculations. Putting the result in a register instead of using it with a segment-register value for memory access doesn't take many extra transistors. Ross Ridge confirms that LEA on original 8086 reuses the CPUs effective-address decoding and calculation hardware.

Note that most modern CPUs run LEA on the same ALUs as normal add and shift instructions. They have dedicated AGUs (address-generation units), but only use them for actual memory operands. In-order Atom is one exception; LEA runs earlier in the pipeline than the ALUs: inputs have to be ready sooner, but outputs are also ready sooner. Out-of-order execution CPUs (all modern x86) don't want LEA to interfere with actual loads/stores so they run it on an ALU.

lea has good latency and throughput, but not as good throughput as add or mov r32, imm32 on most CPUs, so only use lea when you can save an instruction with it instead of add. (See Agner Fog's x86 microarch guide and asm optimization manual and https://uops.info/)
Ice Lake improved on that for Intel, now able to run LEA on all four ALU ports.

Rules for which kinds of LEA are "complex", running on fewer of the ports that can handle it, vary by microarchitecture. e.g. 3-component (two + operations) is the slower case on SnB-family, having a scaled index is the lower-throughput case on Ice Lake. Alder Lake E-cores (Gracemont) are 4/clock, but 1/clock when there's an index at all, and 2-cycle latency when there's an index and displacement (whether or not there's a base reg). Zen is slower when there's a scaled index or 3 components. (2c latency and 2/clock down from 1c and 4/clock).

The internal implementation is irrelevant, but it's a safe bet that decoding the operands to LEA shares transistors with decoding addressing modes for any other instruction. (So there is hardware reuse / sharing even on modern CPUs that don't execute lea on an AGU.) Any other way of exposing a multi-input shift-and-add instruction would have taken a special encoding for the operands.

So 386 got a shift-and-add ALU instruction for "free" when it extended the addressing modes to include scaled-index, and being able to use any register in an addressing mode made LEA much easier to use for non-pointers, too.

x86-64 got cheap access to the program counter (instead of needing to read what call pushed) "for free" via LEA because it added the RIP-relative addressing mode, making access to static data significantly cheaper in x86-64 position-independent code than in 32-bit PIC. (RIP-relative does need special support in the ALUs that handle LEA, as well as the separate AGUs that handle actual load/store addresses. But no new instruction was needed.)

It's just as good for arbitrary arithmetic as for pointers, so it's a mistake to think of it as being intended for pointers these days. It's not an "abuse" or "trick" to use it for non-pointers, because everything's an integer in assembly language. It has lower throughput than add, but it's cheap enough to use almost all the time when it saves even one instruction. But it can save up to three instructions:

;; Intel syntax.
lea  eax, [rdi + rsi*4 - 8]   ; 3 cycle latency on Intel SnB-family
                              ; 2-component LEA is only 1c latency

 ;;; without LEA:
mov  eax, esi             ; maybe 0 cycle latency, otherwise 1
shl  eax, 2               ; 1 cycle latency
add  eax, edi             ; 1 cycle latency
sub  eax, 8               ; 1 cycle latency

On some AMD CPUs, even a complex LEA is only 2 cycle latency, but the 4-instruction sequence would be 4 cycle latency from esi being ready to the final eax being ready. Either way, this saves 3 uops for the front-end to decode and issue, and that take up space in the reorder buffer all the way until retirement.

lea has several major benefits, especially in 32/64-bit code where addressing modes can use any register and can shift:

non-destructive: output in a register that isn't one of the inputs. It's sometimes useful as just a copy-and-add like lea 1(%rdi), %eax or lea (%rdx, %rbp), %ecx.
can do 3 or 4 operations in one instruction (see above).
Math without modifying EFLAGS, can be handy after a test before a cmovcc. Or maybe in an add-with-carry loop on CPUs with partial-flag stalls.
x86-64: position independent code can use a RIP-relative LEA to get a pointer to static data.

7-byte lea foo(%rip), %rdi is slightly larger and slower than mov $foo, %edi (5 bytes), so prefer mov r32, imm32 in position-dependent code on OSes where symbols are in the low 32 bits of virtual address space, like Linux. You may need to disable the default PIE setting in gcc to use this.

In 32-bit code, mov edi, OFFSET symbol is similarly shorter and faster than lea edi, [symbol]. (Leave out the OFFSET in NASM syntax.) RIP-relative isn't available and addresses fit in a 32-bit immediate, so there's no reason to consider lea instead of mov r32, imm32 if you need to get static symbol addresses into registers.

Other than RIP-relative LEA in x86-64 mode, all of these apply equally to calculating pointers vs. calculating non-pointer integer add / shifts.

See also the x86 tag wiki for assembly guides / manuals, and performance info.

Operand-size vs. address-size for x86-64 lea

See also Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted?. 64-bit address size and 32-bit operand size is the most compact encoding (no extra prefixes), so prefer lea (%rdx, %rbp), %ecx when possible instead of 64-bit lea (%rdx, %rbp), %rcx or 32-bit lea (%edx, %ebp), %ecx.

x86-64 lea (%edx, %ebp), %ecx is always a waste of an address-size prefix vs. lea (%rdx, %rbp), %ecx, but 64-bit address / operand size is obviously required for doing 64-bit math. (Agner Fog's objconv disassembler even warns about useless address-size prefixes on LEA with a 32-bit operand-size.)

Except maybe on Ryzen, where Agner Fog reports that 32-bit operand size lea in 64-bit mode has an extra cycle of latency. I don't know if overriding the address-size to 32-bit can speed up LEA in 64-bit mode if you need it to truncate to 32-bit.

This question is a near-duplicate of the very-highly-voted What's the purpose of the LEA instruction?, but most of the answers explain it in terms of address calculation on actual pointer data. That's only one use.

Griselgriselda answered 6/10, 2017 at 2:25 Comment(5)

Would you agree that in the LEA manual page you linked to, the second comment in the following text (under "operation") is a copy/paste error:

ELSE IF OperandSize = 32 and AddressSize = 64
    THEN
        temp ← EffectiveAddress(SRC); (* 64-bit address *)
        DEST ← temp[0:31]; (* 16-bit address *)
FI;

because temp[0:31] is a 32 bit address, not a 16 bit address? – Garlic 6/2, 2022 at 16:32

Yes, should be 32-bit address. Also, there seems to be a missing case for 32-bit address-size, 64-bit operand-size like lea rax, [edi - 1], if they're going to explicitly catalogue every other combo. (That one is fully useless, though, because it's identical to lea eax, [rdi - 1] (which doesn't require any prefixes), because both are zero-extension, not sign.) – Griselgriselda 6/2, 2022 at 16:43

OK, thanks. That is a bit embarrassing because the widths are what it's all about, even if it's a comment only. I just checked -- the passage is copied verbatim from the Intel manual (the link is too long for a comment but found on intel.com/content/www/us/en/developer/articles/technical/…). The error is present there, too, on vol. 2A, page 3-581 – Garlic 6/2, 2022 at 17:5

@Peter-ReinstateMonica: Yes, felixcloutier.com/x86 and similar sites like github.com/HJLebbink/asm-dude/wiki are scraped from Intel's vol.2 PDF manual. (Both of those with similar versions of the same script.) That's not the first bug to be found in Intel's manuals, and not the most serious either. – Griselgriselda 6/2, 2022 at 17:8

Update: Stephen Morse, architect of the 8086 ISA, wrote a book, the 8086 Primer, where he did explain some of the intended uses of various instructions. He describes LEA as having been intended for generating actual addresses, and doesn't mention copy-and-add like lea ax, [bx + 1]. stevemorse.org/8086. Especially with 386 making addressing modes more flexible, IMO it's still more useful to think of LEA as a math instruction first, with actual addressing being just one of its uses. – Griselgriselda 7/4, 2023 at 7:29

M

3

LEA is for calculating the address. It doesn't dereference the memory address.

It should be much more readable in Intel syntax

m12(long):
  lea rax, [rdi+rdi*2]
  sal rax, 2
  ret

So the first line is equivalent to rax = rdi*3. Then the left shift multiplies rax by 4, which results in rdi*3*4 = rdi*12.

Meadowlark answered 6/10, 2017 at 1:45 Comment(0)

S

0

I think the confusion arises because the first operand, (%rdi, %rdi, 2) looks like a memory reference.

From the book Computer Systems: A Programmer's Perspective by Randal Bryant and David O'Hallaron about leaq:

Its first operand appears to be a memory reference, but instead of reading from the designated location, the instruction copies the effective address to the destination.

And here is the relevant part:

This instruction can be used to generate pointers for later memory references. In addition, it can be used to compactly describe common arithmetic operations. For example, if register %rdx contains value x, then the instruction leaq 7(%rdx,%rdx,4), %rax will set register %rax to 5x+7. Compilers often find clever uses of leaq that have nothing to do with effective address computations.

Seringapatam answered 7/4, 2023 at 7:19 Comment(0)
