Why is the `jmp` at the start of the PLT stub needed?

The way PLT usage is specified in the SystemV ABI (and implemented in practice) is schematically something like this:

# A call from somewhere in code is into a PLT slot
# (In reality the target is not an absolute address; on x86-64 it is typically encoded relative to rip)
0x500:   
          call 0x1000   
...

0x1000:
   .PLT1: jmp [0x2000]  # the slot for f in the binary's GOT
          pushq $index_f
          jmp .PLT0
...
0x2000: 
# initially jumps back to .PLT to call the lazy-binding routine:
   .GOT1: 0x1005
# but after that is called:
          0x3000   # the address of the real implementation of f
...
0x3000:
     f:  ....

My question is:

isn't the 1st jmp in the PLT slot redundant? Couldn't this work with an indirect call into the GOT instead? For example:

0x500:   
          call [0x2000]
...

0x1000:
   .PLT1: pushq $index_f
          jmp .PLT0
...
0x2000: 
# initially points at the start of .PLT1, which calls the lazy-binding routine:
   .GOT1: 0x1000
# but after that is called:
          0x3000   # the address of the real implementation of f
...
0x3000:
     f:  ....
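
To make the control flow of this hypothetical scheme concrete, here is a toy model in Python. It is only a sketch: the names `got`, `plt_stub_f`, `resolver` and `real_f` are made up for illustration, and plain function references stand in for addresses. It shows the call site going through the GOT slot, the slot initially holding the stub, and the resolver patching the slot so later calls skip the stub.

```python
# Toy model of the hypothetical scheme: the call site calls through the GOT
# slot directly, and the slot initially points at the PLT stub.
# All names here are illustrative; nothing below is a real linker API.

calls = []  # trace of what ran, in order


def real_f(x):
    """Stands in for the real implementation of f (0x3000 in the listing)."""
    calls.append("f")
    return x + 1


SYMBOLS = {0: real_f}  # relocation index -> resolved target


def resolver(index, *args):
    """Stands in for .PLT0: bind the symbol, patch the GOT, then run it."""
    calls.append("resolver")
    target = SYMBOLS[index]
    got[index] = target      # overwrite the GOT slot (the lazy-binding step)
    return target(*args)     # a jmp in the real thing, so f returns to the caller


def plt_stub_f(*args):
    """Stands in for .PLT1: pushq $index_f; jmp .PLT0."""
    calls.append("plt_stub")
    return resolver(0, *args)


got = {0: plt_stub_f}  # the slot for f initially holds its PLT stub


def call_site(x):
    """Stands in for `call [0x2000]`: an indirect call through the GOT slot."""
    return got[0](x)


first = call_site(1)   # goes through the stub and the resolver
second = call_site(2)  # goes straight to f
print(first, second, calls)
```

In this model the stub runs only once, which is the property the question relies on.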

This might have marginal performance benefits, but the reason I'm asking is a recent scramble in the linkers/ELF community to come up with extra bytes in a 16-byte PLT slot to accommodate Intel IBT (the search failed and resulted in an extra .plt.sec indirection: 1, 2).

Tucker answered 27/8, 2023 at 13:27 Comment(16)
You must jump to the real function not call it. You could however replace the resolving push+jmp with a call if the resolver looked at the return address to figure out which function it is.Pecan
@Pecan (1) Isn't call+jmp equivalent to calling the jmp destination? (2) You can't replace push+jmp with call, because after resolution the resolver calls f and you want its ret to return to the original call site.Tucker
1) The call is in the original caller, the PLT should just jmp 2) you can if the resolver pops off the return address and uses that to determine which function it is. Also the resolver will not call f either, it will jump to it (or if it does, then it does a ret afterwards).Pecan
@Pecan Note that in my hypothetical scheme the call is an indirect call into the address in the GOT, not to the PLT. I still can't see why a jmp is necessary.Tucker
Ahha, you mean the original call should be indirect via GOT, okay.Pecan
You still need code somewhere that does jmp [got] in case anybody needs a function pointer.Pecan
@Pecan I thought so too, but learnt here on SO (#76243794) that this isn't needed today. When a func address is taken the function is bound early, and the code takes the address from the (resolved) got slot. (I think the SystemV ABI spec is out of date there)Tucker
Yeah, gcc -fno-plt will put call [rip + foo@GOTPCREL] into caller so no separate jmp is needed. But if you do have a PLT, it needs to jmp to the target function for calls after the initial one. (After lazy resolving. Or for early binding but still using the PLT, the GOT entry will be correct even before the first call so only the jmp [mem] part ever executes, not the push/jmp.)Keitel
@PeterCordes -fno-plt disables lazy binding entirely - that is not my intention. Seems to me lazy-binding could work with the hypothetical scheme above: (1) the call in code is call [rip+foo@GOTPCREL], (2) the GOT entry rip+foo@GOTPCREL initially contains the address of foo@PLT, (3) foo@PLT sets arguments and calls the resolver which overwrites the GOT entry with the address of real foo, (4) on future indirect calls through the GOT call [rip+foo@GOTPCREL] would call foo's implementation. Why is the jmp needed?Tucker
Hmm, perhaps that could work. It would need run-time init of each GOT entry with the right absolute address of the PLT stub, each one being different, although perhaps just adding the same constant to what's already in each of them could work, if you init them with a relative address. Also, non-lazy binding performs better in many cases, for programs that aren't too short-lived (e.g. not stuff like clang --version), so if you're going to change the traditional mechanism, -fno-plt style is a good choice.Keitel
Also keep in mind that the traditional mechanism dates back to i386 (or to Unix on other platforms?). i386 didn't have x86-64's RIP-relative data addressing, only relative direct jmp/call, so every call site would need something extra like call [ebx+puts@GOT] or whatever the right @thing is, after setting up EBX as a GOT pointer in that function. (Which it already needs for accessing global variables). Also, the PLT itself needs a position-independent way to access the GOT. (Traditionally, lazy dynamic linking rewrote a direct jmp rel32 in the PLT, not GOT data.)Keitel
You'd want a mechanism for handling auto fptr = &puts; function pointers. Perhaps just do early binding for those, like now when compiling a PIE, so later calls don't go through the PLT, and code that wants the function pointer just loads directly from the GOT entry.Keitel
Thinking about this some more, the current PLT design already requires those .got.plt absolute addresses to be initialized to point into the middle of each GOT entry. So that's not something that would get worse with your modification. I think PLT entries are usually a fixed size, but I forget if it's normally a power of 2 so they're always aligned. Still, saving space might get them down to 8 bytes. And if only used on the first call, they can be packed without caring about alignment.Keitel
@PeterCordes perhaps you meant "the middle of each PLT entry"? If so - yes, today GOT slots are initialized to the address of the 2nd instruction in the matching PLT entry, in my modification they'd be initialized to the 1st. Traditional PLT slots are 16 bytes and I wasn't interested in cutting this down - just better use them to accommodate for intel IBT (see links in the end of question)Tucker
@PeterCordes your other comment is also exactly how it is handled today: taking a function address forces it to be early bound. See my comment to Jester above.Tucker
Yes, typo, I meant middle of each PLT entry. Oh, right, indirect jumps that don't target an endbr64. That would be a showstopper for your proposal, since the first call would be an indirect jump/call to the PLT which doesn't start with endbr64. Although I guess you'd have room for an endbr64 since yours wouldn't start with jmp [rip+rel32] as the first instruction. (Thanks for including those ABI discussion links.) I guess in the current design, early binding for functions whose address is taken makes sure you don't have an indirect call to a PLT entry (without endbr).Keitel
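
The byte budget behind this exchange can be checked with simple arithmetic. The sketch below assumes the usual x86-64 encodings (`endbr64` is 4 bytes, `jmp [rip+disp32]` is 6, `push imm32` is 5, `jmp rel32` is 5) and the 16-byte slot size mentioned above; the variable names are only for illustration.

```python
# Instruction lengths in bytes, assuming the usual x86-64 encodings.
ENDBR64 = 4      # f3 0f 1e fa
JMP_RIP_MEM = 6  # ff 25 disp32   (jmp [rip+disp32], the leading PLT jmp)
PUSH_IMM32 = 5   # 68 imm32       (pushq $index_f)
JMP_REL32 = 5    # e9 rel32       (jmp .PLT0)
PLT_SLOT = 16    # traditional PLT slot size

# Traditional slot: jmp [got]; push index; jmp .PLT0
traditional = JMP_RIP_MEM + PUSH_IMM32 + JMP_REL32

# Traditional slot with an endbr64 landing pad in front
traditional_ibt = ENDBR64 + traditional

# Hypothetical slot without the leading jmp, with endbr64 in front
hypothetical_ibt = ENDBR64 + PUSH_IMM32 + JMP_REL32

for name, size in [
    ("traditional", traditional),
    ("traditional + endbr64", traditional_ibt),
    ("hypothetical + endbr64", hypothetical_ibt),
]:
    verdict = "fits" if size <= PLT_SLOT else "does not fit"
    print(f"{name}: {size} bytes, {verdict} in {PLT_SLOT}")
```

Under these assumptions the traditional slot is already full at 16 bytes, so adding `endbr64` overflows it, while the slot without the leading `jmp` leaves room for it.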

The basic issue is that the original call (at 0x500) is being generated by the compiler, and at that point, the compiler does not know whether this symbol will eventually be in this dynamic object or not. So it generates a simple call (direct, PC relative) as that is the most efficient for the common case of a local call within a dynamic object.

It is not until the linker runs that we know if this is a symbol in another dynamic object or a globally visible one in this object (that might be overridden) or a local function call. For the latter case it will just make it a direct call, but for the former cases, it will create a PLT entry for the symbol and make the call go to the PLT entry.

Your suggestion would save a jump, but would require knowing at compile time for every call whether it needs a PLT entry or not, or would require switching between a direct and indirect call at link time based on whether the PLT was needed or not. On x86, direct and indirect calls are different sizes, so being able to change would be pretty tricky.
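
As a rough illustration of the size mismatch, the sketch below builds the two x86-64 call encodings as byte strings with zeroed displacements and compares their lengths. The third form, a direct call padded with an address-size prefix, is included on the assumption that it is the padding trick a linker can use to keep the length unchanged when turning an indirect call into a direct one; the variable names are illustrative.

```python
# x86-64 call encodings with a zero placeholder for the 32-bit displacement.
ZERO32 = (0).to_bytes(4, "little")

direct_call = bytes([0xE8]) + ZERO32          # call rel32
indirect_call = bytes([0xFF, 0x15]) + ZERO32  # call [rip+disp32]
padded_direct = bytes([0x67]) + direct_call   # addr32 call rel32 (prefix is a no-op here)

print(len(direct_call), len(indirect_call), len(padded_direct))
```

The direct form is one byte shorter than the indirect one, so swapping between them in place needs either padding or a shift of the following code.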

Audie answered 28/8, 2023 at 4:17 Comment(3)
Calls from a shared library are generated through the PLT even for functions in the same library, by default. Symbols with “default” visibility are interposable, and interposition can happen only at runtime.Tucker
gcc -fno-plt would have the same problem (of unnecessary indirection for symbols that are found in the linker inputs). It's solved by "relaxing" call [rip+rel32] to a32 call rel32 direct calls with a dummy address-size prefix that has no effect on how it executes. (But is needed for the instruction to take the same space in the machine code without inserting a nop.) There's a special relocation type for "relaxable" calls. (example in Can't call C standard library function on 64-bit Linux from assembly (yasm) code - NASM uses non-relaxable :/)Keitel
@OfekShilon: But you don't want that most of the time, and even when compiling with -fPIE or with -fno-pie -no-pie (where even by default GCC will make direct calls to other functions), GCC doesn't know whether an undefined symbol will be found in another .o or only in a .so shared object. GCC handles this by either letting the linker rewrite calls to go through the PLT if needed (traditional -fno-pie), or by having the linker relax call foo@plt to call foo (-fPIE without -fno-plt, or visibility=hidden). Or see my previous comment re: relaxing call [rip+rel32].Keitel
