The way PLT usage is specified in the System V ABI (and implemented in practice) is schematically something like this:
# A call from somewhere in code is into a PLT slot
# (In reality the target is not an absolute address; on x64 it is typically encoded rip-relative)
0x500:
call 0x1000
...
0x1000:
.PLT1: jmp [0x2000] # the slot for f in the binary's GOT
pushq $index_f
jmp .PLT0
...
0x2000:
# initially points back into the PLT slot (at the pushq) to call the lazy-binding routine:
.GOT1: 0x1006 # the jmp [mem] above is 6 bytes long
# but after that is called:
0x3000 # the address of the real implementation of f
...
0x3000:
f: ....
My question is: isn't the 1st jmp in the PLT slot redundant? Couldn't this work with an indirect call into the GOT instead? For example:
0x500:
call [0x2000]
...
0x1000:
.PLT1: pushq $index_f
jmp .PLT0
...
0x2000:
# initially points back to .PLT1, which calls the lazy-binding routine:
.GOT1: 0x1000
# but after that is called:
0x3000 # the address of the real implementation of f
...
0x3000:
f: ....
This might have marginal performance benefits, but the reason I'm asking is a recent scramble in the linker/ELF community to find extra bytes in a 16-byte PLT slot to accommodate Intel IBT (the search failed and resulted in an extra .plt.sec indirection; see 1, 2).
[...] push+jmp with a call if the resolver looked at the return address to figure out which function it is. – Pecan
[...] call+jmp equivalent to calling the jmp destination? (2) You can't replace push+jmp with call, because after resolution the resolver calls f and you want its ret to return to the original call site. – Tucker
[...] call is in the original caller, the PLT should just jmp. 2) You can if the resolver pops off the return address and uses that to determine which function it is. Also, the resolver will not call f either, it will jump to it (or if it does, then it does a ret afterwards). – Pecan
[...] jmp [got] in case anybody needs a function pointer. – Pecan
gcc -fno-plt will put call [rip + foo@GOTPCREL] into the caller so no separate jmp is needed. But if you do have a PLT, it needs to jmp to the target function for calls after the initial one. (After lazy resolving. Or for early binding but still using the PLT, the GOT entry will be correct even before the first call so only the jmp [mem] part ever executes, not the push/jmp.) – Keitel
-fno-plt disables lazy binding entirely; that is not my intention. It seems to me lazy binding could work with the hypothetical scheme above: (1) the call in code is call [rip+foo@GOTPCREL], (2) the GOT entry rip+foo@GOTPCREL initially contains the address of foo@PLT, (3) foo@PLT sets arguments and calls the resolver, which overwrites the GOT entry with the address of the real foo, (4) future indirect calls through the GOT, call [rip+foo@GOTPCREL], would call foo's implementation. Why is the jmp needed? – Tucker
[...] clang --version), so if you're going to change the traditional mechanism, -fno-plt style is a good choice. – Keitel
[...] jmp/call, so every call site would need to use something extra like call [ebx+puts@GOT] or whatever the right @thing is, after setting up EBX as a GOT pointer in that function. (Which it already needs for accessing global variables.) Also, the PLT itself needs a position-independent way to access the GOT. (Traditionally, lazy dynamic linking rewrote a direct jmp rel32 in the PLT, not GOT data.) – Keitel
[...] auto fptr = &puts; function pointers. Perhaps just do early binding for those, like now when compiling a PIE, so later calls don't go through the PLT, and code that wants the function pointer just loads directly from the GOT entry. – Keitel
[...] .got.plt absolute addresses to be initialized to point into the middle of each PLT entry. So that's not something that would get worse with your modification. I think PLT entries are usually a fixed size, but I forget if it's normally a power of 2 so they're always aligned. Still, saving space might get them down to 8 bytes. And if only used on the first call, they can be packed without caring about alignment. – Keitel
[...] endbr64. That would be a showstopper for your proposal, since the first call would be an indirect jump/call to the PLT, which doesn't start with endbr64. Although I guess you'd have room for an endbr64, since yours wouldn't start with jmp [rip+rel32] as the first instruction. (Thanks for including those ABI discussion links.) I guess in the current design, early binding for functions whose address is taken makes sure you don't have an indirect call to a PLT entry (without endbr). – Keitel