Why does the PLT exist in addition to the GOT, instead of just using the GOT?

I understand that in a typical ELF binary, functions get called through the Procedure Linkage Table (PLT). The PLT entry for a function usually contains a jump to a Global Offset Table (GOT) entry. This entry will first reference some code to load the actual function address into the GOT, and contain the actual function address after the first call (lazy binding).

To be precise, before lazy binding the GOT entry points back into the PLT, to the instructions following the jump into the GOT. These instructions will usually jump to the head of the PLT, from where some binding routine gets called which will then update the GOT entry.

Now I'm wondering why there are two indirections (calling into the PLT and then jumping to an address from the GOT), instead of just sparing the PLT and calling the address from the GOT directly. It looks like this could save a jump and the complete PLT. You would of course still need some code calling the binding routine, but this can be outside the PLT.

Is there anything I am missing? What is/was the purpose of an extra PLT?


Update: As suggested in the comments, I created some (pseudo-) code ASCII art to further explain what I'm referring to:

This is the situation, as far as I understand it, in the current PLT scheme before lazy binding: (Some indirections between the PLT and printf are represented by "...".)

Program                PLT                                 printf
+---------------+      +------------------+                +-----+
| ...           |      | push [0x603008]  |<---+       +-->| ... |
| call j_printf |--+   | jmp [0x603010]   |----+--...--+   +-----+
| ...           |  |   | ...              |    |
+---------------+  +-->| jmp [printf@GOT] |-+  |
                       | push 0xf         |<+  |
                       | jmp 0x400da0     |----+
                       | ...              |
                       +------------------+

… and after lazy binding:

Program                PLT                       printf
+---------------+      +------------------+      +-----+
| ...           |      | push [0x603008]  |  +-->| ... |
| call j_printf |--+   | jmp [0x603010]   |  |   +-----+
| ...           |  |   | ...              |  |
+---------------+  +-->| jmp [printf@GOT] |--+
                       | push 0xf         |
                       | jmp 0x400da0     |
                       | ...              |
                       +------------------+

In my imaginary alternative scheme without a PLT, the situation before lazy binding would look like this: (I kept the code in the "Lazy Binding Table" similar to the one from the PLT. It could also look different; I don't care.)

Program                    Lazy Binding Table                printf
+-------------------+      +------------------+              +-----+
| ...               |      | push [0x603008]  |<-+       +-->| ... |
| call [printf@GOT] |--+   | jmp [0x603010]   |--+--...--+   +-----+
| ...               |  |   | ...              |  |
+-------------------+  +-->| push 0xf         |  |
                           | jmp 0x400da0     |--+
                           | ...              |
                           +------------------+

Now after the lazy binding, one wouldn't use the table anymore:

Program                   Lazy Binding Table        printf
+-------------------+     +------------------+      +-----+
| ...               |     | push [0x603008]  |  +-->| ... |
| call [printf@GOT] |--+  | jmp [0x603010]   |  |   +-----+
| ...               |  |  | ...              |  |
+-------------------+  |  | push 0xf         |  |
                       |  | jmp 0x400da0     |  |
                       |  | ...              |  |
                       |  +------------------+  |
                       +------------------------+
Whisk answered 27/3, 2017 at 14:18 Comment(0)
A
36

The problem is that replacing call printf@PLT with call [printf@GOTPLT] requires the compiler to know that the function printf exists in a shared library and not in a static library (or even in just a plain object file). The linker can change call printf into call printf@PLT, jmp printf into jmp printf@PLT, or even mov eax, printf into mov eax, printf@PLT, because all it's doing is changing a relocation based on the symbol printf into a relocation based on the symbol printf@PLT. The linker can't change call printf into call [printf@GOTPLT] because it doesn't know from the relocation whether it's a CALL or JMP instruction or something else entirely. Without knowing whether it's a CALL instruction or not, it doesn't know whether it should change the opcode from a direct CALL to an indirect CALL.

However, even if there were a special relocation type that indicated that the instruction was a CALL, you would still have the problem that a direct call instruction is 5 bytes long but an indirect call instruction is 6 bytes long. The compiler would have to emit code like nop; call printf@CALL to give the linker room to insert the additional byte needed, and it would have to do so for all calls to any global function. It would probably end up being a net performance loss because of all the extra, unnecessary NOP instructions.

Another problem is that on 32-bit x86 targets the PLT entries are relocated at runtime. The indirect jmp [xxx@GOTPLT] instructions in the PLT don't use relative addressing like the direct CALL and JMP instructions, and since the address of xxx@GOTPLT depends on where the image was loaded in memory, the instruction needs to be fixed up to use the correct address. Grouping all these indirect JMP instructions together in one .plt section means that a much smaller number of virtual memory pages need to be modified. Each 4K page that's modified can no longer be shared with other processes, so when the instructions that need to be modified are scattered all over memory, a much larger part of the image has to be unshared.

Note that this latter issue is only a problem with shared libraries and position-independent executables on 32-bit x86 targets. Traditional executables can't be relocated, so there's no need to fix up the @GOTPLT references, while on 64-bit x86 targets RIP-relative addressing is used to access the @GOTPLT entries.

Because of that last point, new versions of GCC (6.1 or later) support the -fno-plt flag. On 64-bit x86 targets this option causes the compiler to generate call printf@GOTPCREL[rip] instructions instead of call printf instructions. However, it appears to do this for any call to a function that isn't defined in the same compilation unit, that is, for any function that it can't be sure isn't defined in a shared library. That would mean that indirect jumps would also be used for calls to functions defined in other object files or static libraries. On 32-bit x86 targets the -fno-plt option is ignored unless compiling position-independent code (-fpic or -fpie), where it results in call printf@GOT[ebx] instructions being emitted. In addition to generating unnecessary indirect jumps, this also has the disadvantage of requiring the allocation of a register for the GOT pointer, though most functions would need it allocated anyway.

Finally, Windows is able to do what you suggest by declaring symbols in header files with the "dllimport" attribute, indicating that they exist in DLLs. This way the compiler knows whether to generate a direct or an indirect call instruction when calling the function. The disadvantage of this is that the symbol has to exist in a DLL, so if this attribute is used you can't decide after compilation to link with a static library instead.

See also Drepper's How to write a shared library paper; it explains this quite well and in detail (for Linux).

Alvie answered 28/3, 2017 at 19:27 Comment(2)
IIRC, the linker can relax an indirect call myfunc@GOTPCREL[rip] into call myfunc if it does find myfunc is available directly to be linked into the same library. (And IIRC it uses a segment override prefix to pad the call rel32 to fill the 6-byte slot.) Basilbasilar
IIRC, assuming indirect for any function call not in the same compilation unit doesn't happen for -fPIE or for non-PIE executables, only with -fPIC. (Related: there are options to set default symbol visibility to control whether symbol interposition has to be assumed or not.) Basilbasilar
H
4

Now I'm wondering why there are two indirections (calling into the PLT and then jumping to an address from the GOT),

First of all, there are two calls but just one indirection (the call to the PLT stub is direct).

instead of just sparing the PLT and calling the address from the GOT directly.

In case you do not need lazy binding, you can use -fno-plt which bypasses the PLT.

But if you wanted to keep it, you'd need some stub code to check whether the symbol has been resolved and branch accordingly. Now, to facilitate branch prediction, this stub code has to be duplicated for every called symbol and voila, you've re-invented the PLT.

Hylophagous answered 27/3, 2017 at 14:46 Comment(6)
1. By "direct" you mean the call target is static and not read from memory? This is of course true, but there still is an unnecessary jump in addition to the call (one call and one jump in total). An unconditional jump might not be a big deal on modern x86, but that might not be true for all architectures and it's definitely not beneficial for code cache locality. Whisk
2. My "re-invented" PLT is similar to the original one in that it might contain binding stubs for all functions. But the important difference to me is that not every call has to go from the PLT to the GOT (and back once). Instead, it goes directly to the GOT and back to the "re-invented" PLT for the first call. Whisk
@Whisk "By direct you mean the call target is static and not read from memory" - I don't just mean it, this is the definition of a direct call. Direct calls certainly have their cost, but it's (much) lower than that of indirect ones, so it's important to be precise. Hylophagous
@Whisk "not every call has to go from the PLT to the GOT" - but in your case, after fetching the address from the GOT and jumping, the code has to jump again in the stub, depending on whether the address was resolved or not. Note that the second jump would be a conditional indirect jump, which is much heavier than a plain direct jump to the PLT. So your approach trades 1 direct and 1 indirect jump for 2 indirect jumps (1 being conditional). In case you had something else in mind, I suggest you add pseudo-code to the question. Hylophagous
No, I don't want to use the stub after lazy binding. I updated the question with some ASCII art images to explain what I have in mind. Whisk
@Whisk Nice catch, but it won't work if the initial GOT address is stored in a register (e.g. when the first call to some math function is made from a long-running loop): the app will keep using the slow path even after the symbol is resolved. Hylophagous
