Why use the Global Offset Table for symbols defined in the shared library itself?

Consider the following simple shared library source code:

library.cpp:

static int global = 10;

int foo()
{
    return global;
}

Compiled with clang using the -fPIC option, it produces the following x86-64 assembly:

foo(): # @foo()
  push rbp
  mov rbp, rsp
  mov eax, dword ptr [rip + global]
  pop rbp
  ret
global:
  .long 10 # 0xa

Since the symbol is defined inside the library, the compiler uses PC-relative addressing, as expected: mov eax, dword ptr [rip + global]

However, if we change static int global = 10; to int global = 10;, giving the symbol external linkage, the resulting assembly is:

foo(): # @foo()
  push rbp
  mov rbp, rsp
  mov rax, qword ptr [rip + global@GOTPCREL]
  mov eax, dword ptr [rax]
  pop rbp
  ret
global:
  .long 10 # 0xa

As you can see, the compiler added a layer of indirection through the Global Offset Table, which seems totally unnecessary in this case, as the symbol is still defined inside the same library (and source file).

If the symbol were defined in another shared library, the GOT would be necessary, but in this case it feels redundant. Why is the compiler still adding this symbol to the GOT?

Note: I believe this question is similar to this one; however, the answer there was not pertinent, perhaps due to a lack of details.

Uprear answered 9/4, 2019 at 7:31 Comment(8)
The fact is that shared library symbols can be redefined by other libraries. So the code could end up using a new symbol in another library. By making it external (i.e. public) you are allowing it to be redefined. I don't remember the exact name of this feature.Dauphine
Isn't that violating the ODR rule?Uprear
I don't remember the exact details but ODR is a C++ thing, while this is a loader mechanism. Each shared lib has only one definition of the symbol. Actually "redefinition" is not the right term but I don't remember the technical one.Dauphine
Ok, found it. A symbol can be interposed.Dauphine
@MargaretBloom: Yup, that's the blog post I was going to link for more about Linux/Unix dynamic linking. Inefficient access to your own globals and functions within a shared library is why you want to set the ELF visibility to hidden if you won't need/want the symbol to participate in symbol interposition, so other libraries that define/use the same name have their own private copy of the symbol definition.Monsignor
@PeterCordes It may be worth writing a short answer about this, with a link to that blog post. It's not that easily found, alas.Dauphine
@MargaretBloom: If I get around to it before someone else does, then yeah maybe :)Monsignor
@PeterCordes (and MargaretBloom) Thanks for the link and explanation, very helpful. If you write it as an answer I will accept it.Uprear

The Global Offset Table serves two purposes. One is to allow the dynamic linker to "interpose" a different definition of the variable from the executable or another shared object. The second is to allow position-independent code to be generated for references to variables on certain processor architectures.

ELF dynamic linking treats the entire process, the executable and all of the shared objects (dynamic libraries), as sharing one single global namespace. If multiple components (executable or shared objects) define the same global symbol, then the dynamic linker normally chooses one definition of that symbol, and all references to that symbol in all components refer to that one definition. (However, ELF dynamic symbol resolution is complex, and for various reasons different components can end up using different definitions of the same global symbol.)

To implement this, when building a shared library the compiler will access global variables indirectly through the GOT. For each variable, an entry in the GOT will be created containing a pointer to the variable. As your example code shows, the compiler will then use this entry to obtain the address of the variable instead of trying to access it directly. When the shared object is loaded into a process, the dynamic linker will determine whether any of the global variables have been superseded by variable definitions in another component. If so, those global variables will have their GOT entries updated to point at the superseding variable.
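
To make the mechanism in the paragraph above concrete, here is a minimal C++ sketch that models the GOT as an ordinary array of pointers. Everything in it (got, kGotSlots, library_global, foo_via_got, bind_got_slot) is a hypothetical name made up for illustration; a real GOT is laid out by the linker and filled in by the dynamic linker at load time, not by user code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the library's own definition of `global`.
static int library_global = 10;

// Hypothetical one-slot model of the GOT: each slot holds the address of a
// variable. A real GOT is laid out by the linker, not declared in source.
constexpr std::size_t kGotSlots = 1;
static int* got[kGotSlots] = { &library_global };

// Roughly what foo() does under -fPIC with default visibility:
// load the variable's address from its GOT slot, then load the value.
int foo_via_got() {
    int* address = got[0];  // like: mov rax, qword ptr [rip + global@GOTPCREL]
    return *address;        // like: mov eax, dword ptr [rax]
}

// Hypothetical stand-in for the dynamic linker binding a slot to whichever
// definition wins symbol resolution (for example, one in the executable).
void bind_got_slot(std::size_t slot, int* definition) {
    assert(slot < kGotSlots);
    got[slot] = definition;
}
```

While the slot still points at library_global, foo_via_got() returns 10. After bind_got_slot(0, &other) it returns whatever other holds, even though the code of foo_via_got is unchanged. That is the property interposition relies on, and it is also why the extra load cannot be dropped for a symbol that may be interposed.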

By using the "hidden" or "protected" ELF visibility attributes, it's possible to prevent a globally defined symbol from being superseded by a definition in another component, and thus remove the need to use the GOT on certain architectures. For example:

extern int global_visible;
extern int global_hidden __attribute__((visibility("hidden")));
static volatile int local;  // volatile, so it's not optimized away

int
foo() {
    return global_visible + global_hidden + local;
}

When compiled with -O3 -fPIC by the x86_64 port of GCC, this generates:

foo():
        mov     rcx, QWORD PTR global_visible@GOTPCREL[rip]
        mov     edx, DWORD PTR local[rip]
        mov     eax, DWORD PTR global_hidden[rip]
        add     eax, DWORD PTR [rcx]
        add     eax, edx
        ret 

As you can see, only global_visible uses the GOT; global_hidden and local don't use it. The "protected" visibility works similarly: it prevents the definition from being superseded but still leaves it visible to the dynamic linker, so it can be accessed by other components. The "hidden" visibility hides the symbol completely from the dynamic linker.
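
Applied to the question's own example, where the variable is defined in the library rather than declared extern, the same attribute might be used as in the sketch below. The names global_default, global_hidden and sum_globals are made up for illustration, and the attribute syntax is the GCC/Clang extension already used above, so this assumes one of those compilers targeting ELF.

```cpp
// Default visibility: the symbol is exported and may be interposed, so
// -fPIC code in a shared library reaches it through the GOT.
int global_default = 10;

// Hidden visibility (GCC/Clang extension): the symbol is not exported from
// the shared object, so references inside the library bind to this
// definition and need no GOT entry on x86-64.
__attribute__((visibility("hidden"))) int global_hidden = 20;

// Hypothetical accessor that touches both variables.
int sum_globals() {
    return global_default + global_hidden;
}
```

Building with -fvisibility=hidden makes hidden the default for the whole translation unit; in that setup the few symbols meant to be exported are marked visibility("default") instead.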

The necessity of making code relocatable, in order to allow shared objects to be loaded at different addresses in different processes, means that statically allocated variables, whether they have global or local scope, can't be accessed directly with a single instruction on most architectures. The only exception I know of is the 64-bit x86 architecture, as you see above. It supports memory operands that are both PC-relative and have large 32-bit displacements that can reach any variable defined in the same component.

On all the other architectures I'm familiar with, accessing variables in a position-independent manner requires multiple instructions. How exactly varies greatly by architecture, but it often involves using the GOT. For example, if you compile the example C code above with the x86_64 port of GCC using the -m32 -O3 -fPIC options, you get:

foo():
        call    __x86.get_pc_thunk.dx
        add     edx, OFFSET FLAT:_GLOBAL_OFFSET_TABLE_
        push    ebx
        mov     ebx, DWORD PTR global_visible@GOT[edx]
        mov     ecx, DWORD PTR local@GOTOFF[edx]
        mov     eax, DWORD PTR global_hidden@GOTOFF[edx]
        add     eax, DWORD PTR [ebx]
        pop     ebx
        add     eax, ecx
        ret
__x86.get_pc_thunk.dx:
        mov     edx, DWORD PTR [esp]
        ret

The GOT is used for all three variable accesses, but if you look closely, global_hidden and local are handled differently than global_visible. With the latter, a pointer to the variable is loaded from the GOT; the former two are accessed directly at an offset from the GOT base (the @GOTOFF references). This is a fairly common trick among architectures where the GOT is used for all position-independent variable references.

The 32-bit x86 architecture is exceptional in one way here, since it has large 32-bit displacements and a 32-bit address space. This means that anywhere in memory can be accessed through the GOT base, not just the GOT itself. Most other architectures only support much smaller displacements, which makes the maximum distance something can be from the GOT base much smaller. Other architectures that use this trick will only put small (local/hidden/protected) variables in the GOT itself; large variables are stored outside the GOT, and the GOT will contain a pointer to the variable, just as with normal-visibility global variables.

Ferne answered 9/4, 2019 at 18:25 Comment(12)
In your i386 PIC example, the variables aren't allocated inside the GOT, just referenced relative to it. GCC asks the linker to calculate a displacement from the GOT to local with local@GOTOFF. We can see this on Godbolt godbolt.org/z/0Zu-RM by looking at the directives: local is defined in .data, not in any special section. (I used -g0 so I could look at directives without the clutter of debug directives.) And I made the other vars defined, not extern. global_visible ends up next to the other two.Monsignor
Or are you saying that the GOT encompasses the entire 4GB of address space that can be referenced with a 32-bit pointer, including all other sections? (Since 32-bit pointers wrap at 4G, a disp32 can reach anywhere in the whole 4GB from any starting point. You get a 2GB size limit on x86-64 where you want to be able to reach anywhere from anywhere with a signed 32-bit displacement added to 64-bit pointers, so you can't wrap around.)Monsignor
@PeterCordes You're right that on i386 the variables aren't actually located in one contiguous GOT, as they don't need to be. I was assuming that they were, based on the code generated and how it works on other platforms.Ferne
@PeterCordes Thanks for the clarification. I am not sure I understand why the i386 uses an offset from the GOT rather than the program counter. Is this because PC relative addressing is more cumbersome on x86?Uprear
@A.S.: x86-64 added a PC-relative addressing mode for 64-bit mode only. It's not available in 32-bit mode. But yes, good point, if we remove the global_visible then there's no need to calculate the GOT address at all, gcc should skip that step and simply use mov eax, local-.Lpc_base[edx] to reference it relative to the return address of the thunk that reads EIP into EDX (i.e. put a .Lpc_base label on the instruction after the call). But it instead still adds the GOT offset (godbolt.org/z/Yek-6q), so this is a missed optimization.Monsignor
I guess the assumption is that most functions are going to want to reference something relative to the GOT, though, and of course you only want 1 register as the PIC base address, and you have it pointing to the GOT. (Traditionally EBX, but modern gcc can thunk into other registers to allow better register allocation.) Fortunately 32-bit x86 is obsolete, so IDK if anyone wants to implement code in the compiler to look for this optimization in functions that only access private static data. It saves one add instruction per such function.Monsignor
@A.S. I don't think i386 ELF has the relocations that make that work. Like Peter Cordes said, you want to only use one register as a base register, and the @GOT and @GOTOFF relocations require that this base register point to the GOT. Instead of the @GOTOFF relocations you could do something like global_hidden - foo - 5[edx], but this would require that EDX not be adjusted after the call to the thunk, while global_visible@GOT requires the adjusted EDX.Ferne
That makes sense. For functions which access something in the GOT it would result in adding/subtracting the GOT offset (or worse, waste another base register). The compiler apparently just seems to make the assumption there will be something referencing the GOT, even when there isn't.Uprear
@RossRidge: I had a look at a non-x86 ISA (MIPS and MIPS64) to see what you were saying about allocating in the GOT. gcc doesn't do that (at least not by default). godbolt.org/z/J_UnFd shows all 3 vars use a chain of 2 loads each. But the static local uses ld $4,%got_page(local)($5) to get a base pointer, and lw $4,%got_ofst(local)($4) to use an immediate offset relative to that page. The others all load a pointer from the GOT for that variable specifically, but local can share the same got-page pointer with lots of others, giving cache hits. IDK why hidden can't do that.Monsignor
@A.S. and Ross: ELF does allow PC-relative relocations with an offset, like for rel32. Thus we can write .Lpc_base: nop;nop; mov $sym1 - .Lpc_base, %eax and have it assemble successfully; disassembles as b8 03 00 00 00 mov $0x3,%eax 3: R_386_PC32 sym1. (It's essential that we define .Lpc_base in the current file; mov $sym1 - sym2, %eax won't assemble.) I put a couple nop instructions first to prove that it can be relative to any known address, not necessarily the start of the current instruction.Monsignor
(forgot to mention for MIPS: linux-mips.org/wiki/PIC_code explains that PIC functions are called with their own address in $25, aka $t9, in case anyone's wondering about the "weird" daddu that reads $25, or .cpload.)Monsignor
@PeterCordes The relocation that i386 ELF is missing, which would make it possible to use an offset from the program counter instead of the GOT as the base, is one that would let you access the GOT entry for a variable using the unadjusted return value of the thunk. Something like mov _GLOBAL_OFFSET_TABLE_ - .Lpc_base + global_visible@GOT(%edx), %ebx, where .Lpc_base is a label pointing to the instruction after the thunk call instruction. My previous comment already addressed how you could replace @GOTOFF using an expression that generates an R_386_PC32 relocation.Ferne

In addition to the details in Ross Ridge's answer:

This is external vs. internal linkage. Without static, the variable has external linkage and is therefore accessible from any other translation unit. Any other translation unit can declare it as extern int global; and access it.

Linkage:

External linkage. The name can be referred to from the scopes in the other translation units. Variables and functions with external linkage also have language linkage, which makes it possible to link translation units written in different programming languages.

Any of the following names declared at namespace scope have external linkage unless the namespace is unnamed or is contained within an unnamed namespace (since C++11):

  • variables and functions not listed above (that is, functions not declared static, namespace-scope non-const variables not declared static, and any variables declared extern);
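
As a small illustration of the distinction quoted above, the sketch below defines one variable with external linkage and two with internal linkage. The names counter_external, counter_static, counter_anonymous and sum_counters are hypothetical. Language linkage and ELF visibility are separate concepts; the connection assumed here is only that a name with internal linkage is not exported as a dynamic symbol, so it cannot be interposed.

```cpp
// External linkage: another translation unit can declare
// `extern int counter_external;` and refer to this definition.
int counter_external = 1;

// Internal linkage via `static`: the name is private to this translation unit.
static int counter_static = 2;

// Internal linkage via an unnamed namespace (since C++11).
namespace {
int counter_anonymous = 3;
}

// Hypothetical accessor that reads all three variables.
int sum_counters() {
    return counter_external + counter_static + counter_anonymous;
}
```

This matches the question's observation: the static version of global was accessed PC-relatively, while the version with external linkage went through the GOT.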
Synergy answered 9/4, 2019 at 18:49 Comment(0)
