gcc vs clang: inlining a function with -fPIC

Consider this code:

// foo.cxx
int last;

int next() {
  return ++last;
}

int index(int scale) {
  return next() << scale;
}

When compiling with gcc 7.2:

$ g++ -std=c++11 -O3 -fPIC

This emits:

next():
    movq    last@GOTPCREL(%rip), %rdx
    movl    (%rdx), %eax
    addl    $1, %eax
    movl    %eax, (%rdx)
    ret
index(int):
    pushq   %rbx
    movl    %edi, %ebx
    call    next()@PLT    ## next() not inlined, call through PLT
    movl    %ebx, %ecx
    sall    %cl, %eax
    popq    %rbx
    ret

However, when compiling the same code with the same flags using clang 3.9 instead:

next():                               # @next()
    movq    last@GOTPCREL(%rip), %rcx
    movl    (%rcx), %eax
    incl    %eax
    movl    %eax, (%rcx)
    retq

index(int):                              # @index(int)
    movq    last@GOTPCREL(%rip), %rcx
    movl    (%rcx), %eax
    incl    %eax              ## next() was inlined!
    movl    %eax, (%rcx)
    movl    %edi, %ecx
    shll    %cl, %eax
    retq

gcc calls next() via the PLT, while clang inlines it. Both still look up last through the GOT. When compiling for Linux, is clang right to make that optimization, with gcc missing out on easy inlining? Is clang wrong to make it? Or is this purely a quality-of-implementation (QoI) issue?

Dodecanese answered 30/8, 2017 at 23:24 Comment(0)

I don't think the standard goes into that much detail. Roughly, it only says that a name with external linkage denotes the same entity in every translation unit. By that measure, clang's version is correct.

From that point on, to the best of my knowledge, we are outside the standard. Compilers differ in what they consider useful -fPIC output.

Note that g++ -c -std=c++11 -O3 -fPIE outputs:

0000000000000000 <_Z4nextv>:
   0:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # 6 <_Z4nextv+0x6>
   6:   83 c0 01                add    $0x1,%eax
   9:   89 05 00 00 00 00       mov    %eax,0x0(%rip)        # f <_Z4nextv+0xf>
   f:   c3                      retq   

0000000000000010 <_Z5indexi>:
  10:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # 16 <_Z5indexi+0x6>
  16:   89 f9                   mov    %edi,%ecx
  18:   83 c0 01                add    $0x1,%eax
  1b:   89 05 00 00 00 00       mov    %eax,0x0(%rip)        # 21 <_Z5indexi+0x11>
  21:   d3 e0                   shl    %cl,%eax
  23:   c3                      retq

So GCC does know how to optimize this. It just chooses not to when using -fPIC. But why? I can see only one explanation: make it possible to override the symbol during dynamic linking, and see the effects consistently. The technique is known as symbol interposition.

In a shared library, if index calls next, and next is globally visible, gcc has to consider the possibility that next could be interposed, so it calls it through the PLT. With -fPIE, however, you are not allowed to interpose symbols, so gcc enables the optimization.
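To make interposition concrete, here is a minimal sketch of what an interposing definition could look like. The file name interposer.cxx and the return value 42 are assumptions chosen for illustration; they are not part of the original code.

```cpp
// interposer.cxx (hypothetical file name, illustration only)
//
// A replacement for next() with the same signature, and therefore the
// same mangled name (_Z4nextv) as the definition in foo.cxx.
int next() {
  return 42;  // arbitrary value, chosen so the override is easy to spot
}
```

Assuming foo.cxx is built into a shared library, this file could be built with g++ -shared -fPIC into its own library and loaded first through LD_PRELOAD. With the gcc -fPIC output shown in the question, index() calls next() through the PLT and should then pick up this definition. With the clang output, the original body of next() is already inlined into index(), so the override would not be seen there.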

So is clang wrong? No. But gcc seems to provide better support for symbol interposition, which is handy for instrumenting code. It does so at the cost of some overhead if one builds an executable with -fPIC instead of -fPIE, though.


Additional notes:

In this blog entry by one of the gcc developers, he mentions near the end of the post:

While comparing some benchmarks to clang, I noticed that clang actually ignore ELF interposition rules. While it is bug, I decided to add -fno-semantic-interposition flag to GCC to get similar behaviour. If interposition is not desirable, ELF's official answer is to use hidden visibility and if the symbol needs to be exported define an alias. This is not always practical thing to do by hand.

Following that lead landed me on the x86-64 ABI spec. In section 3.5.5, it does mandate that all calls to globally visible functions go through the PLT (it goes as far as defining the exact instruction sequence to use depending on the memory model).

So, though it does not violate the C++ standard, ignoring semantic interposition seems to violate the ABI.


Last word: didn't know where to put this, but it might be of interest to you. I'll spare you the dumps, but my tests with objdump and compiler options showed that:

On the gcc side of things:

  • gcc -fPIC: accesses to last go through the GOT, calls to next() go through the PLT.
  • gcc -fPIC -fno-semantic-interposition: last goes through GOT, next() is inlined.
  • gcc -fPIE: last is IP-relative, next() is inlined.
  • -fPIE implies -fno-semantic-interposition

On the clang side of things:

  • clang -fPIC: last goes through GOT, next() is inlined.
  • clang -fPIE: last goes through GOT, next() is inlined.

And a modified version that compiles to IP-relative, inlined on both compilers:

// foo.cxx
int last_ __attribute__((visibility("hidden")));
extern int last __attribute__((alias("last_")));

int __attribute__((visibility("hidden"))) next_()
{
  return ++last_;
}
// This one is ugly, because alias needs the mangled name. Could extern "C" next_ instead.
extern int next() __attribute__((alias("_Z5next_v")));

int index(int scale) {
  return next_() << scale;
}

Basically, this explicitly states that, despite making the symbols available globally, we use the hidden versions, which ignore any kind of interposition. Both compilers then fully optimize the accesses, regardless of the options passed.
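The comment in the snippet above mentions that extern "C" could replace the hand-written mangled name. A minimal sketch of that variant follows, assuming an ELF target and a compiler that supports the GNU alias and visibility attributes. I have not compared its assembly output against the version above.

```cpp
// foo.cxx, extern "C" variant (sketch; assumes an ELF target and GNU attributes)

// Hidden definitions with C linkage: their assembler names are simply
// "last_" and "next_", so the aliases below need no mangled names.
extern "C" {
__attribute__((visibility("hidden"))) int last_;

__attribute__((visibility("hidden"))) int next_() {
  return ++last_;
}
}

// Exported names that refer to the same variable and the same function.
extern int last __attribute__((alias("last_")));
int next() __attribute__((alias("next_")));

int index(int scale) {
  // Calls the hidden definition directly, so interposition does not apply.
  return next_() << scale;
}
```

The exported next() keeps its C++ mangled name, so callers in other translation units are unaffected. Only the hidden helpers change linkage.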

Plenary answered 31/8, 2017 at 2:26 Comment(15)
@Barry> Glad you found it useful; I learned interesting things investigating this question. I was curious about last as well and ran a few tests; I added some findings to the post. - Plenary
What's the advantage of the alias attribute over visibility("hidden")? - Dodecanese
@Barry> That was just to keep the symbols visible, so that the semantics remain equivalent except for the lack of interposition support. Granted, if there is no intent to actually export them, one can do away with the aliases and just use visibility("hidden"). I didn't test it, but I believe compiling with -fvisibility=hidden -fvisibility-inlines-hidden and manually marking relevant symbols with __attribute__((visibility("default"))) should yield the same result. - Plenary
This seems to imply that any codebase built as shared libraries (common nowadays), which is very likely compiled with -fpic, and which doesn't bother with symbol visibility and leaves it at the Linux default (visible), is going to have badly crippled performance on gcc because things cannot be inlined (even functions declared in header files, which will be inline but usually not static). Can this really be right? Is there a reason why the effect is less dramatic than I understand it to be? - Sunfast
@NirFriedman> I would suspect that such cases are not that common. First, the cost is limited: the PLT is very likely to be in cache already, the call is unconditional so the CPU should have no trouble with speculative execution, and functions small enough to be inlined are likely to have all their arguments passed in registers anyway. Add to that that it only occurs when there is an inlining opportunity, which may not be that frequent, and that most code is not performance-critical to begin with. - Plenary
@NirFriedman> In addition, the trend seems to be to compile more and more with -fvisibility=hidden, manually exporting relevant symbols. That cuts down on symbols, drastically improving both static linking and loading (dynamic linking) times. It also avoids accidental symbol interposition, which is a thing in C: it's easy to prefix the public API with the library name, but it's also easy to forget that internal functions get exported by default too. - Plenary
@Nir Friedman I don't know about crippling, but it is a performance hit. Obviously the context matters, such as whether or not the other code in the hot path floods the instruction cache. To my understanding, -flto still isn't stable enough to build a whole system with, but it will solve a lot of these problems as well (once it matures). - Dufrene
Could the last code snippet be simpler if one set the visibility to protected? - Agripinaagrippa
@Oliv> One would believe so, and clang does output what one would expect. However, a quick test shows that gcc sticks to using the GOT for the last variable despite protected visibility. It does inline the call to next, though. - Plenary
@Plenary Do you think gcc does it to follow the x86-64 ABI spec, or is gcc just missing an optimization opportunity? - Agripinaagrippa
@Oliv> Interestingly, with protected visibility on last, clang -O2 -fPIC -shared foo.cxx fails with the error "relocation R_X86_64_PC32 against protected symbol 'last' can not be used when making a shared object". So I would say that the implementation in clang (3.8.0) is dubious. - Plenary
@oliv> Digging more, it seems clang is plain wrong on that one. A protected data symbol can be external. Marking it as protected only ensures it cannot be interposed. However, with copy relocation, it will be transferred to the executable's .bss section at runtime, which won't work with clang's output. In other words, clang 3.8.0 fails to compile shared objects that include protected data symbols. - Plenary
@Plenary I have continued to dig in that direction. This clang bug can be avoided by declaring the variable extern and defining it in another source file. I have checked with readelf: the 3 symbols have the same protected visibility. Moreover, clang and gcc both load last from a dynamic location :)! - Agripinaagrippa
@Plenary Finally, it seems that clang is not wrong: putting protected visibility in the header file also leads to an error with gcc, but at link time, when linking the shared library with an executable. - Agripinaagrippa
@Oliv> Interesting as it is, this is getting too far from this example to follow and be sure we are talking about exactly the same thing :) - Plenary
