How well do compilers (with link-time optimization) cope with functions that return quickly (early-out paths)?

Asked 17/4, 2015 at 17:47 Answered 7/7, 2015 at 21:31

c linker compiler-optimization micro-optimization lto

In C, if I have a function call that looks like

// main.c
...
do_work_on_object(object, arg1, arg2);
...

// object.c
void do_work_on_object(struct object_t *object, int arg1, int arg2)
{
  if(object == NULL)
  {
    return;
  }
  // do lots of work
}

then the compiler will generate a lot of stuff in main.o to save state, pass parameters (hopefully in registers in this case), and restore state.

However, at link time it can be observed that arg1 and arg2 are not used in the quick-return path, so the clean-up and state restoration can be short-circuited. Do linkers tend to do this kind of thing automatically, or would one need to turn on link-time optimization (LTO) to get that kind of thing to work?

(Yes, I could inspect the disassembled code, but I'm interested in the behaviours of compilers and linkers in general, and on multiple architectures, so hoping to learn from others' experience.)

Assuming that profiling shows this function call is worth optimizing, should we expect the following code to be noticeably faster (e.g. without the need to use LTO)?

// main.c
...
if(object != NULL)
{
  do_work_on_object(object, arg1, arg2);
}
...

// object.c
void do_work_on_object(struct object_t *object, int arg1, int arg2)
{
  assert(object != NULL) // generates no code in release build
  // do lots of work
}

Merissameristem answered 17/4, 2015 at 17:47 Comment(4)

The jargon term for this optimization is "shrink-wrapping". It is surprisingly difficult to implement in a production compiler; I'm only aware of people even trying to do it for special cases like vtable thunks and PLT stubs. I may be out of date. – Salon 17/4, 2015 at 17:53

Why the downvote? It's an interesting question. – Shulock 17/4, 2015 at 17:53

Andrew Henle Those things are true, but the quick-return case is the thing I asked about. – Merissameristem 17/4, 2015 at 18:43

Shrink-wrapping is definitely a well-known thing, e.g. this 1998 paper citeseerx.ist.psu.edu/viewdoc/… – Merissameristem 17/4, 2015 at 21:49

Some compilers (like GCC and clang) are able to do "shrink-wrap" optimization to delay saving call-preserved regs until after a possible early-out, if they're able to spot the pattern. But some don't, e.g. apparently MSVC 16.11 still doesn't.

I don't think any do partial inlining of just the early-out check into the caller, to avoid even the overhead of arg-passing and the call / ret itself.

Since compiler/linker support for this is not universal and not always successful even for shrink-wrapping, you can write your code in a way that gets much of the benefit, at the cost of splitting the logic of your function into two places.

If you have a fast-path that takes hardly any code, but happens often enough to matter, put that part in a header so it gets inlined, with a fallback to calling the rest of the function (which you make private, so it can assume that any checks in the inlined part are already done).

e.g. par2's routine that processes a block of data has a fast-path for when the galois16 factor is zero. (dst[i] += 0 * src[i] is a no-op, even when * is a multiply in Galois16, and += is a GF16 add (i.e. a bitwise XOR)).

Note how the commit in question renames the old function to InternalProcess, and adds a new template<class g> inline bool ReedSolomon<g>::Process that checks for the fast-path, and otherwise calls InternalProcess. (as well as making a bunch of unrelated whitespace changes, and some ifdefs... It was originally a 2006 CVS commit.)

The comment in the commit claims an overall 8% speed gain for repairing.

Hornblende answered 7/7, 2015 at 21:31 Comment(0)

Neither the setup or cleanup state code can be short-circuited, because the resulted compiled code is static, and it doesn't know what will happen when the program get's executed. So the compiler will always have to setup the whole parameter stack.

Think of two situations: in one object is nil, in the other is not. How will the assembly code know if to put on the stack the rest of the argument? Especially as the caller is the one responsible of placing the arguments at their proper location (stack or registry).

Soothfast answered 17/4, 2015 at 19:11 Comment(13)

Setup indeed cannot be short-circuited, because the conditional inside the function has not been executed. But clean-up can be short-circuited because the linker can reason about the machine state at the exit point and the state after the clean-up code. There's no need to (say) pop registers that would have been polluted by the real work, on a path where that work does not occur. My question is whether linkers tend to do this. – Merissameristem 17/4, 2015 at 19:27

The linker cannot make any assumptions, compilation is made only once, while the program can execute multiple times. And you cannot short-circuit the cleanup as you'd leak memory if the caller has already placed some data on the stack. Function prologue and epilogue are not optional. And not lastly, it's not the linker that does this, it's the compiler, linker's role is another one. – Soothfast 17/4, 2015 at 19:34

Alternatively, the linker could save such registers after the branch for the quick return point. This is tantamount to the refactoring in my example. – Merissameristem 17/4, 2015 at 19:35

This can't be the compiler's job, it can't see two translation units. – Merissameristem 17/4, 2015 at 19:36

In the case of an early return, the code jumps straight to the function epilogue, which restores the registry. No optimisation needs to be made. And yes, it's the compiler who generates the code, the linker makes sure that the called functions reach a valid offset, it doesn't affect the object code. – Soothfast 17/4, 2015 at 19:38

Yes, the compiler must be conservative in each translation unit. But the linker can definitely change the object code beyond generating offsets, or LTO would not be a thing. – Merissameristem 17/4, 2015 at 19:44

I think we're deviating from the original discussion. The point is that there's no optimisation to be done (see my previous comment why). – Soothfast 17/4, 2015 at 19:46

Let us continue this discussion in chat. – Merissameristem 17/4, 2015 at 20:12

Any initialization done by the caller (pushing arguments, e.g.) can indeed only be skipped if the compiler is able to hoist the early return all the way into the caller, which is much more of an LTO thing than a linker-relaxation thing. However, in principle the linker (or, more likely, the compiler) could perfectly well move the initial test arg_reg_0; return_if_zero; instructions above all of the callee's prologue setup work. And, much to my surprise, gcc 4.9 can do that. (clang 3.5 can't.) – Salon 17/4, 2015 at 21:28

(I'm surprised because that is exactly what I thought was still "surprisingly difficult to implement in a production compiler". Maybe I just underestimated the degree to which GCC's having finally killed off the infamous 'reload' pass means things that used to be hard aren't anymore.) – Salon 17/4, 2015 at 21:30

We should run some benchmarks on this, as the performance gain is noticeable due to the missing prologue/epilogue. – Soothfast 17/4, 2015 at 21:33

I have a use case in mind, and access to lots of hardware and compilers, so I can explore benchmarks and/or disassembly. I now see that the prevailing calling convention is going to be relevant to the results. – Merissameristem 17/4, 2015 at 21:55

LLVM notes that this is an area they hope to improve: llvm.org/OpenProjects.html#codegen – Merissameristem 17/4, 2015 at 22:5

Recommended topics

Hot tags