Is returning a 2-tuple less efficient than std::pair?
Asked Answered
R

1

24

Consider this code:

#include <utility>
#include <tuple>

std::pair<int, int> f1()
{
    return std::make_pair(0x111, 0x222);
}

std::tuple<int, int> f2()
{
    return std::make_tuple(0x111, 0x222);
}

Clang 3 and 4 generate similar code for both on x86-64:

f1():
 movabs rax,0x22200000111
 ret    
f2():
 movabs rax,0x11100000222 ; opposite packing order, not important
 ret    

But Clang 5 generates different code for f2():

f2():
 movabs rax,0x11100000222
 mov    QWORD PTR [rdi],rax
 mov    rax,rdi
 ret    

As do GCC 4 through GCC 7:

f2():
 movabs rdx,0x11100000222
 mov    rax,rdi
 mov    QWORD PTR [rdi],rdx ; GCC 4-6 use 2 DWORD stores
 ret

Why is the generated code worse when returning a std::tuple that fits in a single register, vs std::pair? It seems especially strange since Clang 3 and 4 seemed to be optimal, yet 5 is not.

Try it here: https://godbolt.org/g/T2Yqrj

Rectus answered 24/10, 2017 at 3:24 Comment(6)
Related: Why is std::pair faster than std::tupleJeanniejeannine
gcc.gnu.org/bugzilla/show_bug.cgi?id=80792 gcc.gnu.org/bugzilla/show_bug.cgi?id=78302 gcc.gnu.org/bugzilla/show_bug.cgi?id=71301Renascence
@JohnZwinck - you have any link for godbolt or any other evidence that gcc actually produced the code you show for f2() for any version from 4 up to (but not including) 7? Your link has clang and all gcc versions I tried showed different code with two DWORD stores to memory.Aerugo
@BeeOnRope: GCC 4 through 6 use two DWORD stores as you say. GCC 7 uses a single QWORD store as shown in my question. The distinction between these two cases wasn't important to me, so I didn't call it out.Rectus
Well it's important in as much as two stores sucks pretty much twice as much as one store, so since the focus is on optimization or lack thereof, it would be good to point out that most versions of gcc outside the bleeding edge are even worse (unlike the underlying ABI issue though, this one can be solved by more optimization).Aerugo
@BeeOnRope: That's fair. I've added a comment next to the QWORD asm.Rectus
A
30

The short answer is because the libstc++ standard library implementation used by gcc and clang on Linux implements std::tuple with a non-trivial move constructor (in particular, the _Tuple_impl base class has a non-trivial move constructor). On the other hand, the copy and move constructors for std::pair are all defaulted.

This in turn causes a C++-ABI related difference in the calling convention for returning these objects from functions, as well as passing them by value.

The Gory Details

You ran your tests on Linux, which adheres to the SysV x86-64 ABI. This ABI has specific rules for passing or returning classes or structures to functions, which you can read more about here. The specific case we are interested in with whether the two int fields in these structures will get the INTEGER class or the MEMORY class.

A recent version of the ABI specification has this to say:

The classification of aggregate (structures and arrays) and union types works as follows:

  1. If the size of an object is larger than eight eightbytes, or it contains un- aligned fields, it has class MEMORY 12 .
  2. If a C++ object has either a non-trivial copy constructor or a non-trivial destructor 13 , it is passed by invisible reference (the object is replaced in the parameter list by a pointer that has class INTEGER) 14 .
  3. If the size of the aggregate exceeds a single eightbyte, each is classified separately. Each eightbyte gets initialized to class NO_CLASS.
  4. Each field of an object is classified recursively so that always two fields are considered. The resulting class is calculated according to the classes of the fields in the eightbyte

It is condition (2) that applies here. Note that it mentions only copy constructors, and not move constructors - but it is fairly apparently that just is probably just a defect in the specification given the introduction of move constructors which generally need to be included in any classification algorithm where copy constructors were included before. In particular, IA-64 cxx-abi, which gcc is documented to follow does include move constructors:

If the parameter type is non-trivial for the purposes of calls, the caller must allocate space for a temporary and pass that temporary by reference. Specifically:

  • Space is allocated by the caller in the usual manner for a temporary, typically on the stack.

and then the definition of non-trivial:

A type is considered non-trivial for the purposes of calls if:

  • it has a non-trivial copy constructor, move constructor, or destructor, or
  • all of its copy and move constructors are deleted.

So because tuple is not considered to be trivially copyable from an ABI perspective, it gets MEMORY treatment, which means that your function must populate the stack allocated object passed in by the called in rdi. The std::pair function can just pass back the entire structure in rax since it fits in one EIGHTBYTE and has class INTEGER.

Does it matter? Yeah, strictly speaking, a standalone function like the one you have compiled will be less efficient for tuple since this ABI different is "baked in".

Often however, the compiler will be able to see the body of the function and inline it or perform inter-procedural analysis even if not inlined. In both cases, the ABI is no longer important and it is likely both approaches would be equally efficient, at least with a decent optimizer. For example let's call your f1() and f2() functions and do some math on the result:

int add_pair() {
  auto p = f1();
  return p.first + p.second;
}

int add_tuple() {
  auto t = f2();
  return std::get<0>(t) + std::get<1>(t);
}

In principle the add_tuple method starts from a disadvantage, since it has to call f2() which is less efficient and it also has to create a temporary tuple object on the stack so it can pass it to f2 as the hidden parameter. Well no matter, both functions are fully optimized to just return the right value directly:

add_pair():
  mov eax, 819
  ret
add_tuple():
  mov eax, 819
  ret

So overall you can say that the effect of this ABI issue with tuple will be relatively muted: it adds a small fixed overhead to functions that must comply with the ABI, but this will only really matter in a relative sense for very small functions - but such functions are likely to be declared in a place where they can be inlined (or if not, you are leaving performance on the table).

libcstc++ vs libc+++

As explained above, this is an ABI issue, not an optimization issue, per se. Both clang and gcc are already optimizing the library code to maximum extent possible under the constraints of the ABI - if they generated code like f1() for the std::tuple case they would break ABI compliant callers.

You can see this clearly if you switch to using libc++ rather than the Linux default of libstdc++ - this implementation doesn't have the explicit move constructor (as Marc Glisse mentions in the comments, they are stuck with this implementation for backwards compatibility). Now clang (and presumably gcc although I didn't try it), generates the same optimal code in both cases:

f1():                                 # @f1()
        movabs  rax, 2345052143889
        ret
f2():                                 # @f2()
        movabs  rax, 2345052143889
        ret

Earlier Versions of Clang

Why do versions of clang compile it differently? It was simply a bug in clang or a bug in the spec depending on how you look at it. The spec didn't explicitly include move construction in the cases where a hidden pointer to a temporary needed to be passed. wasn't conforming to the IA-64 C++ ABI. For example compiled the way clang used to do it was not compatible with gcc or newer versions of clang. The spec was eventually updated and the clang behavior changed in version 5.0.

Update: Marc Glisse mentions in the comments that there was initially confusion about the interaction of non-trivial move constructors and the C++ ABI, and clang changed their behavior at some point, which probably explains the switch:

The ABI specification for some argument passing cases involving move constructors were unclear, and when they were clarified, clang changed to follow the ABI. This is probably one of those cases.

Aerugo answered 24/10, 2017 at 3:41 Comment(10)
Why is std::pair<int,int> trivially copyable while std::tuple<int,int> isn't? I'd expect them both to be. Either way, doesn't this just need a trivial move constructor and destructor, and not a trivial copy constructor?Engrossment
@DanielH it isn't (in libstdc++) because the original implementation wasn't, and then not breaking the ABI made it impossible to change :-(Renascence
To be fair, @DanielH question is valid: on my implementation std::is_trivially_copyable returns false for both tuple and pair, so the C++ standard definition of that term should not be what's causing the difference. A a reading of the IA-64 C++ ABI indicates that to be passable in registers (which also means without the hidden pointer when used as a return type), a C++ class must not have either a non-trivial copy constructor or a non-trivial destructor, and A de/constructor is trivial if it is an implicitly-declared default. Neither tutpe nor pair satisfy that.Aerugo
@DanielH - I updated my answer: it is because std::tuple has a non-trivial move constructor, but std::pair does not. Yes, this is a violation of the standard which specifies that tuple should have a default move constructor.Aerugo
@BeeOnRope: Do you mean to say that the GNU standard library has a defective implementation of std::tuple? If so, won't it be basically impossible to change now? I'm also still a bit lost as to why Clang 3 and 4 did a better job optimizing std::tuple (I guess they're using a different standard library on Godbolt where I tested them, but if they can do it, why can't Clang 5?).Rectus
@JohnZwinck - correct, the implementation is not standards compliant in this area, if that's what you mean by defective. Whether the explicit move constructor helps or hurts performance overall, I'm not sure. Note that it has nothing to do with "optimization" per se, it's all about the ABI. Both clang and gcc are already optimizing the library code to maximum extent possible under the constraints of the ABI. If they generated code like f1() for standard tuple, it would fail at runtime because the caller wouldn't get the expected result.Aerugo
The ABI specification for some argument passing cases involving move constructors were unclear, and when they were clarified, clang changed to follow the ABI. This is probably one of those cases.Renascence
Do they really call octets "eightbytes"? Interesting.Bolding
@LightnessRacesinOrbit, No; the terminology in these ABIs is: "byte refers to an 8-bit object... eightbyte refers to a 64-bit object".Staffordshire
Is the non-standards-compliance of libstdc++ something, that they can change with the upcoming new C++20 standard? E.G., that the move constructor of std::tuple will be trivial, as specified in the standard, when compiling with -std=c++20, but will be as is currently with other compiler flags? Or is it required, that modules compiled with different standards can be linked together?Conclave

© 2022 - 2024 — McMap. All rights reserved.