Say I'm working on a library called libModern. This library uses a legacy C library, called libLegacy, as an implementation strategy. libLegacy's interface looks like this:

    typedef uint32_t LegacyFlags;

    struct LegacyFoo {
        uint32_t x;
        uint32_t y;
        LegacyFlags flags;
        // more data
    };

    struct LegacyBar {
        LegacyFoo foo;
        float a;
        // more data
    };

    void legacy_input(LegacyBar const* s); // Does something with s
    void legacy_output(LegacyBar* s);      // Stores data in s

libModern shouldn't expose libLegacy's types to its users for various reasons, among them:
- libLegacy is an implementation detail that shouldn't be leaked. Future versions of libModern might choose to use another library instead of libLegacy.
- libLegacy uses hard-to-use, easy-to-misuse types that shouldn't be part of any user-facing API.
The textbook way to deal with this situation is the pimpl idiom: libModern would provide a wrapper type that internally has a pointer to the legacy data. However, this is not possible here, since libModern cannot allocate dynamic memory. Generally, its goal is not to add a lot of overhead.
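For reference, a minimal sketch of what the pimpl approach would look like if dynamic allocation were allowed. The names PimplBar, x() and set_x() are hypothetical, and the LegacyFoo/LegacyBar definitions are stand-ins that would normally come from libLegacy's header and stay hidden in the implementation file. The std::make_unique call is exactly the heap allocation that rules this option out here.

```cpp
#include <cstdint>
#include <memory>

// --- What a public libModern header could contain (hypothetical) ---
struct LegacyBar;  // forward declaration only; users never see the definition

class PimplBar {
public:
    PimplBar();
    ~PimplBar();  // defined where LegacyBar is a complete type
    std::uint32_t x() const noexcept;
    void set_x(std::uint32_t value) noexcept;

private:
    std::unique_ptr<LegacyBar> impl_;  // pointer to the hidden legacy data
};

// --- What the implementation file could contain (hypothetical) ---
// Stand-ins for the libLegacy types.
struct LegacyFoo {
    std::uint32_t x;
    std::uint32_t y;
    std::uint32_t flags;
};
struct LegacyBar {
    LegacyFoo foo;
    float a;
};

// Value-initializes the legacy object on the heap.
PimplBar::PimplBar() : impl_(std::make_unique<LegacyBar>()) {}
PimplBar::~PimplBar() = default;

std::uint32_t PimplBar::x() const noexcept { return impl_->foo.x; }
void PimplBar::set_x(std::uint32_t value) noexcept { impl_->foo.x = value; }
```

Every access goes through one extra pointer indirection, and every object costs one allocation, which is the overhead libModern wants to avoid.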
Therefore, libModern defines its own types that are layout-compatible with the legacy types, yet have a better interface. In this example it is using a strong enum instead of a plain uint32_t for flags:

    enum class ModernFlags : std::uint32_t
    {
        first_flag = 0,
        second_flag = 1,
    };

    struct ModernFoo {
        std::uint32_t x;
        std::uint32_t y;
        ModernFlags flags;
        // More data
    };

    struct ModernBar {
        ModernFoo foo;
        float a;
        // more data
    };

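A sketch of compile-time checks that could guard the "same layout" assumption, assuming C++17 or later. The legacy and modern definitions are repeated here, without the elided members, only so that the block compiles on its own. These checks verify size, alignment, and member offsets, which is what a byte-wise copy relies on. They do not make a reinterpret_cast between the types well-defined.

```cpp
#include <cstddef>      // offsetof
#include <cstdint>
#include <type_traits>

// Stand-ins for the libLegacy declarations (normally from its C header).
typedef std::uint32_t LegacyFlags;
struct LegacyFoo { std::uint32_t x; std::uint32_t y; LegacyFlags flags; };
struct LegacyBar { LegacyFoo foo; float a; };

// The modern counterparts from the text above.
enum class ModernFlags : std::uint32_t { first_flag = 0, second_flag = 1 };
struct ModernFoo { std::uint32_t x; std::uint32_t y; ModernFlags flags; };
struct ModernBar { ModernFoo foo; float a; };

// Both sides must be standard-layout for offsetof to be well-defined,
// and trivially copyable for std::memcpy / std::bit_cast to be usable.
static_assert(std::is_standard_layout_v<LegacyBar> && std::is_standard_layout_v<ModernBar>);
static_assert(std::is_trivially_copyable_v<LegacyBar> && std::is_trivially_copyable_v<ModernBar>);

// Same size and alignment.
static_assert(sizeof(ModernFoo) == sizeof(LegacyFoo));
static_assert(sizeof(ModernBar) == sizeof(LegacyBar));
static_assert(alignof(ModernBar) == alignof(LegacyBar));

// Same member offsets.
static_assert(offsetof(ModernFoo, x) == offsetof(LegacyFoo, x));
static_assert(offsetof(ModernFoo, y) == offsetof(LegacyFoo, y));
static_assert(offsetof(ModernFoo, flags) == offsetof(LegacyFoo, flags));
static_assert(offsetof(ModernBar, foo) == offsetof(LegacyBar, foo));
static_assert(offsetof(ModernBar, a) == offsetof(LegacyBar, a));
```

If a future libLegacy release reorders or resizes a member, the build fails instead of silently producing a mismatched copy.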
Now the question is: How can libModern convert between the legacy and the modern types without much overhead? I know of 3 options:
- reinterpret_cast. This is undefined behavior, but in practice produces perfect assembly. I want to avoid this, since I cannot rely on this still working tomorrow or on another compiler.
- std::memcpy. In simple cases this generates the same optimal assembly, but in any non-trivial case this adds significant overhead.
- C++20's std::bit_cast. In my tests, at best it produces exactly the same code as memcpy. In some cases it's worse.
This is a comparison of the 3 ways to interface with libLegacy:
- Interfacing with legacy_input()
  - Using reinterpret_cast:

        void input_ub(ModernBar const& s) noexcept
        {
            legacy_input(reinterpret_cast<LegacyBar const*>(&s));
        }

    Assembly:

        input_ub(ModernBar const&):
            jmp legacy_input

    This is perfect codegen, but it invokes UB.
  - Using memcpy:

        void input_memcpy(ModernBar const& s) noexcept
        {
            LegacyBar ls;
            std::memcpy(&ls, &s, sizeof(ls));
            legacy_input(&ls);
        }

    Assembly:

        input_memcpy(ModernBar const&):
            sub rsp, 24
            movdqu xmm0, XMMWORD PTR [rdi]
            mov rdi, rsp
            movaps XMMWORD PTR [rsp], xmm0
            call legacy_input
            add rsp, 24
            ret

    Significantly worse.
  - Using bit_cast:

        void input_bit_cast(ModernBar const& s) noexcept
        {
            LegacyBar ls = std::bit_cast<LegacyBar>(s);
            legacy_input(&ls);
        }

    Assembly:

        input_bit_cast(ModernBar const&):
            sub rsp, 40
            movdqu xmm0, XMMWORD PTR [rdi]
            mov rdi, rsp
            movaps XMMWORD PTR [rsp+16], xmm0
            mov rax, QWORD PTR [rsp+16]
            mov QWORD PTR [rsp], rax
            mov rax, QWORD PTR [rsp+24]
            mov QWORD PTR [rsp+8], rax
            call legacy_input
            add rsp, 40
            ret

    And I have no idea what's going on here.
- Interfacing with legacy_output()
  - Using reinterpret_cast:

        auto output_ub() noexcept -> ModernBar
        {
            ModernBar s;
            legacy_output(reinterpret_cast<LegacyBar*>(&s));
            return s;
        }

    Assembly:

        output_ub():
            sub rsp, 56
            lea rdi, [rsp+16]
            call legacy_output
            mov rax, QWORD PTR [rsp+16]
            mov rdx, QWORD PTR [rsp+24]
            add rsp, 56
            ret

  - Using memcpy:

        auto output_memcpy() noexcept -> ModernBar
        {
            LegacyBar ls;
            legacy_output(&ls);
            ModernBar s;
            std::memcpy(&s, &ls, sizeof(ls));
            return s;
        }

    Assembly:

        output_memcpy():
            sub rsp, 56
            lea rdi, [rsp+16]
            call legacy_output
            mov rax, QWORD PTR [rsp+16]
            mov rdx, QWORD PTR [rsp+24]
            add rsp, 56
            ret

  - Using bit_cast:

        auto output_bit_cast() noexcept -> ModernBar
        {
            LegacyBar ls;
            legacy_output(&ls);
            return std::bit_cast<ModernBar>(ls);
        }

    Assembly:

        output_bit_cast():
            sub rsp, 72
            lea rdi, [rsp+16]
            call legacy_output
            movdqa xmm0, XMMWORD PTR [rsp+16]
            movaps XMMWORD PTR [rsp+48], xmm0
            mov rax, QWORD PTR [rsp+48]
            mov QWORD PTR [rsp+32], rax
            mov rax, QWORD PTR [rsp+56]
            mov QWORD PTR [rsp+40], rax
            mov rax, QWORD PTR [rsp+32]
            mov rdx, QWORD PTR [rsp+40]
            add rsp, 72
            ret

Here you can find the entire example on Compiler Explorer.
I also noticed that the codegen varies significantly depending on the exact definition of the structs (i.e. the order, number, and type of members). But the UB version of the code is consistently better than, or at least as good as, the other two versions.
Now my questions are:
- How come the codegen varies so dramatically? It makes me wonder if I'm missing something important.
- Is there something I can do to guide the compiler to generate better code without invoking UB?
- Are there other standard-conformant ways that generate better code?
"reinterpret_cast. This is undefined behavior" Why? They come from C. Are classes not standard-layout types? "this is not possible here, since libModern cannot allocate dynamic memory" Allocate on stack then. Why can't you just legacy_input(&s.foo);? – Templin

[...] output_ub() function into this: auto output_ub() noexcept -> ModernBar { static ModernBar s; legacy_output(reinterpret_cast<LegacyBar*>(&s)); return ModernBar{}; } Because it knows that legacy output cannot dereference the pointer I'm giving it, and therefore s cannot be modified, and hence I'm returning an uninitialized ModernBar. – Moralize

[...] -fno-strict-aliasing? – Sap

[...] reinterpret_cast in C and in C++ using it for type punning invokes UB. "Is reinterpret_cast type punning actually undefined behavior?", "Why is this type punning not undefined behavior?", "Is reinterpret_cast actually good for anything?", "Is this type punning well-defined?", news.ycombinator.com/item?id=17242553. If it's ok then there's no need to introduce std::bit_cast – Drool