Efficient type punning without undefined behavior [duplicate]

Say I'm working on a library called libModern. This library uses a legacy C library, called libLegacy, as an implementation strategy. libLegacy's interface looks like this:

typedef uint32_t LegacyFlags;

struct LegacyFoo {
    uint32_t x;
    uint32_t y;
    LegacyFlags flags;
    // more data
};

struct LegacyBar {
    LegacyFoo foo;
    float a;
    // more data
};

void legacy_input(LegacyBar const* s); // Does something with s
void legacy_output(LegacyBar* s); // Stores data in s

libModern shouldn't expose libLegacy's types to its users for various reasons, among them:

  • libLegacy is an implementation detail that shouldn't be leaked. Future versions of libModern might choose to use another library instead of libLegacy.
  • libLegacy uses hard-to-use, easy-to-misuse types that shouldn't be part of any user-facing API.

The textbook way to deal with this situation is the pimpl idiom: libModern would provide a wrapper type that internally holds a pointer to the legacy data. However, this is not possible here, since libModern cannot allocate dynamic memory. More generally, libModern aims to add as little overhead as possible.

Therefore, libModern defines its own types that are layout-compatible with the legacy types, yet have a better interface. In this example it is using a strong enum instead of a plain uint32_t for flags:

enum class ModernFlags : std::uint32_t
{
    first_flag = 0,
    second_flag = 1,
};

struct ModernFoo {
    std::uint32_t x;
    std::uint32_t y;
    ModernFlags flags;
    // More data
};

struct ModernBar {
    ModernFoo foo;
    float a;
    // more data
};
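The layout assumption above can be checked at compile time. The following is only a sketch: `same_shape` is a hypothetical helper name, and the legacy declarations are repeated as stand-ins so the snippet is self-contained. Passing these checks does not make a reinterpret_cast between the types well-defined. It only guards byte-wise conversions such as std::memcpy or std::bit_cast against the two sets of types drifting apart.

```cpp
#include <cstddef>      // offsetof
#include <cstdint>
#include <type_traits>

// Stand-ins for the libLegacy declarations shown in the question.
typedef std::uint32_t LegacyFlags;
struct LegacyFoo { std::uint32_t x; std::uint32_t y; LegacyFlags flags; };
struct LegacyBar { LegacyFoo foo; float a; };

// The libModern counterparts from the question.
enum class ModernFlags : std::uint32_t { first_flag = 0, second_flag = 1 };
struct ModernFoo { std::uint32_t x; std::uint32_t y; ModernFlags flags; };
struct ModernBar { ModernFoo foo; float a; };

// Hypothetical helper: true when two types agree on size and alignment and
// both have the properties that byte-wise copying relies on.
template <class Modern, class Legacy>
constexpr bool same_shape =
    sizeof(Modern) == sizeof(Legacy) &&
    alignof(Modern) == alignof(Legacy) &&
    std::is_standard_layout_v<Modern> && std::is_standard_layout_v<Legacy> &&
    std::is_trivially_copyable_v<Modern> && std::is_trivially_copyable_v<Legacy>;

static_assert(same_shape<ModernFoo, LegacyFoo>);
static_assert(same_shape<ModernBar, LegacyBar>);

// Member offsets have to be compared one by one.
static_assert(offsetof(ModernFoo, x) == offsetof(LegacyFoo, x));
static_assert(offsetof(ModernFoo, y) == offsetof(LegacyFoo, y));
static_assert(offsetof(ModernFoo, flags) == offsetof(LegacyFoo, flags));
static_assert(offsetof(ModernBar, foo) == offsetof(LegacyBar, foo));
static_assert(offsetof(ModernBar, a) == offsetof(LegacyBar, a));
```

If a later libLegacy release adds or reorders members, the build fails at these assertions instead of silently copying mismatched bytes.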

Now the question is: How can libModern convert between the legacy and the modern types without much overhead? I know of 3 options:

  1. reinterpret_cast. This is undefined behavior, but in practice produces perfect assembly. I want to avoid this, since I cannot rely on this still working tomorrow or on another compiler.
  2. std::memcpy. In simple cases this generates the same optimal assembly, but in any non-trivial case this adds significant overhead.
  3. C++20's std::bit_cast. In my tests, at best it produces exactly the same code as memcpy. In some cases it's worse.

This is a comparison of the 3 ways to interface with libLegacy:

  1. Interfacing with legacy_input()
    1. Using reinterpret_cast:
      void input_ub(ModernBar const& s) noexcept {
          legacy_input(reinterpret_cast<LegacyBar const*>(&s));
      }
      
      Assembly:
      input_ub(ModernBar const&):
              jmp     legacy_input
      
      This is perfect codegen, but it invokes UB.
    2. Using memcpy:
      void input_memcpy(ModernBar const& s) noexcept {
          LegacyBar ls;
          std::memcpy(&ls, &s, sizeof(ls));
          legacy_input(&ls);
      }
      
      Assembly:
      input_memcpy(ModernBar const&):
              sub     rsp, 24
              movdqu  xmm0, XMMWORD PTR [rdi]
              mov     rdi, rsp
              movaps  XMMWORD PTR [rsp], xmm0
              call    legacy_input
              add     rsp, 24
              ret
      
      Significantly worse.
    3. Using bit_cast:
      void input_bit_cast(ModernBar const& s) noexcept {
          LegacyBar ls = std::bit_cast<LegacyBar>(s);
          legacy_input(&ls);
      }
      
      Assembly:
      input_bit_cast(ModernBar const&):
              sub     rsp, 40
              movdqu  xmm0, XMMWORD PTR [rdi]
              mov     rdi, rsp
              movaps  XMMWORD PTR [rsp+16], xmm0
              mov     rax, QWORD PTR [rsp+16]
              mov     QWORD PTR [rsp], rax
              mov     rax, QWORD PTR [rsp+24]
              mov     QWORD PTR [rsp+8], rax
              call    legacy_input
              add     rsp, 40
              ret
      
      And I have no idea what's going on here.
  2. Interfacing with legacy_output()
    1. Using reinterpret_cast:
      auto output_ub() noexcept -> ModernBar {
          ModernBar s;
          legacy_output(reinterpret_cast<LegacyBar*>(&s));
          return s;
      }
      
      Assembly:
      output_ub():
              sub     rsp, 56
              lea     rdi, [rsp+16]
              call    legacy_output
              mov     rax, QWORD PTR [rsp+16]
              mov     rdx, QWORD PTR [rsp+24]
              add     rsp, 56
              ret
      
    2. Using memcpy:
      auto output_memcpy() noexcept -> ModernBar {
          LegacyBar ls;
          legacy_output(&ls);
          ModernBar s;
          std::memcpy(&s, &ls, sizeof(ls));
          return s;
      }
      
      Assembly:
      output_memcpy():
              sub     rsp, 56
              lea     rdi, [rsp+16]
              call    legacy_output
              mov     rax, QWORD PTR [rsp+16]
              mov     rdx, QWORD PTR [rsp+24]
              add     rsp, 56
              ret
      
    3. Using bit_cast:
      auto output_bit_cast() noexcept -> ModernBar {
          LegacyBar ls;
          legacy_output(&ls);
          return std::bit_cast<ModernBar>(ls);
      }
      
      Assembly:
      output_bit_cast():
              sub     rsp, 72
              lea     rdi, [rsp+16]
              call    legacy_output
              movdqa  xmm0, XMMWORD PTR [rsp+16]
              movaps  XMMWORD PTR [rsp+48], xmm0
              mov     rax, QWORD PTR [rsp+48]
              mov     QWORD PTR [rsp+32], rax
              mov     rax, QWORD PTR [rsp+56]
              mov     QWORD PTR [rsp+40], rax
              mov     rax, QWORD PTR [rsp+32]
              mov     rdx, QWORD PTR [rsp+40]
              add     rsp, 72
              ret
      

Here you can find the entire example on Compiler Explorer.

I also noticed that the codegen varies significantly depending on the exact definition of the structs (i.e. the order, number, and types of the members). But the UB version of the code is consistently better than, or at least as good as, the other two versions.

Now my questions are:

  1. How come the codegen varies so dramatically? It makes me wonder if I'm missing something important.
  2. Is there something I can do to guide the compiler to generate better code without invoking UB?
  3. Are there other standard-conformant ways that generate better code?
Moralize answered 13/1, 2022 at 22:55 Comment(14)
"reinterpret_cast. This is undefined behavior" Why? They come from C. Are the classes not standard-layout types? "this is not possible here, since libModern cannot allocate dynamic memory" Allocate on the stack then. Why can't you just legacy_input(&s.foo);? - Templin
I don't know the C standard as well, but in C++ this definitely violates the aliasing rules. In theory, the compiler could "optimize" the output_ub() function into this: auto output_ub() noexcept -> ModernBar { static ModernBar s; legacy_output(reinterpret_cast<LegacyBar*>(&s)); return ModernBar{}; } Because it knows that legacy_output cannot dereference the pointer I'm giving it, and therefore s cannot be modified, and hence I'm returning an uninitialized ModernBar. - Moralize
What about #30618019? Also, for "allocate on stack": #4922432. - Templin
Have you considered compiling with -fno-strict-aliasing? - Sap
Yes. But that would mean I'm penalizing the optimization of the entire libModern because I couldn't find a way that is standard-compliant. - Moralize
If you want to know whether the reinterpret_cast is accepted by all common compilers, the answer is yes, even if some will require a specific option in order not to croak. And I would bet a coin that they will continue to support it for a good while, because it used to be a common idiom (even though it has always violated the ISO standards, even the C ones) and rejecting it would break a lot of legacy but still-in-use code. If you want to know whether it can be done in a standard-conformant way, you could probably add the language-lawyer tag. - Adama
BTW, -fno-strict-aliasing is only relevant for the compilation phase, and you can link together modules compiled with or without it. So you can safely keep strict aliasing for all the other modules of your library. - Adama
@SergeBallesta: I generally try to avoid non-compliant code. It feels wrong, in particular since having compilers support this kind of code seems to prevent legitimate optimization strategies that might benefit everyone. I have added a 3rd question about other standard-compliant ways to do this and the tag you mentioned. Maybe someone has another idea. - Moralize
Ad -fno-strict-aliasing: I wasn't aware that this can safely be "contained" within a compilation module. I'll consider it. Thanks! - Moralize
@Templin There's no reinterpret_cast in C, and in C++ using it for type punning invokes UB. See: Is reinterpret_cast type punning actually undefined behavior?, Why is this type punning not undefined behavior?, Is reinterpret_cast actually good for anything?, Is this type punning well-defined?, news.ycombinator.com/item?id=17242553. If it were OK, there would be no need to introduce std::bit_cast. - Drool
How have you established that the overhead of converting between legacy and modern types is significant and needs to be avoided? As it is at present, you've defaulted to using low-level approaches that all have the potential for undefined behaviour, but have not actually demonstrated that the interface between libModern and libLegacy (or an alternative libLegacy2 that may be dropped in to replace libLegacy) gives a measurable performance concern that justifies using those low-level approaches. Examining machine code doesn't provide that evidence; performance testing or profiling does. - Oas
"the pimpl idiom [..] this is not possible here, since libModern cannot allocate dynamic memory". It might be seen as ugly, but you can do it without allocation. Your libModern classes would hold a buffer that is large enough, and you would use placement new... The interface would have to change, as you no longer have direct members to use, but getters/setters are possible. - Muddler
@Drag-On: I am unaware of any compiler that can't be configured to extend the semantics of the language by supporting type-punning constructs beyond those mandated by the Standard. The reason the Standard doesn't explicitly recognize such a dialect is that it is obvious how it should behave, and there was no need to expend ink telling compiler writers to do something they would do with or without such specification. - Pretonic
@Muddler I thought about that as well, but then dismissed it as too complicated. The modern type would have to know the size and alignment of the legacy type without actually naming the legacy type (since that would leak the #include to users of the library). This is probably only feasible using code generation. - Moralize
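The allocation-free pimpl idea from the comments could look roughly like the sketch below. It is not taken from the thread: `OpaqueBar`, `OpaqueBarAccess`, `storage_size`, and `storage_align` are hypothetical names. The hard-coded size and alignment are assumptions for a typical platform, checked by static_asserts in the part that would live in the implementation file. The legacy types are repeated as stand-ins so the snippet compiles on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <type_traits>

// ---- Part a public libModern header could contain (names no libLegacy type) ----
class OpaqueBar {  // hypothetical name
public:
    // Assumed values; the implementation part below verifies them.
    static constexpr std::size_t storage_size  = 16;
    static constexpr std::size_t storage_align = 4;

    OpaqueBar() noexcept;
    OpaqueBar(OpaqueBar const& other) noexcept;
    OpaqueBar& operator=(OpaqueBar const& other) noexcept;

    std::uint32_t x() const noexcept;
    void set_x(std::uint32_t value) noexcept;

private:
    friend struct OpaqueBarAccess;  // defined only in the implementation part
    alignas(storage_align) std::byte storage_[storage_size];
};

// ---- Part that would live in a .cpp file and include the libLegacy header ----
struct LegacyFoo { std::uint32_t x; std::uint32_t y; std::uint32_t flags; };
struct LegacyBar { LegacyFoo foo; float a; };

static_assert(sizeof(LegacyBar) <= OpaqueBar::storage_size, "buffer too small");
static_assert(alignof(LegacyBar) <= OpaqueBar::storage_align, "buffer under-aligned");
static_assert(std::is_trivially_destructible_v<LegacyBar>, "no destructor call needed");

struct OpaqueBarAccess {
    // The LegacyBar was created in the buffer by placement new, so
    // std::launder yields a usable pointer to that object.
    static LegacyBar* get(OpaqueBar& b) noexcept {
        return std::launder(reinterpret_cast<LegacyBar*>(b.storage_));
    }
    static LegacyBar const* get(OpaqueBar const& b) noexcept {
        return std::launder(reinterpret_cast<LegacyBar const*>(b.storage_));
    }
};

OpaqueBar::OpaqueBar() noexcept {
    ::new (static_cast<void*>(storage_)) LegacyBar{};
}
OpaqueBar::OpaqueBar(OpaqueBar const& other) noexcept {
    ::new (static_cast<void*>(storage_)) LegacyBar(*OpaqueBarAccess::get(other));
}
OpaqueBar& OpaqueBar::operator=(OpaqueBar const& other) noexcept {
    *OpaqueBarAccess::get(*this) = *OpaqueBarAccess::get(other);
    return *this;
}
std::uint32_t OpaqueBar::x() const noexcept { return OpaqueBarAccess::get(*this)->foo.x; }
void OpaqueBar::set_x(std::uint32_t value) noexcept { OpaqueBarAccess::get(*this)->foo.x = value; }

// Usage sketch.
inline void opaque_bar_demo() {
    OpaqueBar b;
    b.set_x(7);
    OpaqueBar copy = b;
    assert(copy.x() == 7);
}
```

The trade-off matches the objection raised in the comments: the two constants have to be kept in sync with libLegacy by hand or by a code generator, and the static_asserts only turn a mismatch into a build error.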
D
7

In your compiler explorer link, Clang produces the same code for all output cases. I don't know what problem GCC has with std::bit_cast in that situation.

For the input case, the three functions cannot produce the same code, because they have different semantics.

With input_ub, the call to legacy_input may be modifying the caller's object. This cannot be the case in the other two versions. Therefore the compiler cannot optimize away the copies, not knowing how legacy_input behaves.

If you pass by-value to the input functions, then all three versions produce the same code at least with Clang in your compiler explorer link.

To reproduce the code generated by the original input_ub you need to keep passing the address of the caller's object to legacy_input.

If legacy_input is an extern "C" function, then I don't think the standards specify how the object models of the two languages are supposed to interact in this call. So, for the purpose of the language-lawyer tag, I will assume that legacy_input is an ordinary C++ function.

The problem with passing &s directly is that there is generally no LegacyBar object at the same address that is pointer-interconvertible with the ModernBar object. So if legacy_input tries to access LegacyBar members through the pointer, that would be UB.

Theoretically you could create a LegacyBar object at the required address, reusing the object representation of the ModernBar object. However, since the caller presumably will expect there to still be a ModernBar object after the call, you then need to recreate a ModernBar object in the storage by the same procedure.

Unfortunately, though, you are not always allowed to reuse storage in this way. For example, if the passed reference refers to a const complete object, that would be UB, and there are other requirements. Another problem is whether the caller's references to the old object will refer to the new object, that is, whether the two ModernBar objects are transparently replaceable. This would also not always be the case.

So in general I don't think you can achieve the same code generation without undefined behavior if you don't put additional constraints on the references passed to the function.

Duckworth answered 14/1, 2022 at 2:28 Comment(6)
Good answer. For input, couldn't you achieve the same semantics by memcpy'ing the effect of legacy_input back to a const_cast'ed pointer to the input, if one assumes that the input object is not part of a const object? - Minard
@JeffGarrett That is one way that I think could implement the scheme I described. But you still have the problem that the function replaces the old ModernBar object with a new one. The pointer/glvalue which the caller held to the old object might not automatically refer to the new one. For example that is not the case if the ModernBar object is a base class subobject, see eel.is/c++draft/basic#life-8. If my understanding is correct, the function should send the pointer through std::launder and return that. The caller should then only use the returned pointer. - Duckworth
@JeffGarrett But even then, for example if the ModernBar object is a subobject of another object, creating a LegacyBar object in it will cause the containing object's lifetime to end, causing even more problems: eel.is/c++draft/basic#life-1.5 (And there may be more problematic cases that I didn't consider. Of course all of this is strict reading of the standard, not how implementations actually work.) - Duckworth
@Duckworth My first thought was libLegacy couldn't possibly legally modify the pointed-to memory due to the pointer being const. But of course that isn't true since technically it could const_cast away the constness and that would be legal if the original object wasn't declared const. In reality, there is no way it could know that, but I see why the semantics are different from the compiler's view. - Moralize
@Duckworth Could you elaborate a little on what you mean by "const-complete object"? I am not aware of this term. - Moralize
@Moralize A complete object is an object that is not a member, base class, or array element of another object. If it has a const type it is a const complete object (don't know why I put the hyphen there). Creating new objects in storage previously occupied by such an object is not allowed, except if it was dynamically allocated. (eel.is/c++draft/basic#life-10). - Duckworth
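One way to read the copy-back suggestion from these comments is sketched below. It is an illustration only and does not settle the lifetime questions raised above. `input_copy_in_out` is a hypothetical name, the `legacy_input` stub deliberately writes through its const pointer to model the worst case the compiler has to assume, and the function carries the assumed precondition that the referenced ModernBar is not a const object.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>   // std::memcpy

typedef std::uint32_t LegacyFlags;
struct LegacyFoo { std::uint32_t x; std::uint32_t y; LegacyFlags flags; };
struct LegacyBar { LegacyFoo foo; float a; };

enum class ModernFlags : std::uint32_t { first_flag = 0, second_flag = 1 };
struct ModernFoo { std::uint32_t x; std::uint32_t y; ModernFlags flags; };
struct ModernBar { ModernFoo foo; float a; };

static_assert(sizeof(ModernBar) == sizeof(LegacyBar));

// Stub standing in for libLegacy. It casts away const and writes, which is
// allowed as long as the object it points to was not declared const.
void legacy_input(LegacyBar const* s) { const_cast<LegacyBar*>(s)->foo.x += 1; }

// Assumed precondition: `s` does not refer to a const object.
void input_copy_in_out(ModernBar const& s) noexcept {
    LegacyBar ls;
    std::memcpy(&ls, &s, sizeof(ls));                          // copy in
    legacy_input(&ls);                                         // callee may modify ls
    std::memcpy(const_cast<ModernBar*>(&s), &ls, sizeof(ls));  // copy back
}

// Usage sketch.
inline void copy_in_out_demo() {
    ModernBar m{{41, 0, ModernFlags::first_flag}, 0.0f};
    input_copy_in_out(m);
    assert(m.foo.x == 42);  // the callee's write became visible to the caller
}
```

Unlike the plain memcpy version from the question, the caller observes any write the callee made, which is the semantic difference the answer attributes to input_ub. The price is two copies instead of one.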
D
3

Most non-MSVC compilers support an attribute called __may_alias__ that you can apply to the modern types:

struct ModernFoo {
    std::uint32_t x;
    std::uint32_t y;
    ModernFlags flags;
    // More data
} __attribute__((__may_alias__));

struct ModernBar {
    ModernFoo foo;
    float a;
    // more data
} __attribute__((__may_alias__));

Of course, some optimizations can't be done when aliasing is allowed, so use it only if the resulting performance is acceptable.

Godbolt link
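A possible use of the attribute for the output direction is sketched below. It relies on GCC and Clang extension semantics rather than on the C++ standard. `output_may_alias` is a hypothetical name, and `legacy_output` is a stub that fills in fixed values so the snippet runs on its own.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint32_t LegacyFlags;
struct LegacyFoo { std::uint32_t x; std::uint32_t y; LegacyFlags flags; };
struct LegacyBar { LegacyFoo foo; float a; };

enum class ModernFlags : std::uint32_t { first_flag = 0, second_flag = 1 };

// GCC/Clang extension: accesses through these types are assumed to be able
// to alias objects of any other type.
struct ModernFoo {
    std::uint32_t x;
    std::uint32_t y;
    ModernFlags flags;
} __attribute__((__may_alias__));

struct ModernBar {
    ModernFoo foo;
    float a;
} __attribute__((__may_alias__));

static_assert(sizeof(ModernBar) == sizeof(LegacyBar));

// Stub standing in for libLegacy: stores fixed data in *s.
void legacy_output(LegacyBar* s) { *s = LegacyBar{{1, 2, 1}, 0.5f}; }

auto output_may_alias() noexcept -> ModernBar {
    LegacyBar ls;
    legacy_output(&ls);
    // Read the LegacyBar storage through the may_alias type.
    return *reinterpret_cast<ModernBar const*>(&ls);
}

// Usage sketch.
inline void may_alias_demo() {
    ModernBar m = output_may_alias();
    assert(m.foo.flags == ModernFlags::second_flag);
}
```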

Drool answered 27/3, 2022 at 15:51 Comment(1)
That's pretty cool, I wasn't aware of this attribute! This might allow for a finer granularity than just compiling with -fno-strict-aliasing. - Moralize
P
2

Programs which would ever have any reason to access storage as multiple types should be processed using -fno-strict-aliasing or equivalent on any compiler that doesn't limit type-based aliasing assumptions around places where a pointer or lvalue of one type is converted to another, even if the program uses only corner-case behaviors mandated by the Standard. Using such a compiler flag will guarantee that one won't have type-based-aliasing problems, while jumping through hoops to use only standard-mandated corner cases won't. Both clang and gcc are sometimes prone to both:

  1. have one phase of optimization change code whose behavior would be mandated by the Standard into code whose behavior isn't mandated by the Standard but would be equivalent in the absence of further optimization, and then

  2. have a later phase of optimization further transform the code in a manner that would have been allowable for the version of the code produced by #1 but not for the code as it was originally written.

If using -fno-strict-aliasing on straightforwardly-written source code yields machine code whose performance is acceptable, that's a safer approach than trying to jump through hoops to satisfy constraints that the Standard allows compilers to impose in cases where doing so would allow them to be more useful [or--for poor quality compilers--in cases where doing so would make them less useful].

Pretonic answered 14/1, 2022 at 19:24 Comment(0)
V
0

You could create a union with a private member to restrict access to the legacy representation:

union UnionBar {
    struct {
        ModernFoo foo;
        float a;
    };
private:
    LegacyBar legacy;

    friend LegacyBar const* to_legacy_const(UnionBar const& s) noexcept;
    friend LegacyBar* to_legacy(UnionBar& s) noexcept;
};


LegacyBar const* to_legacy_const(UnionBar const& s) noexcept {
    return &s.legacy;
}

LegacyBar* to_legacy(UnionBar& s) noexcept {
    return &s.legacy;
}


void input_union(UnionBar const& s) noexcept {
    legacy_input(to_legacy_const(s));
}

auto output_union() noexcept -> UnionBar {
    UnionBar s;
    legacy_output(to_legacy(s));
    return s;
}

The input/output functions are compiled to the same code as the reinterpret_cast-versions (using gcc/clang):

input_union(UnionBar const&):
        jmp     legacy_input

and

output_union():
        sub     rsp, 56
        lea     rdi, [rsp+16]
        call    legacy_output
        mov     rax, QWORD PTR [rsp+16]
        mov     rdx, QWORD PTR [rsp+24]
        add     rsp, 56
        ret

Note that this uses anonymous structs and requires you to include the legacy header, which you mentioned you do not want. Also, I lack the experience to be fully confident that there's no hidden UB, so it would be great if someone else would comment on that :)

Veda answered 17/1, 2022 at 17:18 Comment(1)
Type punning via unions is always UB in C++ (but it is different in C). Of course this doesn't mean that a compiler won't make extra guarantees that it will work. - Duckworth
