Contiguous storage of polymorphic types

I'm interested to know if there is any viable way to contiguously store an array of polymorphic objects, such that virtual methods on a common base can be legally called (and would dispatch to the correct overridden method in a subclass).

For example, considering the following classes:

struct B {
  int common;
  int getCommon() { return common; }
  virtual int getVirtual() const = 0;
};

struct D1 : B {
  virtual int getVirtual() const final { return 5; }
};

struct D2 : B {
  int d2int;
  virtual int getVirtual() const final { return d2int; }
};

I would like to allocate a contiguous array of D1 and D2 objects, and treat them as B objects, including calling getVirtual() which will delegate to the appropriate method depending on the object type. Conceptually this seems possible: each object knows its type, typically via an embedded vtable pointer, so you could imagine storing n objects in an array of n * max(sizeof(D1), sizeof(D2)) unsigned char, using placement new to construct the objects (and explicit destructor calls to destroy them), and casting the unsigned char pointer to B*. I'm pretty sure such a cast is not legal, however.
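To make that concrete, here is a minimal sketch of the byte-buffer idea, reusing the B, D1 and D2 definitions above (the names cell_size, at and fill, and the capacity of 16, are just for illustration); the reinterpret_cast at the end is exactly the step whose legality is in doubt:

#include <algorithm>
#include <cstddef>
#include <new>

// One fixed-size, suitably aligned "cell" per object.
constexpr std::size_t cell_size  = std::max(sizeof(D1), sizeof(D2));
constexpr std::size_t cell_align = std::max(alignof(D1), alignof(D2));

alignas(cell_align) unsigned char storage[16 * cell_size];  // room for 16 objects

B* at(std::size_t i) {
    // The questionable step: treating the raw cell as a B*.
    return reinterpret_cast<B*>(storage + i * cell_size);
}

void fill() {
    new (storage + 0 * cell_size) D1{};   // placement new into cell 0
    new (storage + 1 * cell_size) D2{};   // placement new into cell 1
    int x = at(0)->getVirtual();          // "works" on common ABIs, but not sanctioned
    (void)x;                              // (explicit destructor calls omitted)
}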

One could also imagine creating a union like:

union Both {
  D1 d1;
  D2 d2;
};

and then creating an array of Both, and using placement new to create the objects of the appropriate type. This again doesn't seem to offer a way to actually call B::getVirtual() safely, however. You don't know the last-stored type for each element, so how are you going to get your B*? You need to use either &u.d1 or &u.d2, but you don't know which! There are actually special rules about the "common initial sequence" which let you do some things with unions whose members share some common traits, but these only apply to standard-layout types. Classes with virtual methods are not standard-layout types.
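For comparison, the way to make the union idea well-defined is to carry the discriminant yourself, i.e. a hand-rolled tagged union. A rough sketch, again reusing B, D1 and D2 from above (the Tagged and Kind names are made up), which pays for exactly the extra tag this question would like to avoid:

#include <utility>

struct Tagged {
    enum class Kind { d1, d2 } kind;      // the explicit discriminant
    union { D1 d1; D2 d2; };              // C++11 unrestricted anonymous union

    explicit Tagged(D1 v) : kind(Kind::d1), d1(std::move(v)) {}
    explicit Tagged(D2 v) : kind(Kind::d2), d2(std::move(v)) {}
    ~Tagged() { if (kind == Kind::d1) d1.~D1(); else d2.~D2(); }

    B& base() {                           // dispatch on the tag, not on a blind cast
        return kind == Kind::d1 ? static_cast<B&>(d1) : static_cast<B&>(d2);
    }
};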

Is there any way to proceed? Ideally, a solution would look something like a non-slicing std::vector<B> that can actually contain polymorphic subclasses of B. Yes, if required one might stipulate that all possible subclasses are known up front, but a better solution would only need to know the maximum likely size of any subclass (and fail at compile time if someone tries to add a "too big" object).

If it isn't possible to do with the built-in virtual mechanism, other alternatives that offer similar functionality would also be interesting.

Background

No doubt someone will ask "why", so here's a bit of motivation:

It seems generally well-known that using virtual functions to implement runtime polymorphism comes at a moderate overhead when actually calling virtual methods.

Not as often discussed, however, is the fact that using classes with virtual methods to implement polymorphism usually implies a totally different way of managing the memory for the underlying objects. You cannot just add objects of different types (but with a common base) to a standard container: if you have subclasses D1 and D2, both derived from base B, a std::vector<B> would slice any D1 or D2 objects added. Similarly for arrays of such objects.

The usual solution is to instead use containers or arrays of pointers to the base class, like std::vector<B*> or perhaps std::vector<unique_ptr<B>> or std::vector<shared_ptr<B>>. At a minimum, this adds an extra indirection when accessing each element1, and in the case of the smart pointers, it breaks common container optimizations. If you are actually allocating each object via new and delete (including indirectly), then the time and memory cost of storing your objects just increased by a large amount.

Conceptually it seems like various subclasses of a common base can be stored consecutively (each object would consume the same amount of space: that of the largest supported object), and that a pointer to an object could be treated as a base-class pointer. In some cases, this could greatly simplify and speed up the use of such polymorphic objects. Of course, in general, it's probably a terrible idea, but for the purposes of this question let's assume it has some niche application.


1 Among other things, this indirection pretty much prevents any vectorization of the same operation applied to all elements and harms locality of reference with implications both for caching and prefetching.

Outrank answered 9/10, 2017 at 4:48 Comment(14)
As you say, generally a terrible idea. For this question: Fastest implementation of simple, virtual, observer-sort of, pattern in c++?, the right answer was to have separate containers of B and C objects, so you can loop over all the Bs and then all the Cs, with the Update() method inlined and auto-vectorized thanks to static typing. (And smaller objects: no vtable pointer). See @Bee's answer there and discussion on mine for some of the inspiration for this question.Shipmate
Well the "right answer" depends on the context: of course if you didn't need virtual dispatch and can maintain N separate collections, then that's a fine approach! It's an answer to a question different than the OP asked, however. Really I just say "terrible idea" in a probably fruitless attempt to ward off all the "why"? did you measure? premature optimization!!! comments... I do actually think that ordered, contiguous storage of polymorphic objects might be quite useful in some cases, especially when there are both polymorphic and non-polymorphic uses.Outrank
The OP there did just want separate containers, but hadn't thought of it. I meant it was the solution to the XY problem the OP was having. This is a good question for the cases where you do need ordering. I haven't yet looked at the asm you get from std::visit over a container of std::variant<D1,D2>. Use that instead of a vtable, though, (with non-virtual overrides) because variant uses its own tag, not a vtable ptr.Shipmate
I suppose one approach is to use the big array of unsigned char and allocate each object with placement new, and then basically assert that casting the unsigned char pointer to B * yields the same value as casting the pointer of the actual most-derived type (e.g., D1) to B *. Most likely such a check simply generates no actual code, but then at least you can be sure that just casting the "cell" for the object to B * works (i.e., catches multiple inheritance cases where this breaks).Outrank
Here's the asm for the std::variant approach. It's not terrible. It cuts out at least one level of indirection versus the "vector of pointers" approach. It basically uses the variant index value to index into a jump table - so very much like a vtable except you don't actually need to look up the vtable to use. There is some extra code to check that the index isn't -1 indicating an "empty" variant, and the functions jumped to appear to do a check that the variant has the "expected" type which seems unnecessary (you just dispatched there...). @PeterCordesOutrank
Yeah, that's pretty gross for only 2 alternatives. Supporting the "uninitialized" state seems to hurt. I was hoping it would inline the two possible functions and select between them based on the type byte. You might be able to get that behaviour by writing D1s *tmp = std::get_if<D1s>(&b); and so on to explicitly try the two alternatives instead of using std::visit dispatch, with no throwing. godbolt.org/g/TjuSCn has an attempt that isn't working. (I think it always returns nullptr). holds_alternative is another possibility, but then you still have to access the object.Shipmate
@PeterCordes - this is probably what you wanted: I had return 0 instead of return total in the original code which optimized everything away, oops. Not too bad, but still has an "extra" check for the type in the else case (I guess it needs it, to support the empty case). At least everything is inlined. Probably just building this yourself with a discriminated union would be better.Outrank
You could use this for a few important optimized loops, and still have all the functionality of std::variant for other use-cases. The version you linked checked for D1s in both branches, so it got optimized away because NULL-dereference is UB. So there was no loop, just 5 * size() or something. Fixed version: godbolt.org/g/GrYnsA. clang5.0 is very branchy; clang4.0 uses a cmov. gcc includes a UD2 for the null-pointer deref possibility. Interesting idea to just use the get_if<D2s> result unconditionally so there's UB if it isn't D1s or D2s, like __builtin_unreachable().Shipmate
Oops, I had the right code ultimately but linked that earlier wrong version. I tried also get<D2s> and the code was about the same, with an exception path instead of ud2. Too bad gcc doesn't use the UB to go back in time and just avoid the check entirely (i.e., it can determine that get_if must not return null, hence get_if must return exactly the pointer to the contained D2 object...)Outrank
It's doing that on purpose. Use -fno-isolate-erroneous-paths-dereference (godbolt.org/g/KUsmLs) to treat a null-deref as __builtin_unreachable() instead. (Documented under the positive form, enabled at -O2 and higher.)Shipmate
@PeterCordes - cool, I learned something new, but it still doesn't work as I'd hope with that option. Look at what gcc is doing at label .L4: it's still checking the type of the object, and now it's just cmoving either 0 (rdi) or the object address prior to dereferencing it. So it has explicitly preserved the null access and has to do a lot of work to get there. All that code should be eliminated and the subsequent add should simply be add eax, DWORD PTR [rdx+4]. Doing it manually seems to work though, code is almost optimal.Outrank
Ah crap, yup I just saw the ud2 go away and hoped all those extra instructions were doing something good. :P So it doesn't assume that null-deref doesn't happen; i.e. it's not the same as __builtin_unreachable(). You have to use __builtin_unreachable() manually to tell gcc when something can't happen, and then you do get nice-ish branchy code with no extra checks: godbolt.org/g/JpHE4q. (But beware of if(x) __builtin_unreachable() with ICC: it will actually check and jcc to just after the ret in the current function, last I looked. So it slows you down and sucks.)Shipmate
@peter yup, see my "manually" link above. It's weird though, because certainly in other scenarios GCC uses "null deref is UB" to optimize things, including all the various "backwards in time" type optimizations that are so confusing.Outrank
Oh, I didn't look, I thought "manually" was going to be with your own tagged union, not using variant :PShipmate

You were almost there with your union. You can use either a tagged union (add an if to discriminate in your loop) or a std::variant (it introduces a kind of double dispatch through std::visit to get the object out of it) to do that. In neither case do you have allocations on the dynamic storage, so data locality is guaranteed.
Anyway, as you can see, in either case you can replace the extra level of indirection (the virtual call) with a plain direct call. You need to erase the type somehow (polymorphism is nothing more than a kind of type erasure, if you think about it), and you cannot get an erased object back out directly with a simple call. ifs or extra calls are required to fill the gap left by that level of indirection.

Here is an example that uses std::variant and std::visit:

#include<vector>
#include<variant>

struct B { virtual void f() = 0; };
struct D1: B { void f() override {} };
struct D2: B { void f() override {} };

void f(std::vector<std::variant<D1, D2>> &vec) {
    for(auto &&v: vec) {
        std::visit([](B &b) { b.f(); }, v);
    }
}

int main() {
    std::vector<std::variant<D1, D2>> vec;
    vec.push_back(D1{});
    vec.push_back(D2{});
    f(vec);
}

Since it's really close, it isn't worth also posting an example that uses tagged unions.


Another way to do that is by means of separate vectors for the derived classes and a support vector to iterate them in the right order.
Here is a minimal example that shows it:

#include<vector>
#include<functional>

struct B { virtual void f() = 0; };
struct D1: B { void f() override {} };
struct D2: B { void f() override {} };

void f(std::vector<std::reference_wrapper<B>> &vec) {
    for(auto &w: vec) {
        w.get().f();
    }
}

int main() {
    std::vector<std::reference_wrapper<B>> vec;
    std::vector<D1> d1;
    std::vector<D2> d2;

    d1.push_back({});
    vec.push_back(d1.back());
    d2.push_back({});
    vec.push_back(d2.back());

    f(vec);
}
Ski answered 9/10, 2017 at 5:24 Comment(15)
The problem with a tagged union is that (a) it can take a lot of space for the tag (often more than you'd expect due to alignment), yet the objects already know their own type, so it is kind of redundant, and (b) it means removing the virtual nature of the call (or leaving it and suffering double the cost), which might not be OK since the objects may be used in other, more normal scenarios where virtual-based polymorphism is required.Outrank
@Outrank Actually no. Because you know the exact type, compilers can devirtualize the call. You can leave the virtual method there. Also, std::variant is a kind of compile-time, type-safe tagged union equivalent, but it should occupy less. Have a try with it and std::visit when iterating.Ski
Well partly yes - if it is declared final. Otherwise the compiler can at least speculatively de-virtualize it (i.e., inline the known implementation guarded by a type-check). Probably just using final is OK... I will have to check out std::variant.Outrank
@Outrank Well, you want to have your cake and eat it too. You must do something!! :-)Ski
Yes, I want both :). It's worth noting that current implementations actually "just work" if you just store the objects contiguously (in a field of a fixed size) and cast to B *. Of course it's UB, but it's certainly fundamentally possible to do this without needing an additional discriminant. So I can have cake and eat it with UB, or in assembly and my question is mostly whether we can get there in C++. I guess I'm not too worried about the cost of the virtual methods, but I want contiguous storage so that operations on say B::common can be efficient and perhaps even vectorized.Outrank
As far as I can tell std::variant doesn't offer any magic here: it also embeds a field which tracks which of the alternatives is present. So I don't see it saving space.Outrank
@Outrank Another solution would be to store A and B in different vectors, then get a vector of indices that tells you what's their order of iteration.Ski
@Outrank See std::visit for the magic of a std::variant. As I said, it is a kind of safer, compile-time tagged union; how it stores and gets data is completely implementation-defined, so do not make assumptions about that.Ski
I'm not seeing the magic there. No doubt std::visit simply consults the variant::index() value and calls the appropriate visitor. Yes, it's neat. Yes, it's safer. No, I don't see how any implementation could implement it in general without additional state for variants with 2 or more alternatives. Do you have any examples of such implementations? Indeed, it seems mathematically impossible to me: there are no "extra bits" in many cases, yet variants must know their type and also store the valueless_by_exception state (which probably means even variants with 1 alternative need extra space)Outrank
@Outrank You can find several pre-C++17 implementations of a variant online. I agree that it works probably as a tagged union, but I'd bet also that a compiler has more chances to perform optimizations on a tool from the standard library. That being said, all of these are speculative thoughts made at 8AM on the bus while going to work, so keep them for what they are... :-)Ski
@BeeOnRope: gcc and clang suck at auto-vectorizing over one member of an array of structs. e.g. instead of adding set(0,0,1,0) to load/store the vtable or unaffected struct member (2 or 4x 16-byte structs per clock with AVX2 or AVX512), they either go scalar or (with AVX512) shuffle a bunch to do one vpaddd and then un-shuffle or scatter (significantly less than 1 element per clock, worse than scalar). godbolt.org/g/2hLUfV, scroll-to-asm for the loop over C in separate_containers; it has a vtable pointer. But same thing for struct {int a,b,c,d;}; won't load/store others.Shipmate
@PeterCordes - apart from the fact that the optimizer may not be able to handle it, it would often violate the memory model to overwrite unaffected members even with the "same value". Separate structure members are separate locations for the purposes of the model and can't be seen to change asynchronously by another thread, so many such transformations that involve writes would be illegal unless the compiler can prove the value is thread-private. This probably doesn't apply to vtables which are constant. Read-only vectorization with ignored values is fine though.Outrank
@BeeOnRope: Oh, I was thinking that separate threads weren't allowed to access separate members of the same object, and that it would be safe for any object without atomic or volatile members. But you're right, that wouldn't cause a data race (which is the language the standard uses to actually set the rules), so I guess compilers do have to strictly limit themselves even within a single object, except for actual padding (or the vtable pointer) :/Shipmate
Yes, it raises the issue that it might be useful to have a keyword like thread_restrict or something which tells the compiler the object won't be used concurrently on any other thread, much like how restrict can be used to communicate non-aliasing. Apparently even vtable pointers can mutate, via placement new overwriting the object, and there are gcc bugs about it. It might not matter for this case though since it implies overwriting the whole object (so there is a race if this is done concurrently).Outrank
@Ski - yes, it works and I am using this solution now in various places. It has some downsides, such as needing to know the exact types of all classes beforehand (this downside is shared with many other solutions), having an extra layer of indirection while iterating, having worse locality of reference, having the overhead of the pointers/refs in the ordered array and making manipulation of the "collection" complex (e.g., if you delete an element, you need to fix up the storage arrays which may be very slow without "back pointers"). It's a contender though...Outrank

Here is an attempt to implement what you want without memory overhead:

#include <cassert>
#include <cstddef>
#include <new>
#include <type_traits>
#include <utility>

template <typename Base, std::size_t MaxSize, std::size_t MaxAlignment>
struct PolymorphicStorage
{
public:
    template <typename D, typename ...Ts>
    D* emplace(Ts&&... args)
    {
        static_assert(std::is_base_of<Base, D>::value, "Type should inherit from Base");
        auto* d = new (&buffer) D(std::forward<Ts>(args)...);
        assert(&buffer == reinterpret_cast<void*>(static_cast<Base*>(d)));
        return d;
    }

    void destroy() { get().~Base(); } // note: relies on Base having a virtual destructor

    const Base& get() const { return *reinterpret_cast<const Base*>(&buffer); }
    Base& get() { return *reinterpret_cast<Base*>(&buffer); }

private:
    std::aligned_storage_t<MaxSize, MaxAlignment> buffer;
};

Demo

The problem is that the copy/move constructors (and assignments) are incorrect, and I don't see a correct way to implement them without memory overhead (or additional restrictions on the classes).

I cannot =delete them, else you cannot use them in std::vector.

With memory overhead, a variant then seems simpler.
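One conventional way to recover copy/move support without per-object overhead (hinted at in the comments below) is to require every class in the hierarchy to provide virtual "construct yourself into this buffer" operations; a rough sketch, where clone_into/move_into are hypothetical names added to the hierarchy, which PolymorphicStorage's copy/move constructors could then call:

#include <new>
#include <utility>

struct Base {
    virtual ~Base() = default;
    virtual void clone_into(void* dst) const = 0;   // placement-copy into dst
    virtual void move_into(void* dst) = 0;          // placement-move into dst
};

struct Derived : Base {
    int value = 0;
    void clone_into(void* dst) const override { new (dst) Derived(*this); }
    void move_into(void* dst) override { new (dst) Derived(std::move(*this)); }
};

// PolymorphicStorage's copy constructor could then be written roughly as:
//   PolymorphicStorage(const PolymorphicStorage& other) { other.get().clone_into(&buffer); }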

Papistry answered 9/10, 2017 at 16:1 Comment(4)
Hmm, that's kind of along the lines I was thinking (except that probably I'd want a container that can store N objects, not just 1). Good point about copy/move constructors. To do that properly you need to recover the original type in a static way (unless of course you require each object to declare its own virtual copy-constructor-like methods to do this). The idea of the assert is to check that the representation of a D * as a B * is bit-wise identical or something like that?Outrank
First, I was not sure about the return value of placement new, but that seems not to be problematic. Secondly, the part which is really problematic is cases like struct Derived : Base1, Base {}; where there can be an offset between Base and Derived.Papistry
"Secondly, the part which is really problematic" this can be easily solved by storing a Base* ( downcasted inside emplace() from a D* ) and using it in place or reinterpret_cast() in get()Abutilon
@MassimilianoJanes: "I don't see a correct way to implement them without memory overhead": adding the overhead of a pointer to Base would indeed solve that issue, but that introduces memory overhead. With memory overhead, variant allows copy.Papistry

So, this is really ugly, but if you're not using multiple inheritance or virtual inheritance, a Derived * in most implementations is going to have the same bit-level value as a Base *.

You can test this with a static_assert so things fail to compile if that's not the case on a particular platform, and use your union idea.

#include <cstdint>

class Base {
 public:
   virtual bool my_virtual_func() {
      return true;
   }
};

class DerivedA : public Base {
};

class DerivedB : public Base {
};

namespace { // Anonymous namespace to hide all these pointless names.
constexpr DerivedA a;
constexpr const Base *bpa = &a;
constexpr DerivedB b;
constexpr const Base *bpb = &b;

constexpr bool test_my_hack()
{
   using ::std::uintptr_t;

   {
      const uintptr_t dpi = reinterpret_cast<uintptr_t>(&a);
      const uintptr_t bpi = reinterpret_cast<uintptr_t>(bpa);
      static_assert(dpi == bpi, "Base * and Derived * !=");
   }
   {
      const uintptr_t dpi = reinterpret_cast<uintptr_t>(&b);
      const uintptr_t bpi = reinterpret_cast<uintptr_t>(bpb);
      static_assert(dpi == bpi, "Base * and Derived * !=");
   }
   // etc...
   return true;
}
}

const bool will_the_hack_work = test_my_hack();

The only problem is that constexpr rules will forbid your objects from having virtual destructors because those will be considered 'non-trivial'. You'll have to destroy them by calling a virtual function that must be defined in every derived class that then calls the destructor directly.

But, if this bit of code succeeds in compiling, then it doesn't matter if you get a Base * from the DerivedA or DerivedB member of your union. They're going to be the same anyway.

Another option is to build a struct that contains a union of your derived classes and, at the beginning, a pointer to a struct full of member function pointers, and to initialize that pointer yourself. Basically, implement your own vtable.
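For illustration, here is a rough sketch of that "roll your own vtable" idea. All names (VTable, Poly, D1s, D2s) are made up, D1s/D2s are non-virtual mirrors of the question's D1/D2, and plain function pointers are used instead of pointers to members to keep it short:

struct D1s { int common; };
struct D2s { int common; int d2int; };

struct VTable {
    int (*get)(const void* self);            // plays the role of a virtual function
};

int d1_get(const void*)      { return 5; }
int d2_get(const void* self) { return static_cast<const D2s*>(self)->d2int; }

const VTable d1_vtable{ &d1_get };           // one static table per "derived" type
const VTable d2_vtable{ &d2_get };

struct Poly {
    const VTable* vt;                        // your own vtable pointer
    union Payload { D1s d1; D2s d2; } data;  // contiguous payload, largest member wins

    int get() const { return vt->get(&data); }
};

// Usage sketch:
//   Poly p;
//   p.vt = &d2_vtable;
//   p.data.d2 = D2s{0, 42};
//   int x = p.get();   // 42, dispatched through the hand-rolled table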

Estivation answered 17/10, 2017 at 14:32 Comment(6)
Thanks. Even with the check, this is UB-but-likely-to-work-in-practice, right?Outrank
@Outrank - yes. UB, but likely to work in practice. The checks are to test for the most likely reasons it wouldn't work. I find it hard to envision an implementation that passed them that didn't work. But my imagination might well be too limited. I presume my answer is clear enough? You seemed knowledgeable enough to fill in the gaps.Estivation
@Outrank - my 'create your own vtable' idea is not UB, but it's also just an aside and not really what this answer is about.Estivation
Yes, it's clear! The POD-with-function-pointers (or pointer to a table of function pointers) is an idea I seem to have been triangulating towards as well. It removes the various reasons for UB but keeps many of the benefits of polymorphism (but not all).Outrank
@Outrank - One of the reasons a pointer to a table is a good idea is that the table can end up in cache which will make the vcalls faster (which is also true of actual polymorphism of course). Also, you can define a bunch of inline member functions for the struct that pull things out of your ersatz vtable and call them with the appropriate pointer fixup, and that would make the solution nearly painless to use (though still a bit of a pain to set up).Estivation
right, it depends on how many "virtual" functions you have. If there's just one, the vtable approach is strictly worse performance-wise, since you suffer the additional indirection, and the objects are the same size (they either have a vtable pointer or a single function pointer). At two functions it's a space vs speed trade-off - directly embedded function pointers are still probably faster, but the objects are larger. Eventually the vtable approach will probably win even in speed, when the bad locality and cache pressure of the direct method becomes too much.Outrank

There was a talk at CppCon 2017, "Runtime Polymorphism - Back to the Basics", that discussed doing something like what you are asking for. The slides are on github and a video of the talk is available on youtube.

The speaker's experimental library for achieving this, "dyno", is also on github.

Demoss answered 18/10, 2017 at 19:1 Comment(6)
Thanks, I sent the speaker an email asking if his video is available.Outrank
This isn't properly an answer. You should at least put in the effort to sum up the content of the talk. Link-only answers are usually flagged on SO.Ski
@Ski you're right. I was aware of that when I made the answer. I was initially only going to post this as a comment, but I didn't think it fit the role of a comment either. I did not try to sum up the content of the talk partly due to laziness and partly because a lot of the content of the talk was over my head.Demoss
To be fair, it's not just a single link. You can interpret this as "You can use the dyno library to do this (which is a valid answer)" plus a link to the library and a talk about the technique. Of course, it would be good to briefly sum up how dyno works, but IMO it meets the bar for an answer.Outrank
I will try to edit my answer when I get some time in the future to give a brief summary.Demoss
@ColinAndrews - the video wasn't in fact available at the time, but it has since been uploaded. I updated your answer to include a link.Outrank

It seems to me that you're looking for a variant, which is a tagged union with safe access.

C++17 has std::variant. For prior versions, Boost offers an equivalent: boost::variant.

Note that the polymorphism is no longer necessary. In this case I have used signature-compatible methods to provide the polymorphism, but you can also provide it through signature-compatible free functions and ADL.

#include <variant>   // use boost::variant if you don't have c++17
#include <vector>
#include <algorithm>

struct B {
  int common;
  int getCommon() const { return common; }
};

struct D1 : B {
  int getVirtual() const { return 5; }
};

struct D2 : B {
  int d2int;
  int getVirtual() const { return d2int; }
};

struct d_like
{
    using storage_type = std::variant<D1, D2>;

    int get() const {
        return std::visit([](auto&& b)
        {
            return b.getVirtual();
        }, store_);
    }

    int common() const { 
        return std::visit([](auto&& b)
        {
            return b.getCommon();
        }, store_);
    };

    storage_type store_;
};

bool operator <(const d_like& l, const d_like& r)
{
    return l.get() < r.get();
}

struct by_common
{
    bool operator ()(const d_like& l, const d_like& r) const
    {
        return l.common() < r.common();
    }
};

int main()
{
    std::vector<d_like> vec;
    std::sort(begin(vec), end(vec));
    std::sort(begin(vec), end(vec), by_common());
}
Mis answered 9/10, 2017 at 6:33 Comment(2)
Thanks Richard. As discussed in the comments to skypjack's answer, the primary issue with std::variant is that it (almost certainly) has a hidden "discriminant" field which bloats the size of the object (in a similar way to a vtable pointer). Sometimes it isn't possible to remove the virtual-based polymorphism from an object (e.g., because it is used in other contexts where that type of polymorphism is needed), so you pay the size costs of both approaches. When virtual can be removed, this seems like a reasonable approach.Outrank
@Outrank Understood. Although, using a variant removes the need for any polymorphism at all, anywhere in the program, if the number of types is known at compile time.Mis
