Object orientation, data orientation, cache pollution and cache obliviousness

In regular object-oriented practice it is not rare for objects to have multiple unrelated member properties. And when objects are processed, it is often done in several passes, each of which is aimed at a different subset of those properties.

In this regard, the typical approach of creating collections of objects doesn't seem very efficient. Considering the way computers access memory and the typical size of a cache line, there is a very good chance the cache is being filled with data that is not needed but simply happened to be adjacent, so it ends up wasting cache capacity and adding stalls and latency to execution.

Even worse is the practice of using polymorphism and allocating objects dynamically, without memory pools or custom allocators. In this case, not only does the cache get filled with unneeded data, but because of the arbitrary addresses used by dynamic memory allocation, the prefetchers fail to work adequately as well.

The salvation is to go back to pre-OOP times and opt for data orientation, which seems to be the choice of preference for the development of performance-critical applications, operating systems and such. But why not use a hybrid of the two? A sort of data-oriented object programming?

After that long overture, let's get to the question at hand. I don't have a massive enough project to test the efficiency of this concept, so the theoretical expertise of the community is very welcome.

What if, instead of objects storing their own data members, they only store a reference to the collections where their data members are stored sequentially, each in its own container, and their member methods return data from those containers? This way the odds of unneeded data ending up on its way to the CPU should be reduced, and the odds of data needed in the near "future" being there should be increased. The logical assumption is that this approach will improve prefetcher efficiency, cache hit rates and cache usage efficiency, and will also reduce the latencies involved in automatic and manual parallelization.
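To make the idea concrete, here is a rough sketch of what I have in mind (ParticleStore/Particle and their fields are purely illustrative names, not from any real code base):

#include <cstddef>
#include <vector>

// All "member" data lives in per-field containers; the object itself is only
// a reference into them.
struct ParticleStore
{
    std::vector<float> pos_x;  // all x positions, contiguous
    std::vector<float> pos_y;  // all y positions, contiguous
    std::vector<float> mass;   // all masses, contiguous
};

class Particle
{
public:
    Particle(ParticleStore& store, std::size_t index)
        : m_store(&store), m_index(index) {}

    // Member functions read from the shared containers instead of local fields.
    float x() const    { return m_store->pos_x[m_index]; }
    float y() const    { return m_store->pos_y[m_index]; }
    float mass() const { return m_store->mass[m_index]; }

private:
    ParticleStore* m_store;  // reference to the collections
    std::size_t    m_index;  // this object's slot in each container
};

// A pass that only needs x positions touches only pos_x; the mass array never
// has to enter the cache at all.
void move_all_x(ParticleStore& store, float dx)
{
    for (std::size_t i = 0; i < store.pos_x.size(); ++i)
        store.pos_x[i] += dx;
}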

What do you think?

A late edit: applying a "data orientation pattern" can be even more beneficial if we take struct and class padding into consideration. If a "model" has a char and an int data member, in OOP fashion the object will be padded, which only pollutes the cache further, but a data-oriented storage model can store all the chars and all the ints sequentially, with no space and no cache wasted at all.
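For illustration, assuming a typical platform where int is 4 bytes with 4-byte alignment (the exact sizes are implementation-defined):

#include <vector>

struct Padded
{
    char c;  // 1 byte
    int  i;  // 4 bytes, preceded by 3 bytes of padding
};
// sizeof(Padded) is typically 8, so 3 out of every 8 bytes pulled into the
// cache for an array of Padded are pure padding.

struct DataOrientedStore
{
    std::vector<char> c;  // 1 byte per "object", tightly packed
    std::vector<int>  i;  // 4 bytes per "object", tightly packed
};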

Roue answered 11/9, 2012 at 8:20 Comment(5)
Here is an interesting and related PDF that I just came across.Vagarious
@Vagarious - thanks for the link, it seems the performance improvement is even more significant than I anticipated. Amazingly, I have never heard anyone talk about the subject in any of the programming materials I've been through...Roue
I know this as the "array of structs" vs "struct of arrays" distinction. If you google for both terms, you'll get quite some material.Commingle
Excellent question, and I too have noticed the lack of material on this subject.Hiroko
In fact the more I think about it, the more I really like this concept. I am just about to start refactoring some underperforming OOP-based C++ code in a high performance project, and I think I'll experiment with this idea. Hopefully I can post some results in a few months.Hiroko

First of all, nice slide presentation. Well, as far as I could understand, your question is quite different from the presentation's approach. Variables are stored randomly in main memory, even object attributes. If you try to allocate memory for a contiguous data structure, your data structure's size is going to be limited by the largest "air bubble" in your main memory, otherwise it won't be purely contiguous. Maybe you thought of something like this:

class MyClass
{
public:
    MyClass()
    {
        m_dataMembers = new GenericObject[DATA_MEMBERS_AMOUNT];

        //initialize array values...
    }

    ~MyClass()
    {
        delete[] m_dataMembers; //avoid leaking the array
    }

    int getMyVar()
    {
        //assumes GenericObject is convertible to int
        return (int)m_dataMembers[MY_VAR_INDEX];
    }

    //others functions...

private:
    GenericObject* m_dataMembers;
};

In this way, you will have some problems. First, you will need a generic object class to store any kind of variable. Then you will need to know where each variable is positioned in your data structure, and you will need to know the type of each variable in your data structure to convert it correctly in the getters. What he actually does in the presentation is reduce his class size, using references, to make it fit better in a cache page and reduce usage of the cache, not of main memory. I hope I haven't misunderstood you.

Pubescent answered 11/9, 2012 at 14:53 Comment(2)
I don't think you understood my concept correctly. BTW, reducing the class size was only the first step of the optimization, and it didn't deliver nearly as much performance improvement as moving to a flat memory model. Basically, my idea is that instead of having objects and members together, like O1M1M2M3M4 O2M1M2M3M4 O3M1M2M3M4, we have an alignment like O1O2O3 M1M1M1 M2M2M2 M3M3M3 M4M4M4, so in a pass that needs the M1 members only, the cache gets more of the objects and more of their M1 members, without being polluted with the M2, M3 and M4 members.Roue
Can you explain further what you mean by "Variables are stored randomly in main memory, even object attributes"? I've never heard of compilers placing data members randomly in memory. See this answer for example. In any case, contiguousness can be guaranteed with containers such as std::vector or custom allocators.Hiroko

The way I see it, polymorphism at the object level is inherently expensive if you use it at the level of very fine, granular objects, like an abstract IPixel interface. In that case, video processing software revolving around IPixel dependencies would be pretty screwed from an efficiency standpoint, as it has no breathing room to optimize. Besides the cost of dynamic dispatch per pixel, even the virtual pointer required here might be larger than the entire pixel itself, doubling or tripling memory use. Further, we can no longer play with pixel representations in ways that go beyond a single pixel, and, most horrific of all, the neighboring pixels in an image might not even be represented contiguously in memory.

Meanwhile IImage can offer plenty of room for optimization, as an image models a collection/container of pixels, and still with plenty of flexibility (ex: a different concrete image representation for each pixel format). Now dynamic dispatch per image is cheap and the size of a virtual pointer is negligible for an entire image. We can also explore how we represent pixels to our heart's content in ways that allow us to process multiple pixels at a time efficiently. So I see it like that, similar to you, as a matter of designing objects at an appropriate level of coarseness, which often implies collections of things, to reduce all the overhead and the barriers that presents to optimizations.
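As a rough illustration of that coarseness (IImage/GrayscaleImage and their methods are just names I'm making up for the sketch):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

class IImage
{
public:
    virtual ~IImage() = default;
    virtual void brighten(int amount) = 0;  // one dynamic dispatch per image
};

class GrayscaleImage : public IImage
{
public:
    explicit GrayscaleImage(std::size_t pixel_count) : m_pixels(pixel_count, 0) {}

    void brighten(int amount) override
    {
        // Tight, homogeneous loop over contiguous pixels: free to change the
        // representation, vectorize, or process several pixels at a time.
        for (std::uint8_t& p : m_pixels)
            p = static_cast<std::uint8_t>(std::min(255, p + amount));
    }

private:
    std::vector<std::uint8_t> m_pixels;  // contiguous, no per-pixel vptr
};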

What if, instead of objects storing their own data members, they only store a reference to the collections where their data members are stored sequentially, each in its own container, and their member methods return data from those containers? This way the odds of unneeded data ending up on its way to the CPU should be reduced, and the odds of data needed in the near "future" being there should be increased.

I like this idea, but you can kind of work your way back towards custom memory allocators and sorting base pointers if you take it too far in a polymorphic context. Where I often find use for this kind of design is to improve the convenience of working with a single element in cases where it needs to be aggregated for efficiency (one case is aggregation into a container using an SoA representation, another I'll cover below).

Polymorphic cases don't necessarily benefit that much because the inherent problem lies with non-homogeneous processing of granular things one-at-a-time. To restore efficiency we have to restore homogeneity for the critical loops.

Non-Homogeneous Critical Loops

Take an example where Orc inherits Creature, Human inherits Creature, and Elf inherits Creature, but humans, orcs and elves have different sizes/fields, different alignment requirements, and different vtables. In that case, the client code wants to process a non-homogeneous list of them, storing polymorphic base pointers to creatures like so:

for each creature in creatures:
     creature.do_something();

... as opposed to this which would sacrifice polymorphism:

for each orc in orcs:
     orc.do_something();
for each human in humans:
     human.do_something();
for each elf in elves:
     elf.do_something();

... which would be a real PITA to extend if we need to do this in many places each time we introduce a new type of creature...

... then if we want to keep a polymorphic solution but still process each creature one-at-a-time in a non-homogeneous fashion, we still end up losing temporal and spatial locality, regardless of whether each creature stores a back-pointer to a container or not. We lose temporal locality on the vtables, since we might be accessing one vtable in one iteration and another vtable in the next. The memory access patterns here can also be random and sporadic, leading to a loss of spatial locality, so we end up with cache misses galore.

So the solution to me in this case, if you want inheritance and polymorphism, is to abstract at the container level: Orcs inherit Creatures, Humans inherit Creatures, Elves inherit Creatures. This transfers some additional complexity to client code when it wants to express operations to perform on a specific creature, but now the above sequential loop can be written like this:

for each creatures in creature_types:
     creatures.do_something();

At which point, in the first iteration, maybe that's doing something to an entire list of orcs (which might be a million orcs stored in an array). Now all the orcs in that list can be stored contiguously, and we're applying homogeneous functionality to all the orcs in that list. We have a boatload of breathing room to optimize in that case without changing the design.

We still have a non-homogeneous loop here utilizing polymorphism, but it is now so much cheaper, since we're only paying the overhead for an entire container of creatures, not for every single individual creature. The loops processing individual creatures are now homogeneous. This is analogous to using an abstract IImage to, say, brighten a bunch of images (a bunch of containers of pixels), instead of doing that to one abstract pixel object implementing IPixel at a time.
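Something along these lines, as a sketch (Orc's fields and what do_something does are made up purely for illustration):

#include <memory>
#include <vector>

class Creatures
{
public:
    virtual ~Creatures() = default;
    virtual void do_something() = 0;  // applies to the whole collection
};

struct Orc { int health; /* other orc-specific fields */ };

class Orcs : public Creatures
{
public:
    void do_something() override
    {
        // Homogeneous loop over contiguously stored orcs.
        for (Orc& orc : m_orcs)
            orc.health += 1;
    }

private:
    std::vector<Orc> m_orcs;  // possibly a million orcs, stored contiguously
};

// The non-homogeneous loop now pays one dispatch per container, not per creature.
void update_all(std::vector<std::unique_ptr<Creatures>>& creature_types)
{
    for (auto& creatures : creature_types)
        creatures->do_something();
}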

Homogeneous Loops and Representations

So that shifts the heavy-lifting critical loops away from non-homogeneous loops processing all kinds of different data one-at-a-time from all over the place and moves them towards homogeneous loops processing homogeneous data stored contiguously.

And this is the general strategy I apply to interface design. If it's prone to yield hotspots in ways that are difficult to optimize, then the inherent problem, as I see it, is that the interface has been designed at too granular a level (Creature, not Creatures).

So that's the way I see to approach this problem if the desire is to utilize OOP. Where I think your type of design idea can be useful is in simplifying cases where the client code must express an operation that applies to just one specific creature, at which point it can work through some kind of proxy object which points back to the container and maybe stores an index or pointer to the specific entry to make that convenient, like a CreatureHandle which references a specific entry in one of the abstract Creatures containers.
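A rough standalone sketch of such a handle, assuming the abstract Creatures container exposes per-entry accessors (those accessors are my assumption for the example, not a fixed design):

#include <cstddef>

class Creatures
{
public:
    virtual ~Creatures() = default;
    virtual int  health(std::size_t index) const = 0;
    virtual void damage(std::size_t index, int amount) = 0;
};

class CreatureHandle
{
public:
    CreatureHandle(Creatures& container, std::size_t index)
        : m_container(&container), m_index(index) {}

    // Convenient single-creature interface that forwards to the container.
    int  health() const     { return m_container->health(m_index); }
    void damage(int amount) { m_container->damage(m_index, amount); }

private:
    Creatures*  m_container;  // which abstract container owns this creature
    std::size_t m_index;      // which entry within it
};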

Saudra answered 23/12, 2017 at 10:16 Comment(0)
