C++ Classes for High Performance Computing
According to this Quora forum,

One of the simplest rules of thumb is to remember that hardware loves arrays, and is highly optimized for iteration over arrays. A simple optimization for many problems is just to stop using fancy data structures and just use plain arrays (or std::vectors in C++). This can take some getting used to.

Are C++ classes one of those "fancy data structures," i.e. a kind of data type that can be replaced by arrays to achieve a higher performance in a C++ program?

Nonflammable answered 1/9, 2020 at 17:1 Comment(2)
"Are C++ classes one of those "fancy data structures,"" - A std::vector is a class and it's not "fancy" and explicitly mentioned as something to use, so no. You cannot just generalize like that. A class (or struct - same thing) can be fancy/complicated or completely trivial.; It all depends on what's in it / how it is implemented. Just it being a class means nothing on its own.Noonday
Edited my answer with more info.Clarkclarke

If your class looks like this:

struct Person {
  double age;
  double income;
  size_t location;
};

then you might benefit from rearranging to

std::vector<double> ages;
std::vector<double> incomes;
std::vector<size_t> locations;

But it depends on your access patterns. If you frequently access multiple fields of a person at a time, then keeping those fields blocked together in a single struct makes sense.

If your class looks like this:

struct Population {
  std::vector<double> many_ages;
  std::vector<double> many_incomes;
  std::vector<size_t> many_locations;
};

then you're using the form your resource recommends. Scanning any one of these arrays on its own is faster than scanning the same field across a vector of the first struct, but pulling all three fields of one person together is probably slower with this second layout.
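
To make the trade-off concrete, here is a minimal sketch (the averaging task and function names are just illustrative assumptions, not from the answer above): a field-at-a-time computation favours the second, struct-of-arrays layout, while code that needs a whole person at once favours the first.

#include <cstddef>
#include <vector>

struct Person {
  double age;
  double income;
  std::size_t location;
};

// Array of structs: all fields of one person sit together in memory, so
// reading only the ages still drags income and location through the cache.
double average_age_aos(const std::vector<Person>& people) {
  double sum = 0.0;
  for (const Person& p : people) sum += p.age;
  return people.empty() ? 0.0 : sum / people.size();
}

// Struct of arrays: iterating over one field touches only that field's
// contiguous array, so every cache line fetched is full of useful data.
double average_age_soa(const std::vector<double>& ages) {
  double sum = 0.0;
  for (double a : ages) sum += a;
  return ages.empty() ? 0.0 : sum / ages.size();
}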

Ultimately, you should structure your code to be as clean and intuitive as possible. The biggest source of speed will be a strong understanding and appropriate use of algorithms, not the memory layout. I recommend disregarding memory-layout tricks unless you already have strong HPC skills and need to squeeze maximum performance out of your machine. In almost every other case your development time and sanity are worth far more than a few saved clock cycles.

More broadly

  1. An interesting paper related to this is SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems. A lot of work has gone into mapping ML algorithms to GPUs, and for ML applications getting the memory layout right does make a real difference, since so much time is spent on training and GPUs are optimized specifically for contiguous-array processing. But the authors contend that even here, if you understand the algorithms well, you can beat specialized hardware and its optimized memory layouts; they demonstrate this by training 3.5x faster on their CPU than on their GPU.

  2. More broadly, your question deals with the idea of cache misses. Since a cache miss is roughly 200x more expensive than an L1 reference (link), laying your data out to match your computation can really save time. However, as the above suggests, it is rarely the case that simply rearranging your data magically makes everything faster. Consider matrix multiplication. It is a perfect example because the data is laid out in plain contiguous arrays, exactly as your resource requests. Yet for a simple triple-loop GEMM there are still 6 ways to order the loops; some orderings are much more efficient than others (see the sketch after this list), but none of them gets you anywhere near peak performance. Read through this step-by-step explanation of matmult to get a better sense of all the algorithmic optimizations needed to reach good performance.
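
To illustrate the loop-ordering point, here is a minimal sketch of two of the six orderings for a row-major, square, triple-loop matmult (an assumed illustration, not the linked explanation's code): both compute the same result, but the second usually runs noticeably faster, and neither comes close to peak.

#include <algorithm>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // row-major, n*n elements

// i-j-k order: the innermost loop strides down a column of b, so consecutive
// reads of b are n doubles apart -- poor spatial locality.
void matmul_ijk(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j) {
      double sum = 0.0;
      for (std::size_t k = 0; k < n; ++k)
        sum += a[i * n + k] * b[k * n + j];
      c[i * n + j] = sum;
    }
}

// i-k-j order: the innermost loop walks rows of b and c sequentially, which is
// usually much friendlier to the cache -- yet still far from peak performance
// without blocking/tiling, vectorization, and so on.
void matmul_ikj(const Matrix& a, const Matrix& b, Matrix& c, std::size_t n) {
  std::fill(c.begin(), c.end(), 0.0);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t k = 0; k < n; ++k) {
      const double aik = a[i * n + k];
      for (std::size_t j = 0; j < n; ++j)
        c[i * n + j] += aik * b[k * n + j];
    }
}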

What the above should demonstrate is that even in situations where we have only a few arrays laid out exactly as your resource suggests, the layout alone doesn't give us speed. Good algorithms do. Data layout considerations, if any, flow from the algorithms we choose and from higher-level hardware constraints.

If this is true even for simple arrays and operations like matrix multiplication, you should expect it to be true for "fancy data structures" as well.

Clarkclarke answered 1/9, 2020 at 17:15 Comment(0)

Are C++ classes one of those "fancy data structures,"

I think they are referring in particular to containers like std::map, std::deque, std::list etc, that hold data in many different heap allocations and therefore iterating over the contents of the container requires the CPU to "hop around" in the RAM address space to some extent, rather than just reading RAM sequentially. It's that hopping-around that often limits performance, as the CPU's on-board memory caches are less effective at avoiding execution-stalls due to RAM latency when future RAM access locations aren't easily predictable.

A C++ class on its own may or may not encourage non-sequential RAM access; whether it does or not depends entirely on how the class was implemented (and in particular on whether it is holding its data via multiple heap allocations). The std::vector class (mentioned in the forum text) is an example of a C++ class that does not require any non-sequential memory accesses as you iterate across its contents.
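
As an illustration (an assumed example, not part of the original answer): the same summation over a std::list and a std::vector is algorithmically identical, but the list version chases pointers across separate heap allocations while the vector version scans contiguous memory.

#include <list>
#include <numeric>
#include <vector>

// Same logical operation, very different memory behaviour.
long long sum_list(const std::list<int>& xs) {
  // Each node is a separate heap allocation; following the next-pointers hops
  // around the address space, so caches and the prefetcher help far less.
  return std::accumulate(xs.begin(), xs.end(), 0LL);
}

long long sum_vector(const std::vector<int>& xs) {
  // Elements are contiguous; iteration is a sequential scan that the caches
  // and the hardware prefetcher handle very well.
  return std::accumulate(xs.begin(), xs.end(), 0LL);
}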

Surovy answered 1/9, 2020 at 17:5 Comment(0)

Are C++ classes one of those "fancy data structures,"

A C++ class is a construct that can be used to create a data type. It can be used to build data structures such as lists, queues, etc.

i.e. a kind of data type

A class is a data type.

that can be replaced by arrays

A class and array are not interchangeable. Arrays are data structures. You are comparing apples with oranges.

to achieve a higher performance in a C++ program?

That depends on how you implement your class.

Calutron answered 1/9, 2020 at 17:16 Comment(0)

Are C++ classes one of those "fancy data structures," i.e. a kind of data type that can be replaced by arrays to achieve a higher performance in a C++ program?

Both computer time and your development time are valuable.

Don't optimize code unless you are sure it is taking most of the CPU time.

So first use a profiler (e.g. Gprof) and read the documentation of your C or C++ compiler (e.g. GCC). Compilers are capable of fancy optimizations.

If you really care about HPC, learn about GPGPU programming with e.g. OpenCL or OpenACC.

If you happen to use Linux (a common OS in the HPC world), read Advanced Linux Programming, then syscalls(2), then time(7).

Apprehension answered 1/9, 2020 at 17:15 Comment(0)
