I am trying to understand data-oriented design through a simple, specific problem. Apologies in advance to data-oriented design practitioners if I am doing something very stupid, but I am having a hard time understanding why and where my reasoning fails.
Assume that I have a simple operation, i.e., `float_t result = float_t(lhs) / rhs`, where `lhs` and `rhs` are of type `int_t`. If I keep all of the variables in their corresponding containers, e.g., `std::vector<float_t>` and `std::vector<int_t>`, and I use `std::transform`, I get the correct result. Then, for a specific example where `using float_t = float` and `using int_t = int16_t`, I assume that packing these variables inside a `struct`, on a 64-bit architecture, and collecting them inside a container should yield better performance.

I reason that the `struct` makes up a 64-bit object, and a single memory access to the `struct` will give me all the variables I need. On the other hand, when all these variables are kept in different containers, I will need three different memory accesses to get the information needed. Below is how I set up the environment:
```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <functional> // for std::divides
#include <iostream>
#include <vector>

using namespace std::chrono;

template <class float_t, class int_t> struct Packed {
  float_t sinvl;
  int_t s, l;
  Packed() = default;
  Packed(float_t sinvl, int_t s, int_t l) : sinvl{sinvl}, s{s}, l{l} {}
  void comp() { sinvl = float_t(l) / s; }
};

using my_float = float;
using my_int = int16_t;

int main(int argc, char *argv[]) {
  constexpr uint32_t M{100};
  for (auto N : {1000, 10000, 100000}) {
    double t1{0}, t2{0};
    for (uint32_t m = 0; m < M; m++) {
      std::vector<my_float> sinvl(N, 0.0);
      std::vector<my_int> s(N, 3), l(N, 2);
      std::vector<Packed<my_float, my_int>> p1(
          N, Packed<my_float, my_int>(0.0, 3, 2));
      // benchmark unpacked
      auto tstart = high_resolution_clock::now();
      std::transform(l.cbegin(), l.cend(), s.cbegin(), sinvl.begin(),
                     std::divides<my_float>{}); // 3 different memory accesses
      auto tend = high_resolution_clock::now();
      t1 += duration_cast<microseconds>(tend - tstart).count();
      if (m == M - 1)
        std::cout << "sinvl[0]: " << sinvl[0] << '\n';
      // benchmark packed
      tstart = high_resolution_clock::now();
      for (auto &elem : p1) // 1 memory access
        elem.comp();
      tend = high_resolution_clock::now();
      t2 += duration_cast<microseconds>(tend - tstart).count();
      if (m == M - 1)
        std::cout << "p1[0].sinvl: " << p1[0].sinvl << '\n';
    }
    std::cout << "N = " << N << ", unpacked: " << (t1 / M) << " us.\n";
    std::cout << "N = " << N << ", packed: " << (t2 / M) << " us.\n";
  }
  return 0;
}
```
The code, compiled with `g++ -O3`, yields on my machine:

```
sinvl[0]: 0.666667
p1[0].sinvl: 0.666667
N = 1000, unpacked: 0 us.
N = 1000, packed: 1 us.
sinvl[0]: 0.666667
p1[0].sinvl: 0.666667
N = 10000, unpacked: 5.06 us.
N = 10000, packed: 12.97 us.
sinvl[0]: 0.666667
p1[0].sinvl: 0.666667
N = 100000, unpacked: 52.31 us.
N = 100000, packed: 124.49 us.
```
Basically, `std::transform` beats the packed access by 2.5x. I would appreciate it if you could help me understand this behaviour. Is the result due to

- me not grasping the data-oriented design principles correctly, or,
- some artefact of this very simple example, such as the memory locations being allocated very close to each other and in some way being optimized very efficiently by the compiler?

Finally, is there a way to beat `std::transform` in this example, or is it simply good enough to be a go-to solution? I am not an expert in either compiler optimizations or data-oriented design, and thus I could not answer this question myself.
Thanks!
EDIT. I have changed the way I test both methods, as per @bolov's suggestion in the comments. Now the code looks like:
```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <functional> // for std::divides
#include <iostream>
#include <string> // for std::stoul
#include <vector>

using namespace std::chrono;

template <class float_t, class int_t> struct Packed {
  float_t sinvl;
  int_t s, l;
  Packed() = default;
  Packed(float_t sinvl, int_t s, int_t l) : sinvl{sinvl}, s{s}, l{l} {}
  void comp() { sinvl = float_t(l) / s; }
};

using my_float = float;
using my_int = int16_t;

int main(int argc, char *argv[]) {
  uint32_t N{1000};
  double t{0};
  if (argc == 2)
    N = std::stoul(argv[1]);
#ifndef _M_PACKED
  std::vector<my_float> sinvl(N, 0.0);
  std::vector<my_int> s(N, 3), l(N, 2);
  // benchmark unpacked
  auto tstart = high_resolution_clock::now();
  std::transform(l.cbegin(), l.cend(), s.cbegin(), sinvl.begin(),
                 std::divides<my_float>{}); // 3 different memory accesses
  auto tend = high_resolution_clock::now();
  t += duration_cast<microseconds>(tend - tstart).count();
  std::cout << "sinvl[0]: " << sinvl[0] << '\n';
  std::cout << "N = " << N << ", unpacked: " << t << " us.\n";
#else
  std::vector<Packed<my_float, my_int>> p1(
      N, Packed<my_float, my_int>(0.0, 3, 2));
  // benchmark packed
  auto tstart = high_resolution_clock::now();
  for (auto &elem : p1) // 1 memory access
    elem.comp();
  auto tend = high_resolution_clock::now();
  t += duration_cast<microseconds>(tend - tstart).count();
  std::cout << "p1[0].sinvl: " << p1[0].sinvl << '\n';
  std::cout << "N = " << N << ", packed: " << t << " us.\n";
#endif
  return 0;
}
```
with the corresponding shell (fish) script:

```fish
g++ -Wall -std=c++11 -O3 transform.cpp -o transform_unpacked.out
g++ -Wall -std=c++11 -O3 transform.cpp -o transform_packed.out -D_M_PACKED
for N in 1000 10000 100000
  echo "Testing unpacked for N = $N"
  ./transform_unpacked.out $N
  ./transform_unpacked.out $N
  ./transform_unpacked.out $N
  echo "Testing packed for N = $N"
  ./transform_packed.out $N
  ./transform_packed.out $N
  ./transform_packed.out $N
end
```
which gives the following:

```
Testing unpacked for N = 1000
sinvl[0]: 0.666667
N = 1000, unpacked: 0 us.
sinvl[0]: 0.666667
N = 1000, unpacked: 0 us.
sinvl[0]: 0.666667
N = 1000, unpacked: 0 us.
Testing packed for N = 1000
p1[0].sinvl: 0.666667
N = 1000, packed: 1 us.
p1[0].sinvl: 0.666667
N = 1000, packed: 1 us.
p1[0].sinvl: 0.666667
N = 1000, packed: 1 us.
Testing unpacked for N = 10000
sinvl[0]: 0.666667
N = 10000, unpacked: 5 us.
sinvl[0]: 0.666667
N = 10000, unpacked: 5 us.
sinvl[0]: 0.666667
N = 10000, unpacked: 5 us.
Testing packed for N = 10000
p1[0].sinvl: 0.666667
N = 10000, packed: 17 us.
p1[0].sinvl: 0.666667
N = 10000, packed: 13 us.
p1[0].sinvl: 0.666667
N = 10000, packed: 13 us.
Testing unpacked for N = 100000
sinvl[0]: 0.666667
N = 100000, unpacked: 64 us.
sinvl[0]: 0.666667
N = 100000, unpacked: 66 us.
sinvl[0]: 0.666667
N = 100000, unpacked: 66 us.
Testing packed for N = 100000
p1[0].sinvl: 0.666667
N = 100000, packed: 180 us.
p1[0].sinvl: 0.666667
N = 100000, packed: 198 us.
p1[0].sinvl: 0.666667
N = 100000, packed: 177 us.
```
I hope I have understood the proper testing method correctly. Still, though, the difference is 2-3x.
Comments:

- … `N`, in this example, `M` times and get the mean of the timings later on? If so, I will do that and report the new timings. I hope this explains the huge difference, that is, 2.5x. Thanks for the hint! – Netsuke
- In the `std::transform` case, the compiler is able to emit SIMD instructions. Note: your solution should not yield better performance. Cache is able to provide the data in the same time for both solutions in your case. If you've used more data streams (more than cache ways), then it could be the case that the `std::transform` variant could be slower. – Cobden
- In the `Packed.comp` case, it has to move data around to be able to use SIMD division. Maybe that's the cause of the difference. (But on my computer, with clang 6, the difference is only 60%, not 2x-3x.) – Cobden
- … `nm a.out | c++filt | grep comp` to see if the function call gets optimized out (which it does), but I do not know how to observe if that thing has to move data around. Want to provide an answer to the question that I can select? – Netsuke
- … `-mno-sse`. This will disable SSE instructions (SIMD). It will be interesting to see if this way packed is faster. – Lyall
- `g++` gives `error: SSE register return with SSE disabled`, whereas `clang++` compiles properly. However, the output of the `clang++ -mno-sse` compiled code is garbage --- I see `nan`s and zeros. – Netsuke
- … `-mno-sse` tricky to use on this architecture. – Illyrian
- `std::transform` is traversing some medium to large vectors (two input, one output) sequentially, and modern CPUs can easily predict sequential accesses and pre-fetch the corresponding cache lines. There is a limit to the number of vectors, of course, but it's certainly no less than three… – Illyrian
- … `for` loop based on the same `idx` accessing different memory locations, or (2) packing them in `struct`s like this and using a single memory access. This boils down to, I assume, what @Cobden was referring to --- SoA vs AoS. However, apparently, this minimal working example is too simple for these purposes :) Thanks, again, for the second explanation on access predictions of CPUs. – Netsuke