The short question: I have a function that takes two vectors, one input and one output (they do not alias). If I can align only one of them, which one should I choose?
The longer version: consider the function
    #include <immintrin.h>
    #include <cstddef>

    void func(std::size_t n, void *in, void *out)
    {
        __m256i *in256 = reinterpret_cast<__m256i *>(in);
        __m256i *out256 = reinterpret_cast<__m256i *>(out);
        while (n >= 32) {
            // unaligned 32-byte load; works for any address
            __m256i data = _mm256_loadu_si256(in256++);
            // process data
            _mm256_storeu_si256(out256++, data);
            n -= 32;
        }
        // process the remaining n % 32 bytes;
    }
If `in` and `out` are both 32-byte aligned, then there is no penalty for using `vmovdqu` instead of `vmovdqa`. The worst case is that both are unaligned, in which case every other load and store crosses a cache-line boundary (assuming 64-byte cache lines).
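How often a misaligned stream splits a cache line can be checked with simple address arithmetic. The sketch below is only an illustration: `count_line_splits` is a hypothetical helper, and the 64-byte line size is an assumption that holds on common x86 parts but is not stated in the question.

```cpp
#include <cassert>
#include <cstddef>

// Counts how many of `count` consecutive accesses of `width` bytes, starting
// at byte offset `start`, straddle a boundary between `line`-byte cache lines.
// `line` defaults to 64, which is an assumption about the target CPU.
std::size_t count_line_splits(std::size_t start, std::size_t width,
                              std::size_t count, std::size_t line = 64)
{
    assert(width > 0 && width <= line);
    std::size_t splits = 0;
    for (std::size_t i = 0; i < count; ++i) {
        std::size_t first = start + i * width;      // first byte touched
        std::size_t last = first + width - 1;       // last byte touched
        if (first / line != last / line) ++splits;  // two different lines: a split
    }
    return splits;
}
```

Calling it with `width = 32` and then `width = 16` for a misaligned `start` shows how the split rate depends on the vector width relative to the line size.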
In this case, I can align one of them to the cache-line boundary by processing a few elements before entering the loop. The question is which one to choose: between an unaligned load and an unaligned store, which is worse?
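For concreteness, the "process a few elements first" step could look roughly like the sketch below. It aligns `out` purely for illustration and does not claim that this is the better choice; the same prologue works for `in`. The names `func_aligned_out` and `process_byte` are hypothetical, the placeholder processing is the identity, and the `target` attribute is a GCC/Clang-specific assumption so that the function builds without `-mavx`.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar stand-in for "process data"; here it is the identity.
static inline unsigned char process_byte(unsigned char b) { return b; }

// GCC/Clang-specific: enables AVX for this one function. When compiling with
// -mavx (or /arch:AVX on MSVC) the attribute can be dropped.
__attribute__((target("avx")))
void func_aligned_out(std::size_t n, const void *in, void *out)
{
    const unsigned char *src = static_cast<const unsigned char *>(in);
    unsigned char *dst = static_cast<unsigned char *>(out);

    // Prologue: bytes needed to bring dst up to the next 32-byte boundary.
    std::size_t head = (32 - reinterpret_cast<std::uintptr_t>(dst) % 32) % 32;
    if (head > n) head = n;
    for (std::size_t i = 0; i < head; ++i) dst[i] = process_byte(src[i]);
    src += head; dst += head; n -= head;

    // Either nothing is left, or dst is now 32-byte aligned.
    assert(n == 0 || reinterpret_cast<std::uintptr_t>(dst) % 32 == 0);

    // Main loop: aligned stores, loads from wherever src happens to be.
    while (n >= 32) {
        __m256i data = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(src));
        // process data (vector equivalent of process_byte)
        _mm256_store_si256(reinterpret_cast<__m256i *>(dst), data);
        src += 32; dst += 32; n -= 32;
    }

    // Epilogue: the remaining n % 32 bytes.
    for (std::size_t i = 0; i < n; ++i) dst[i] = process_byte(src[i]);
}
```

Aligning `in` instead would mean computing `head` from `src` and swapping which side uses the aligned intrinsic (`_mm256_load_si256` for the load, `_mm256_storeu_si256` for the store).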
`double x; double *p = &x; double y = p[1ULL << 64];` will almost certainly cause a segmentation fault. So why is it safe to load past the end of the vector? – Roundshouldered
... `p & 0xF`), and then always read the full alignment/register size (i.e. 16 bytes) from there, then you are not only correctly aligned, but you are also guaranteed never to page fault by exceeding the start or end bounds, given any valid number of bytes `c`. This is because each accessible byte must exist in some fully valid 4K page. But note that right away with the first read, the data of interest may not be "aligned" in the register. – Esprit