Unaligned load versus unaligned store

About

Asked 1/12, 2016 at 20:27 Answered 9/1, 2017 at 13:38

The short question is that if I have a function that takes two vectors. One is input and the other is output (no alias). I can only align one of them, which one should I choose?

The longer version is that, consider a function,

void func(size_t n, void *in, void *out)
{
    __m256i *in256 = reinterpret_cast<__m256i *>(in);
    __m256i *out256 = reinterpret_cast<__m256i *>(out);
    while (n >= 32) {
         __m256i data = _mm256_loadu_si256(in256++);
         // process data
         _mm256_storeu_si256(out256++, data);
         n -= 32;
    }
    // process the remaining n % 32 bytes;
}

If in and out are both 32-bytes aligned, then there's no penalty of using vmovdqu instead of vmovdqa. The worst case scenario is that both are unaligned, and one in four load/store will cross the cache-line boundary.

In this case, I can align one of them to the cache line boundary by processing a few elements first before entering the loop. However, the question is which should I choose? Between unaligned load and store, which one is worse?

Roundshouldered answered 1/12, 2016 at 20:27 Comment(12)

Have a look at some memcpy implementations; I think there is a usual way, but I forget which it is. Although maybe it depends what you're doing. Aligned loads will avoid cache-line boundaries, so no load-use latency penalties (not very relevant if the pointer increment is predictable, since OOO can have the load addresses ready far ahead of the rest of the loop). Since reading outside an object is often safe, but writing isn't, that can maybe affect the decision if you can avoid a full scalar version for a cleanup loop. – Agonistic 2/12, 2016 at 1:3

I ran a few tests on this a while back, and determined that, at least on the processors I tested on (Pentium 4, Core 2, Sandy Bridge, and Haswell), aligning the input vector was noticeably faster than aligning the output vector. Your mileage may vary. I don't feel comfortable posting this as an answer because I no longer have the test code, don't feel like writing it and running the tests again, and don't have an official reference to point to in any sort of documentation. So have an upvote instead! :-) – Fizz 2/12, 2016 at 14:8

@CodyGray Thanks anyway. I have been working on some tests of this problem. So far what I can tell is only that "it depends" – Roundshouldered 2/12, 2016 at 18:34

@PeterCordes Intuitively I'd guess it's the store that should be aligned. On Haswell, for instance, the read bandwidth is double the store bandwidth, so the processor can fetch the two parts of a misaligned block simultaneously with the writeout of a previous block. And on Haswell misalignment penalties have been nearly eliminated. – Bohannon 4/12, 2016 at 20:25

@PeterCordes Could you please elaborate on "reading outside an object is often safe". Would it be possible, if not always, to trigger a segment fault? And thus the processor cannot assume that it is safe to start the next read before the condition check (such as size check) at the end of loop finish? – Roundshouldered 4/12, 2016 at 21:22

@YanZhou What Peter is saying is that an implementation that overwrites some data after the end of the destination buffer risks trashing important data. Even if it repairs the damage, it's possible that another thread saw a corrupt value momentarily. In general, the act of reading anything is less dangerous than the act of writing it, because one doesn't alter the state of said thing. – Bohannon 5/12, 2016 at 0:39

@IwillnotexistIdonotexist Thanks for the comment. I understand why it is dangerous to write pass border and it is less dangerous so to read. What I did not understand is that, can a load also be dangerous? For example, what happens when the load not only pass the border of the object, but also the memory allocated to the program, or the stack. For example, in C++, double x; double *p = &x; double y = p[1ULL << 64]; will almost certainly create a segment fault. So why it is safe to load pass the end of the vector? – Roundshouldered 5/12, 2016 at 1:1

@YanZhou Yes, a load can be dangerous when it crosses a page boundary (on x86, 4KB-aligned addresses). If the next page is unmapped, a read will segfault. – Bohannon 5/12, 2016 at 1:3

@IwillnotexistIdonotexist Thanks for the clarification. – Roundshouldered 5/12, 2016 at 1:6

@CodyGray I've found the opposite to be true when the data is large enough to use non-temporal stores. So you align the output buffer to use NT-stores. And read from an unaligned input buffer. – Kaleidoscope 5/12, 2016 at 19:53

That is very possibly the case for non-temporal stores. I did not test that at all. Which just goes to show how complicated of a question this is. Writing a really good answer won't be as simple as concocting a straightforward test case and pasting in your results. – Fizz 6/12, 2016 at 6:12

@IwillnotexistIdonotexist Given an arbitrary pointer, if you round down from the given start address to find the initial alignment (i.e., p & 0xF), and then always read the full alignment/register size (i.e. 16 bytes) from there, then you are not only correctly aligned, but you are also guaranteed never to page fault by exceeding the start- or end-bounds, given any valid number of bytes c. This is because each accessible byte must exist in some fully-valid 4K page. But note that right away with the first read, the data of interest may not be "aligned" in the register. – Esprit 11/3, 2023 at 8:49

Risking to state the obvious here: There is no "right answer" except "you need to benchmark both with actual code and actual data". Whichever variant is faster strongly depends on the CPU you are using, the amount of calculations you are doing on each package and many other things.

As noted in the comments, you should also try non-temporal stores. What also sometimes can help is to load the input of the following data packet inside the current loop, i.e.:

__m256i next =  _mm256_loadu_si256(in256++);
for(...){
    __m256i data = next; // usually 0 cost
    next = _mm256_loadu_si256(in256++);
    // do computations and store data
}

If the calculations you are doing have unavoidable data latencies, you should also consider calculating two packages interleaved (this uses twice as many registers though).

Subject answered 9/1, 2017 at 13:38 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags