The motivation for this question
The unaligned load is the more commonly used of the two; the aligned SIMD load is meant for cases where the address is already known to be aligned. So I started to wonder whether there is any performance difference between these two function calls on an already aligned address. My intuitive guess was that the aligned load would be faster than the unaligned load.
I know the answer can be very hardware-dependent. Another motivation is that Zen4 is the first AMD microarchitecture to offer AVX-512, so I wanted to try some AVX-512 code on Zen4 and see the results.
The benchmark code and the assembly
The code: https://godbolt.org/z/W3qvcjGWs
I benchmark with two cases:
- The first case: I ensure that the accessed memory (data) is smaller than the L1 cache, so there are no cache misses and the benchmark is not memory-bound.
- The second case: the accessed memory is much larger than the cache.
The only difference between the function calls in assembly: vmovdqa64 and vmovdqu64.
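For readers who do not want to open the link, here is a minimal sketch of the kind of pair being compared. It is not copied from the benchmark above: the names is_aligned_64 and sum_i32 are hypothetical, the summing kernel is only an assumed stand-in for the real workload, and the scalar branch exists just so the snippet also builds on a machine without AVX-512. The compiler is free to choose the actual instruction (it may even fold the load into the add), so the mnemonics in the comments are what one would typically expect, not a guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical helper: true when p sits on a 64-byte boundary, which is
// what the aligned 512-bit load requires.
inline bool is_aligned_64(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}

// Hypothetical kernel: sums n 32-bit integers, with n a multiple of 16.
// Aligned = true uses the aligned load, Aligned = false the unaligned one.
template <bool Aligned>
std::int32_t sum_i32(const std::int32_t* data, std::size_t n) {
    assert(n % 16 == 0);
    if (Aligned) assert(is_aligned_64(data));
#if defined(__AVX512F__)
    __m512i acc = _mm512_setzero_si512();
    for (std::size_t i = 0; i < n; i += 16) {
        __m512i v;
        if (Aligned)
            v = _mm512_load_si512(data + i);   // typically vmovdqa64
        else
            v = _mm512_loadu_si512(data + i);  // typically vmovdqu64
        acc = _mm512_add_epi32(acc, v);
    }
    return _mm512_reduce_add_epi32(acc);
#else
    // Scalar fallback (an assumption of this sketch) so the code still
    // compiles and runs when AVX-512 is not enabled.
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) acc += data[i];
    return acc;
#endif
}
```

Calling sum_i32<true>(buf, n) and sum_i32<false>(buf, n) on the same 64-byte-aligned buffer (for example one declared with alignas(64)) gives the two variants in question. With GCC or Clang, the AVX-512 path is enabled by building with -mavx512f.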
The result
My experiment was conducted on AMD Zen4. I ran the benchmark ten times. The results are consistent: the two function calls are equally fast, which goes against my intuition. If that is true, then there is no use case for the aligned load, which only applies in a narrow scenario and leads to a segfault on an unaligned address.