Is there any performance difference between AVX-512 `_mm512_load_epi64` and `_mm512_loadu_epi64`?

The motivation for this question

The unaligned load is generally the more common choice; the aligned SIMD load is normally reserved for addresses that are known to be aligned. So I started to wonder whether there is any performance difference between these two calls when the address is already aligned. The intuitive guess is that the aligned load should be faster than the unaligned load.

I do know this question can be very hardware-dependent. Another motivation is that Zen 4 is the first AMD microarchitecture to offer AVX-512, so I wanted to try some AVX-512 on Zen 4 and see the results.

The benchmark code and the assembly

The code: https://godbolt.org/z/W3qvcjGWs

I benchmark with two cases:

  • The first case: the accessed data is smaller than the L1 cache, so there are no cache misses and the loop is not memory-bound.
  • The second case: the accessed memory is much larger than the cache.

The only difference between the two calls in the generated assembly is the load instruction: vmovdqa64 vs. vmovdqu64 (a sketch of the kind of loop involved is below).
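
For reference, here is a minimal sketch of the kind of loop being compared. The function names and the sum operation are my own illustration, not the exact code from the Godbolt link:

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Sum 64-bit integers from a buffer. The only difference between the two
// functions is the load intrinsic, which compiles to vmovdqa64 vs. vmovdqu64.
__m512i sum_aligned(const std::int64_t* p, std::size_t n) {
    __m512i acc = _mm512_setzero_si512();
    for (std::size_t i = 0; i < n; i += 8)
        acc = _mm512_add_epi64(acc, _mm512_load_epi64(p + i));   // alignment-required
    return acc;
}

__m512i sum_unaligned(const std::int64_t* p, std::size_t n) {
    __m512i acc = _mm512_setzero_si512();
    for (std::size_t i = 0; i < n; i += 8)
        acc = _mm512_add_epi64(acc, _mm512_loadu_epi64(p + i));  // no alignment requirement
    return acc;
}
```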

The result

My experiment was conducted on AMD Zen 4. I ran the benchmark ten times. The results are consistent: the two function calls are equally fast. That goes against my intuition. If it is true, then there seems to be little use for the alignment-required load, which covers only a narrow scenario and seg-faults on an unaligned address.

Arianearianie answered 13/12, 2022 at 13:5 Comment(0)

No downside on aligned data

If the memory is aligned at run time (so the alignment-required load works instead of faulting), performance is identical. Intel and AMD CPUs have been this way since Nehalem and K10. (Or since Bulldozer for movdqu stores also running the same as alignment-required stores when the data is aligned.)

This hasn't changed with 512-bit vectors: _mm512_loadu_si512 is fast on aligned data.

If your data is usually aligned, unaligned loads are an excellent choice: no overhead from extra instructions checking alignment in the common case, and the hardware handles the rare misaligned cases without a large penalty. If you want to fail noisily so you can detect whether data is ever misaligned, use aligned loads (and compile with GCC or clang, not MSVC or ICC, which never use alignment-required loads/stores in the asm).

To actually test for misalignment, you may need to disable optimization (at huge performance cost), because the compiler can fold your load intrinsics into memory source operands for other instructions, which never require alignment. Verifying alignment expectations is the use-case for vmovdqa64 and friends.
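
As a sketch of that idea (the load_checked helper is hypothetical, not from this answer): assert the expected 64-byte alignment in debug builds, and use the alignment-required load so the hardware can still fault when the intrinsic isn't folded into a memory operand:

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>

// Hypothetical helper: in debug builds, assert the 64-byte alignment we expect.
// On aligned data the load itself performs the same either way.
static inline __m512i load_checked(const std::int64_t* p) {
    assert(reinterpret_cast<std::uintptr_t>(p) % 64 == 0 && "expected 64-byte alignment");
    return _mm512_load_si512(p);   // vmovdqa64: faults on misalignment, unless
                                   // folded into a memory source operand
}
```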


If your data actually is misaligned most of the time, for 256-bit vectors it can still be OK to just let the CPU handle it. (Beware Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd? with GCC's default tuning before GCC11.) It may also be worth having extra code that checks alignment and handles a misaligned start with an unaligned first vector, possibly overlapping with the first aligned vector, if your SIMD operation is OK with that (see the sketch below).
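
A minimal sketch of that overlap trick, assuming an idempotent per-element operation (OR) so reprocessing the overlapping elements is harmless; the function name and the omitted tail handling are my own illustration:

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Hypothetical example: set the low bit of every element, in place, for n >= 8.
// OR is idempotent, so it doesn't matter that the first aligned vector may
// reprocess elements already covered by the unaligned head vector.
void set_low_bit(std::uint64_t* x, std::size_t n) {
    const __m512i one = _mm512_set1_epi64(1);

    // Unaligned head: covers x[0..7] regardless of alignment.
    _mm512_storeu_si512(x, _mm512_or_si512(_mm512_loadu_si512(x), one));

    // First element index at the next 64-byte boundary of x.
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(x);
    std::size_t i = (((addr + 63) & ~std::uintptr_t(63)) - addr) / sizeof(std::uint64_t);

    // Aligned main loop (full vectors only; the final partial vector would
    // need a masked or scalar tail, omitted here for brevity).
    for (; i + 8 <= n; i += 8) {
        __m512i v = _mm512_load_si512(x + i);
        _mm512_store_si512(x + i, _mm512_or_si512(v, one));
    }
}
```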

See How can I accurately benchmark unaligned access speed on x86_64? for some details on the performance effects, and What's the actual effect of successful unaligned accesses on x86?.


Misaligned 512-bit vectors are extra slow

The interesting thing about 512-bit vectors is that any misalignment necessarily means a cache-line split: a misaligned 64-byte access can't still be contained within a single 64-byte cache line, the way a 16-byte or 32-byte vector can. (In those cases, Intel since Haswell and AMD since Zen 2 or 3 I think(?) still have full performance for unaligned load/store.) See https://uops.info/ and https://agner.org/optimize/ , and https://travisdowns.github.io/blog/2019/06/11/speed-limits.html re: which sub-cache-line boundary crossings matter for loads vs. stores on Zen 1, 2, and 3, e.g. crossing a 32-byte boundary can matter.

At least for Intel CPUs, aligning your data is quite useful for AVX-512 even if your code bottlenecks on DRAM bandwidth; apparently unaligned 64-byte loads that actually are misaligned at runtime lead to lower per-core memory bandwidth, by maybe 15% or so, vs. only a couple percent for code using 256-bit vectors, if that.
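
If you control the allocation, getting 64-byte alignment is cheap. A minimal sketch using standard C++17 facilities (not something specific to this answer; MSVC users would use _aligned_malloc or _mm_malloc instead):

```cpp
#include <cstdlib>    // std::aligned_alloc, std::free
#include <cstddef>
#include <cstdint>

int main() {
    // C++17 std::aligned_alloc: the size should be a multiple of the alignment.
    std::size_t n = std::size_t(1) << 20;              // number of elements
    std::size_t bytes = n * sizeof(std::int64_t);      // a multiple of 64 here
    auto* buf = static_cast<std::int64_t*>(std::aligned_alloc(64, bytes));
    // ... fill buf and use it with aligned 512-bit loads/stores ...
    std::free(buf);

    // For a fixed-size buffer, alignas(64) works too:
    // alignas(64) std::int64_t arr[1024];
}
```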

Housefly answered 13/12, 2022 at 16:11 Comment(1)
64-byte misaligned loads usually run at one load per cycle on SKX/CLX processors -- effectively using both 64-byte L1D cache read interfaces to grab the consecutive cache lines, then shifting/merging the results into the 512-bit destination register. For reasons that are not completely clear, even this 1/2-cycle penalty leads to a non-negligible drop in throughput for data further out in the memory hierarchy. There is a larger penalty for misaligned loads that cross a page boundary. – Empale
