Is there any data on AVX2 gather latency?
(for instance a _mm256_i32gather_ps instruction accessing a single cache line)
This page gives latency data for all intrinsics:
The latency for _mm256_i32gather_ps is 6.
Actually, this really depends on the hardware. If you look at Agner Fog's instruction tables, you'll see that no latencies are listed for Zen1 and Zen2, only reciprocal throughputs of 13-20 and 9-16 for VGATHERDPS. For Intel processors we have:
                  xmm                   ymm
Processor    throughput  latency   throughput  latency
-------------------------------------------------------
Haswell           9         -          12         -
Broadwell         6         -           7         -
Skylake           4        12           5        13
SkylakeX          4        12           5        13
Coffee Lake       4        12           5        13
Also, Intel's site no longer lists the throughputs/latencies of the gather instructions for AVX2, but there are some for AVX512.
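Since the published numbers vary this much between microarchitectures, it may be easiest to measure on your own machine. Below is a minimal sketch of a latency microbenchmark, assuming GCC or Clang on x86-64. The helper name `gather_chain_ticks` and the 16-entry table are made up for illustration. Each gather's result is used as the index vector of the next gather, so the loop is bound by latency rather than throughput, and the table is aligned so that every access stays within one 64-byte cache line.

```c
#include <stdint.h>
#include <x86intrin.h>   /* AVX2 intrinsics and __rdtsc() (GCC/Clang) */

/* 16 ints = 64 bytes, aligned so the whole table sits in one cache line. */
static int32_t table[16] __attribute__((aligned(64)));

/* Hypothetical helper: runs `iters` dependent gathers and returns the
   elapsed TSC ticks. The final index vector is stored to out[0..7].
   The target attribute lets this build without -mavx2; the caller is
   expected to check __builtin_cpu_supports("avx2") first. */
__attribute__((target("avx2")))
uint64_t gather_chain_ticks(long iters, int32_t out[8])
{
    for (int i = 0; i < 16; ++i)
        table[i] = (i + 1) % 16;          /* each entry points to the next */

    __m256i idx = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);

    uint64_t start = __rdtsc();
    for (long i = 0; i < iters; ++i)
        idx = _mm256_i32gather_epi32(table, idx, 4);  /* loop-carried dependency */
    uint64_t stop = __rdtsc();

    _mm256_storeu_si256((__m256i *)out, idx);
    return stop - start;
}
```

Dividing the returned tick count by `iters` gives a rough per-gather latency. Note that the TSC counts reference ticks, which can differ from core clock cycles, and that this loop only covers the warm case where the line is already in L1, so treat the result as an estimate rather than a definitive figure.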
(for instance a _mm256_i32gather_ps instruction accessing a single cache line)
There's an extremely odd detail about the gather instructions on Intel architectures: They are non-temporal loads that work on any memory type, but unlike your ordinary non-temporal loads they don't just avoid polluting L2 and L3 data caches - they also don't result in changes to the 1st-level TLB cache. (They do appear to update the 2nd level TLB cache though.)
So the answer is: even when all reads access the same cache line, each single read can not only miss the cache but also trigger a page walk to compensate for the missing TLB entry. If the data wasn't already prefetched, this gives you latencies worse than any other instruction I'm aware of, and it renders this instruction extremely situational.