Is there any data on the latency of an AVX2 gather instruction?

Asked 22/7, 2013 at 14:18 Answered 4/4, 2024 at 18:7

Solved performance x86 latency micro-optimization avx2

Is there any data on AVX2 gather latency?

(for instance a _mm256_i32gather_ps instruction accessing a single cache line)

Parrott answered 22/7, 2013 at 14:18 Comment(1)

Just one empirical data point - I ran a quick benchmark for a gathered load recently and throughput was pretty bad - I was loading a split vector, so the first half of the vector came from one cache line and the second half from another - it seemed to take quite a few cycles. – Baby 22/7, 2013 at 15:36

This page gives latency data for all intrinsics:

Intel Intrinsics Guide

The latency for _mm256_i32gather_ps is 6.

Pot answered 17/2, 2014 at 12:8 Comment(2)

NB: those are the minimum latencies. – Bureau 8/7, 2016 at 0:35

No way, the minimum latency is something like 17 cycles for the smallest gathers (2 elements) and 22 cycles for the large ones (8 elements like the DD forms), at least if you measure in the usual way from address input to result. – Froze 19/6, 2019 at 20:17

Actually, this really depends on the hardware. If you look at Agner Fog's instruction tables, you'll see that there are no latencies listed for Zen1 and Zen2, but have reciprocal throughputs of 13-20 and 9-16 for VGATHERDPS. For Intel processors we have:

                     xmm                 ymm
Processor    throughput latency  throughput latency
-------------------------------------------------------
Haswell          9                    12
Broadwell        6                     7
Skylake          4         12          5       13
SkylakeX         4         12          5       13
Coffee Lake      4         12          5       13

Also, Intel's site no longer lists the throughput/latencies of of the gather instructions for AVX2, but there are some for AVX512.

Hatch answered 11/11, 2020 at 8:15 Comment(4)

Also worth checking uops.info - they publish the microbenchmark instruction sequences they used, and automate the tests. – Shaeffer 11/11, 2020 at 8:18

@PeterCordes Wow. Thanks for that amazing resource. Almost too much information in those tables. – Hatch 11/11, 2020 at 8:26

I always turn off the "documentation" and "IACA" columns; I don't care about them being sometimes wrong, I just want real measurements. That cuts down clutter enough to compare data from a couple microarchitectures. – Shaeffer 11/11, 2020 at 8:44

It appears to have gotten worse in Comet Lake - checking the performance counter shows 100% 1st level TLB cache misses when using any of the gather instructions, pushing this to a latency of way beyond 100+ cycles. Meanwhile, doing the same loads via scalar ports runs approximately 50x times faster. – Coquillage 4/4, 2024 at 15:31

for instance a _mm256_i32gather_ps instruction accessing a single cache line)

There's an extremely odd detail about the gather instructions on Intel architectures: They are non-temporal loads that work on any memory type, but unlike your ordinary non-temporal loads they don't just avoid polluting L2 and L3 data caches - they also don't result in changes to the 1st-level TLB cache. (They do appear to update the 2nd level TLB cache though.)

So the answer is: Even when accessing the same cache line from all reads, you can end up with each single read not only missing the cache, but also each single read triggering a page walk to compensate for the missing TLB entry. Giving you latencies worse than any other instruction I'm aware of, if the data wasn't already preftched and rendering this instruction extremely situational.

Coquillage answered 4/4, 2024 at 18:7 Comment(3)

Could you link the source of this information? – Yearly 6/4, 2024 at 6:13

@AlexGuteniev will have to make some reproducible benchmark. It popped up as the only possible conclusion when inspecting performance counters on Comet Lake with VTune, where it was showing excessive stalls due to unreasonably high primary TLB cache misses even on repeated access to the same data. I could not find any documentation myself either on this behavior. – Coquillage 6/4, 2024 at 7:58

@AlexGuteniev I had more than one page (but far less than TLB or L1 data cache size) worth of LUT data in the data segment when reproducing this, and I didn't touch the data with anything other than gather instructions. So the TLB was guaranteed to be still pristine. – Coquillage 6/4, 2024 at 8:2

Recommended topics

Hot tags