Intel MKL multi-threaded matrix-vector multiplication sgemv() slow after short breaks
I need to run a multi-threaded matrix-vector multiplication every 500 microseconds. The matrix is the same, the vector changes every time.

I use Intel's sgemv() from the MKL on a 64-core AMD CPU. If I run the multiplications back to back in a for-loop in a small test program, each sgemv() call takes 20 microseconds. If I add a spin loop (polling the TSC) of about 500 microseconds to the for-loop body, the time per sgemv() call rises to 30 microseconds with OMP_WAIT_POLICY=ACTIVE, and even to 60 microseconds with OMP_WAIT_POLICY=PASSIVE (the default).

Does anybody know what could be going on and why it is slower with the breaks? And what can be done to avoid this?

It doesn't seem to make a difference whether the spin loop is single-threaded or inside a "#pragma omp parallel" context. It also makes no difference whether I keep the AVX units busy in the spin loop or not. The CPU cores are isolated, and the test program runs at high priority with SCHED_FIFO (this is on Linux).

Spin wait function:

static void spin_wait(uint64_t num_ticks)  // take uint64_t to avoid a signed/unsigned comparison with rdtsc()
{
  uint64_t const start = rdtsc();
  while( rdtsc() - start < num_ticks )
  {;}
}

for-loop:

uint64_t t0[num], t1[num];
for( int i=0; i<num; i++ )
{
  // modify input vector, just incrementing each element

  t0[i] = rdtsc();
  cblas_sgemv(...);
  t1[i] = rdtsc();
  spin_wait( ticks_per_500us );  // TSC tick count corresponding to ~500 µs
}
Professional answered 23/2, 2023 at 18:7 Comment(14)
Which "breaks" are you talking about? Can you provide some example code to clarify what you are doing and be more precise? Every detail matters a lot at this time granularity.Mesnalty
I added some example code. Without the spin_wait(), t1[i]-t0[i] is 20 microseconds on average; with the spin_wait() it is 30-60 microseconds, depending on OMP_WAIT_POLICY.Professional
For OMP_WAIT_POLICY=PASSIVE, this is expected behaviour, unfortunately. See this previous post about a similar problem. I do not have any explanation for the other two cases yet. I wonder if this could be due to the power consumption of the rdtsc loop impacting the frequency of the cores. Can you try to stabilize the frequency of the cores as pointed out in the linked answer? Note that you should choose a relatively low frequency so as to avoid overheating. The turbo must be disabled too for this check.Mesnalty
Note that 20 µs is very small for this computation if it is a parallel one. It takes time to communicate between the 64 cores and perform a barrier; I expect this overhead to be about ~10 µs on such a machine. In fact, a single contended atomic access taking 40 ns per access would last 2.5 µs when applied across 64 cores, and OpenMP needs multiple atomic accesses (though some runtimes use a tree-based reduction to avoid such critical cases). Still, one needs to consider cache-line sharing, NUMA effects, and so on... It may be a good idea to use only a few cores, like a single CCX/CCD.Mesnalty
Finally, can you try to add a _mm_pause intrinsic in your loop, or even a loop doing dozens of calls to it? It might help to reduce the power consumption a bit, assuming the problem is with rdtsc. AFAIK, _mm_pause does not impact the frequency, so the above test is still useful and can be combined with this one if this alone is not enough to see any impact.Mesnalty
I forgot to mention that this computer is highly tuned for low latency, both on the hardware (BIOS) and OS level. Everything that could cause indeterminism is switched off, no frequency scaling for example. AFAIK, AMD Zen 2 does not scale down frequencies for AVX instructions.Professional
I see the same effect if I mostly do some AVX additions on a few AVX registers over and over in spin_wait() and only rarely call rdtsc().Professional
For some applications, 20 microseconds is ages :) While I do believe that 20 microseconds is not the minimum for this MVM, that's a different topic :)Professional
NUMA, the Zen 2 architecture, overhead etc. are all good thoughts, but I can't see why things would take longer after the spin_wait(). I'll try _mm_pause(). With fewer cores, sgemv() takes longer. The peak seems to be reached at 61 cores for my particular size, and 64 cores is equally fast.Professional
Note that you said the machine is "highly tuned for low latency", which means the processor should operate at a high frequency. But processors cannot sustain the same high frequency when many cores are active, because they would exceed the power budget. The processor needs to use a lower frequency when many cores are in use, and this is not something you can tune in software AFAIK. The only way to fix this non-deterministic behaviour is to use a low frequency, which is not great for low latency. AFAIK you cannot have both low latency and determinism.Mesnalty
If you use a low frequency (the one used by the processor when all cores are executing AVX-heavy instructions like DP-FP ones) AND the turbo (which can still be enabled at a lower frequency on some CPUs), then the frequency should be 100% stable.Mesnalty
You can use perf to check the frequency, the cache misses, and more low-level things like the frontend/backend usage.Mesnalty
The CPUs run at their nominal frequency; there is no scaling up or down. Determinism is a prerequisite for low latency: if a system isn't deterministic, it can't be low-latency, so they go together well. Note that tens of microseconds is anything but low latency. Btw, AMD's CPUs don't need to slow down for AVX, AFAIK, as some Intel CPUs do. But even the AVX down-clocking is safe in my experience at the latency levels I'm interested in.Professional
If this isn't anything MKL or OpenMP does, I'm wondering whether the CPU's prefetcher and branch predictor could lose their state in the spin_wait()...? The data shouldn't be flushed from the caches, I'd expect.Professional
Might have something to do with context switching, since you are not using a "real" real-time OS. It might also be something cache-related (or both). Depending on the prediction algorithms and the size of your problem, cache prefetching might simply work better if your code is still "hot" and you repeat it thousands of times in succession (even if the µs range seems quite large for a cache-related cause IMHO, unless RAM access is additionally involved). I would also still not exclude frequency scaling as the cause, since the processor might run into a power limit forcing it to scale down a bit (AVX2 instructions are usually quite power-hungry…)

Lewiss answered 31/3, 2023 at 0:56 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.