I need to run a multi-threaded matrix-vector multiplication every 500 microseconds. The matrix is the same, the vector changes every time.
I use Intel's sgemv() from the MKL on a 64-core AMD CPU. If I run the multiplications back to back in a for-loop in a small test program, each call of sgemv() takes 20 microseconds. If I add a spin loop (polling the TSC) of about 500 microseconds to each iteration, the time per sgemv() call increases to 30 microseconds with OMP_WAIT_POLICY=ACTIVE, and even to 60 microseconds with OMP_WAIT_POLICY=PASSIVE (the default).
Does anybody know what could be going on and why it is slower with the breaks? And what can be done to avoid this?
It doesn't seem to make a difference whether the spin loop is single-threaded or runs in a "#pragma omp parallel" context. It also makes no difference whether or not I keep the AVX units busy in the spin loop. This is on Linux: the CPU cores are isolated and the test program runs at high priority with SCHED_FIFO.
Spin wait function:
    static void spin_wait(uint64_t ticks)
    {
        // busy-wait until `ticks` TSC ticks have elapsed since entry
        uint64_t const start = rdtsc();
        while( rdtsc() - start < ticks )
        {;}
    }
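for-loop:
    uint64_t t0[num], t1[num];
    for( int i=0; i<num; i++ )
    {
        // modify input vector, just incrementing each element
        t0[i] = rdtsc();      // TSC before the call
        cblas_sgemv(...);
        t1[i] = rdtsc();      // TSC after the call; t1[i] - t0[i] is the duration in ticks
        spin_wait( 500us );   // pseudocode: 500 microseconds, expressed in TSC ticks
    }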
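The loop above passes a 500-microsecond interval to spin_wait(), which counts TSC ticks, and the reported 20/30/60 microseconds likewise come from TSC differences. One possible way to convert between the two is sketched below. It is not taken from the original test program: the helper names now_ns(), tsc_ticks_per_us() and us_to_ticks() are hypothetical, __rdtsc() from <x86intrin.h> (GCC/Clang) stands in for the post's rdtsc(), and an invariant TSC is assumed.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>
#include <x86intrin.h>   /* __rdtsc(), GCC/Clang on x86 only */

/* Monotonic clock in nanoseconds (hypothetical helper). */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Estimate TSC ticks per microsecond by comparing the TSC against
   CLOCK_MONOTONIC over a busy interval of roughly 10 ms.
   Assumption: the TSC is invariant (constant rate). */
static double tsc_ticks_per_us(void)
{
    uint64_t const ns0 = now_ns();
    uint64_t const c0  = __rdtsc();
    while (now_ns() - ns0 < 10000000ull) { /* busy-wait about 10 ms */ }
    uint64_t const c1  = __rdtsc();
    uint64_t const ns1 = now_ns();
    /* ticks per ns, scaled by 1000 to get ticks per us */
    return (double)(c1 - c0) * 1000.0 / (double)(ns1 - ns0);
}

/* Convert a duration in microseconds to TSC ticks. */
static uint64_t us_to_ticks(double us, double ticks_per_us)
{
    assert(us >= 0.0 && ticks_per_us > 0.0);
    return (uint64_t)(us * ticks_per_us);
}
```

With these helpers, the spin interval could be computed once before the loop, for example as us_to_ticks(500.0, tsc_ticks_per_us()), and a measured difference t1[i] - t0[i] divided by the same factor gives microseconds.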
With OMP_WAIT_POLICY=PASSIVE, this is unfortunately expected behaviour. See this previous post about a similar problem. I do not have an explanation for the other two cases yet. I wonder if this could be due to the power consumption of the rdtsc loop impacting the frequency of the cores. Can you try to stabilize the frequency of the cores as pointed out in the provided answer? Note that you should choose a relatively low frequency so as to avoid overheating. Turbo must be disabled too for this check. – Mesnalty

Can you try the _mm_pause intrinsic in your loop, or even a loop doing dozens of calls to it? It might help to reduce the power consumption a bit, assuming this is a problem with rdtsc. AFAIK, _mm_pause does not impact the frequency, so the above test is still useful and can be combined with this one if this is not enough to see any impact. – Mesnalty

You can use perf to do some checks on the frequency, the cache misses, and lower-level things like the frontend/backend usage. – Mesnalty
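The _mm_pause suggestion from the comments could look roughly like the sketch below. The function name spin_wait_pause() and the burst length of 32 are arbitrary choices made for illustration, and __rdtsc() from <x86intrin.h> (GCC/Clang) is assumed in place of the post's rdtsc(). Whether this changes the sgemv() timings has not been measured here.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() and _mm_pause(), GCC/Clang on x86 only */

/* Busy-wait for at least `ticks` TSC ticks, issuing a burst of PAUSE
   instructions between TSC reads instead of polling the TSC back to back.
   Returns the number of ticks actually waited, so the overshoot caused
   by the burst can be checked. */
static uint64_t spin_wait_pause(uint64_t ticks)
{
    uint64_t const start = __rdtsc();
    uint64_t elapsed = 0;
    while (elapsed < ticks)
    {
        for (int i = 0; i < 32; i++)   /* 32 is an untuned, arbitrary burst length */
            _mm_pause();
        elapsed = __rdtsc() - start;
    }
    return elapsed;
}
```

The wait can overshoot by up to one burst of pauses, so the burst length trades TSC polling rate against timing granularity.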