I am running a test to measure the message-synchronization latency between CPU cores. Specifically, I am measuring how many clock cycles it takes CPU2 to detect a change that CPU1 made to shared data. Both CPU1 and CPU2 use the `rdtsc` instruction to record timestamps. I have observed inconsistent behavior between an Intel and an AMD platform. Any thoughts or suggestions regarding this issue are welcome.
The code is as follows. The program runs two threads, "ping" and "pong", on CPU1 and CPU2 respectively. As their names suggest, they take turns incrementing their own slot of a shared array (`shd_data`) to implement a ping-pong loop, and the total running time is finally measured as the minimum of `ts_end - ts_start` and `tb_end - tb_start`. To also measure the average time of the "ping" operation, I added `rdtsc(tsc_start[loop]);` and `rdtsc(tsc_end[loop]);` inside the while loops. However, these two lines of monitoring code significantly affect the total running time, increasing the average ping-pong round trip from about 180 cycles to about 290 cycles. I cannot find a reasonable explanation for this.
#define MAX_LOOPS 10
#define BIDX 0
#define SIDX 15
static volatile int start_flag = 0;
static int shd_data[16];
unsigned long tb_start , tb_end, ts_start, ts_end;
unsigned long gap[12];
unsigned long tsc_start[MAX_LOOPS];
unsigned long tsc_end[MAX_LOOPS];
void* ping(void *args)
{
    int loop = 0;
    int old = 0;
    int cur = 0;

    memset(tsc_start, 0, MAX_LOOPS * sizeof(unsigned long));
    memset(shd_data, 0, sizeof(shd_data)); // preheat
    while (start_flag == 0) ;              // sync. start
    rdtsc(tb_start);
    while (++loop < MAX_LOOPS) {
        rdtsc(tsc_start[loop]);
        WRITE_ONCE(shd_data[BIDX], shd_data[BIDX] + 1);
        do {
            cur = READ_ONCE(shd_data[SIDX]);
        } while (cur <= old);
        old = cur;
    }
    WRITE_ONCE(shd_data[BIDX], shd_data[BIDX] + 1);
    rdtsc(tb_end);
    return NULL;
}

void* pong(void *args)
{
    int loop = 0;
    int old = 0;
    int cur = 0;

    memset(tsc_end, 0, MAX_LOOPS * sizeof(unsigned long));
    memset(shd_data, 0, sizeof(shd_data)); // preheat
    while (start_flag == 0) ;              // sync. start
    rdtsc(ts_start);
    while (++loop < MAX_LOOPS) {
        do {
            cur = READ_ONCE(shd_data[BIDX]);
            rdtsc(tsc_end[loop]);
        } while (cur <= old);
        old = cur;
        WRITE_ONCE(shd_data[SIDX], shd_data[SIDX] + 1);
    }
    WRITE_ONCE(shd_data[SIDX], shd_data[SIDX] + 1);
    rdtsc(ts_end);
    return NULL;
}
The experimental machine is an AMD 3910X with gcc 9.4.0. I also ran the same code on an Intel i9-9900. Interestingly, on the Intel CPU, monitoring the "ping" operation with the `rdtsc` code did not affect the overall ping-pong time, which remained around 400 cycles. I am not sure whether this phenomenon can be reproduced on other AMD or Intel CPUs, as these are the only two machines I have access to at the moment.
Update:
By adjusting CPU frequency scaling, the CPU frequency on both machines is fixed at the base frequency of 3.6 GHz.
You can find the complete runnable code from the link below:
Comments:

(The posted code omits the `#include`s and a `main`, so people couldn't copy/paste and try it themselves without some work to fill in the gaps.) – Pledget

Where is `rdtsc()` defined? Does it include an `lfence` to block out-of-order exec of the uops that read the TSC? That's not the name of a compiler intrinsic for the instruction. – Pledget

The code uses `volatile` accesses on GCC/Clang to roll its own equivalent of `std::atomic_ref` with `memory_order_relaxed`. This code compiles the same as if they'd used `atomic` with `relaxed`, and cache coherency means that stores on one core will invalidate caches on other cores before they can commit. Extra barriers won't make visibility happen sooner, just make threads stall. – Pledget