Overhead-free monitoring code on an AMD CPU significantly increases the total synchronization duration
I am conducting a test to measure the message synchronization latency between different cores of a CPU. Specifically, I am measuring how many clock cycles it takes for CPU2 to detect changes made to shared data by CPU1. Both CPU1 and CPU2 use the rdtsc instruction to record timing. I have observed inconsistent behavior between Intel and AMD CPU platforms. Any thoughts or suggestions you may have regarding this issue are welcome.

The code is as follows. The program contains two threads, "ping" and "pong", running on CPU1 and CPU2 respectively. As their names suggest, they take turns incrementing their own shared data (shd_data) to implement a ping-pong loop, and the total running time is finally taken as the minimum of ts_end-ts_start and tb_end-tb_start. In order to also measure the average time of the "ping" operations, I added rdtsc(tsc_start[loop]); and rdtsc(tsc_end[loop]); inside the while loop. However, these two lines of monitoring code significantly affected the total running time, increasing the average ping-pong loop time from about 180 cycles to 290 cycles. I cannot find a reasonable explanation for this.

#define MAX_LOOPS  10
#define BIDX 0
#define SIDX 15

static volatile int start_flag = 0;
static int shd_data[16];
unsigned long tb_start , tb_end, ts_start, ts_end;
unsigned long gap[12];
unsigned long tsc_start[MAX_LOOPS];
unsigned long tsc_end[MAX_LOOPS];

void* ping(void *args)
{
    int loop=0;
    int old = 0;
    int cur = 0;

    memset(tsc_start, 0, MAX_LOOPS*sizeof(unsigned long));
    memset(shd_data, 0, 64);    // preheat
    while (start_flag == 0) ;   // sync. start

    rdtsc(tb_start);
    while (++loop < MAX_LOOPS) {
        rdtsc(tsc_start[loop]);
        WRITE_ONCE(shd_data[BIDX], shd_data[BIDX]+1);
        do {
            cur = READ_ONCE(shd_data[SIDX]);
        } while (cur <= old);
        old = cur;
    }
    WRITE_ONCE(shd_data[BIDX], shd_data[BIDX]+1);
    rdtsc(tb_end);

    return NULL;
}


void* pong(void *args)
{
    int loop=0;
    int old = 0;
    int cur = 0;
    
    memset(tsc_end, 0, MAX_LOOPS*sizeof(unsigned long));
    memset(shd_data, 0, 64);    //preheat
    while (start_flag == 0) ;   // sync. start

    rdtsc(ts_start);
    while (++loop < MAX_LOOPS) {
        do {
            cur = READ_ONCE(shd_data[BIDX]);
            rdtsc(tsc_end[loop]);
        } while (cur <= old);
        old = cur;
        WRITE_ONCE(shd_data[SIDX], shd_data[SIDX]+1);
    }
    WRITE_ONCE(shd_data[SIDX], shd_data[SIDX]+1);
    rdtsc(ts_end);

    return NULL;
}
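The code relies on rdtsc, READ_ONCE, and WRITE_ONCE helpers whose definitions are not shown here (they are in the linked snippet). A minimal sketch of plausible GCC/x86 definitions, assuming Linux-kernel-style semantics, might look like this; the asker's actual macros may differ (for example, they may add lfence around rdtsc):

```c
#include <stdint.h>

/* Assumed definitions -- the actual macros are in the linked onlinegdb
 * snippet and may differ. */

/* Read the 64-bit time-stamp counter into var (x86/x86-64 only). */
#define rdtsc(var) do {                                         \
        uint32_t lo_, hi_;                                      \
        __asm__ __volatile__("rdtsc" : "=a"(lo_), "=d"(hi_));   \
        (var) = ((uint64_t)hi_ << 32) | lo_;                    \
    } while (0)

/* Volatile accesses, Linux-kernel style: the compiler must not cache,
 * tear, or re-merge these loads/stores. */
#define READ_ONCE(x)      (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v)  (*(volatile __typeof__(x) *)&(x) = (v))
```

With these definitions the question's code compiles to plain loads/stores plus rdtsc, with no fences, matching the comments below about relaxed-atomic-equivalent code generation.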

The experimental machine is an AMD Threadripper 3970X with gcc 9.4.0. I also ran the same code on an Intel i9-9900 CPU. Interestingly, monitoring the "ping" operation with the rdtsc code did not affect the overall ping-pong time there, which remained around 400 cycles. I am not sure whether this phenomenon can be replicated on other AMD or Intel CPUs, as I only have access to these two machines at the moment.

Update:

By adjusting CPU frequency scaling, I fixed the CPU frequency on both machines at the base frequency of 3.6 GHz.

You can find the complete runnable code at the link below:

https://onlinegdb.com/HoCs3AChhp

Chema answered 24/11, 2023 at 14:0 Comment(10)
Does this answer your question? READ_ONCE and WRITE_ONCE in Parallel programming – Vickivickie
@tevemadar: That's not an answer to this performance question; it's just something you need to know to read this code, which isn't a minimal reproducible example (missing #includes and a main, so people couldn't copy/paste and try it themselves without some work to fill in the gaps). – Pledget
Did you do any warm-up to get the CPU frequency up higher than idle? RDTSC counts reference cycles, not core clock cycles (How to get the CPU cycle count in x86_64 from C++?). Also, on Intel client (non-Xeon) CPUs, the uncore (interconnect between cores) only runs as fast as the max non-turbo clock that any core is running at, I think, or maybe it is fully independent. See Slowing down CPU Frequency by imposing memory stress – hardware P-state management on Intel CPUs will lower clock speeds when code is memory-bound. – Pledget
How is your rdtsc() defined? Does it include an lfence to block out-of-order exec of the uops that read the TSC? That's not the name of a compiler intrinsic for the instruction. – Pledget
@PeterCordes in my understanding: this code doesn't have synchronization, it relies on occasional cache flushes only. They are likely triggered by everything else that the machine is doing, and presumably the larger the number of cores and the larger the per-core cache are, the less frequent those flushes will be. – Vickivickie
@tevemadar: Cache is coherent, see When to use volatile with multi threading?. That's why the Linux kernel can use volatile accesses on GCC/Clang to roll its own equivalent of std::atomic_ref with memory_order_relaxed. This code compiles the same as if they'd used atomic with relaxed, and cache coherency means that stores on one core will invalidate caches on other cores before they can commit. Extra barriers won't make visibility happen sooner, just make threads stall. – Pledget
@PeterCordes Sorry for the late response as the earth is not flat. I have updated the question as a supplement to your doubts. Adding lfence instructions may slightly increase the overall execution time, but it does not affect the observed phenomenon. – Chema
@Vickivickie Thanks for your response. I have updated the question, adding a link to the complete code. – Chema
Are you running this inside a VM on either or both machines? RDTSC can be a vmexit if the CPU doesn't support offset and scale factors for virtualizing the TSC. I'm pretty sure your Intel CPU supports that but IDK about your AMD. – Pledget
All tests were conducted on real physical machines, not VMs. – Chema
Brief answer: Cache line contention

The added monitoring code, such as the rdtsc() calls mentioned in the question, changes the layout of the variables in the data segment, leading to cache-line contention at runtime. Changing the value of MAX_LOOPS has the same effect. Note that cache contention can also be triggered by speculative execution.

How to fix

Adding __attribute__((aligned(64))) to the definition of the shd_data array ensures that it starts on a cache-line boundary, so adding or removing other globals no longer shifts it relative to cache lines.
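As a sketch, the suggested fix looks like the following (I keep the question's shd_data[16]; the answer mentions shd_data[32], so the exact size is an assumption):

```c
#include <stdint.h>

/* Cache-line-align the shared array so its placement relative to
 * 64-byte cache lines no longer changes when unrelated globals (such
 * as the tsc_start/tsc_end monitoring arrays) are added or resized. */
static int shd_data[16] __attribute__((aligned(64)));

/* Returns nonzero if the array starts exactly on a 64-byte boundary. */
static int shd_data_is_cacheline_aligned(void)
{
    return ((uintptr_t)shd_data % 64) == 0;
}
```

With 16 ints (64 bytes) the whole array occupies exactly one cache line; with a larger aligned array one could also place BIDX and SIDX in different cache lines to separate the two directions of the ping-pong.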

Constraints

The situations mentioned above are specific to the AMD 3970X CPU and may also apply to the Zen 2 architecture or other AMD CPUs.

I also conducted this experiment on an Intel i9-9900K, where the data-segment layouts of the two programs were identical. However, I have not observed cache thrashing on the Intel CPU, which is something I have not fully understood yet.

Chema answered 1/12, 2023 at 12:12 Comment(1)
Maybe related: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings? / Why does false sharing still affect non atomics, but much less than atomics?. On Intel, check perf counters for machine_clears.memory_ordering since you're spin-waiting without _mm_pause(). – Pledget
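The _mm_pause() suggestion in the comment above could be applied to the wait loops like this hypothetical sketch (a simplified stand-in for the question's do/while spin on shd_data, not the asker's actual code):

```c
#include <immintrin.h>

static volatile int flag = 0;

/* Spin until flag becomes nonzero. _mm_pause() hints to the CPU that
 * this is a spin-wait loop, which reduces memory-ordering machine
 * clears when the waited-on line is finally invalidated, and saves
 * power while spinning. */
static int wait_for_flag(void)
{
    while (flag == 0)
        _mm_pause();
    return flag;
}
```

In the question's code the pause would go inside the do { cur = READ_ONCE(...); } while (cur <= old); loops of both threads; note that it may add a few cycles of wake-up latency per iteration in exchange for fewer machine clears.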
