memset in parallel with threads bound to each physical core

I have been testing the code from the question In an OpenMP parallel code, would there be any benefit for memset to be run in parallel? and I'm observing something unexpected.

My system is a single-socket Xeon E5-1620, an Ivy Bridge processor with 4 physical cores and 8 hyper-threads. I'm using Ubuntu 14.04 LTS, Linux kernel 3.13, GCC 4.9.0, and EGLIBC 2.19. I compile with gcc -fopenmp -O3 mem.c.

When I run the code in the link it defaults to eight threads and gives

Touch:   11830.448 MB/s
Rewrite: 18133.428 MB/s

However, when I bind the threads and set the number of threads to the number of physical cores like this

export OMP_NUM_THREADS=4 
export OMP_PROC_BIND=true

I get

Touch:   22167.854 MB/s
Rewrite: 18291.134 MB/s

The touch rate has nearly doubled! Running the test several times after binding, touch is always faster than rewrite. I don't understand this. Why is touch faster than rewrite after binding the threads and limiting them to the number of physical cores? Why has the touch rate nearly doubled?

Here is the code I used, taken without modification from Hristo Iliev's answer.

#include <stdio.h>
#include <string.h>
#include <stdlib.h> // malloc/free
#include <omp.h>

void zero(char *buf, size_t size)
{
    size_t my_start, my_size;

    if (omp_in_parallel())
    {
        int id = omp_get_thread_num();
        int num = omp_get_num_threads();

        my_start = (id*size)/num;
        my_size = ((id+1)*size)/num - my_start;
    }
    else
    {
        my_start = 0;
        my_size = size;
    }

    memset(buf + my_start, 0, my_size);
}

int main (void)
{
    char *buf;
    size_t size = 1L << 31; // 2 GiB
    double tmr;

    buf = malloc(size);

    // Touch
    tmr = -omp_get_wtime();
    #pragma omp parallel
    {
        zero(buf, size);
    }
    tmr += omp_get_wtime();
    printf("Touch:   %.3f MB/s\n", size/(1.e+6*tmr));

    // Rewrite
    tmr = -omp_get_wtime();
    #pragma omp parallel
    {
        zero(buf, size);
    }
    tmr += omp_get_wtime();
    printf("Rewrite: %.3f MB/s\n", size/(1.e+6*tmr));

    free(buf);

    return 0;
}

Edit: Without thread binding, but using four threads, here are the results from eight runs.

Touch:   14723.115 MB/s, Rewrite: 16382.292 MB/s
Touch:   14433.322 MB/s, Rewrite: 16475.091 MB/s 
Touch:   14354.741 MB/s, Rewrite: 16451.255 MB/s  
Touch:   21681.973 MB/s, Rewrite: 18212.101 MB/s 
Touch:   21004.233 MB/s, Rewrite: 17819.072 MB/s 
Touch:   20889.179 MB/s, Rewrite: 18111.317 MB/s 
Touch:   14528.656 MB/s, Rewrite: 16495.861 MB/s
Touch:   20958.696 MB/s, Rewrite: 18153.072 MB/s

Edit:

I tested this code on two other systems and I can't reproduce the problem on them:

i5-4250U (Haswell) - 2 physical cores, 4 hyper-threads

4 threads unbound
    Touch:   5959.721 MB/s, Rewrite: 9524.160 MB/s
2 threads bound to each physical core
    Touch:   7263.175 MB/s, Rewrite: 9246.911 MB/s

Four-socket E7-4850 - 10 physical cores, 20 hyper-threads per socket

80 threads unbound
    Touch:   10177.932 MB/s, Rewrite: 25883.520 MB/s
40 threads bound
    Touch:   10254.678 MB/s, Rewrite: 30665.935 MB/s

This shows that binding the threads to the physical cores improves both touch and rewrite somewhat, but touch remains slower than rewrite on these two systems.

I also tested three different variations of memset: my_memset, my_memset_stream, and A_memset. The functions my_memset and my_memset_stream are defined below. The function A_memset comes from Agner Fog's asmlib.
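For the variants, the touch/rewrite harness stays the same and only the call inside zero() changes; a zero()-style wrapper for the streaming case would look roughly like this (a sketch, not necessarily the exact wiring used; the cast is needed because the variants work on ints, and the per-thread offsets stay 16-byte aligned when the 2 GiB buffer is split over 4 threads):

void zero_stream(char *buf, size_t size)
{
    size_t my_start = 0, my_size = size;

    if (omp_in_parallel())
    {
        int id = omp_get_thread_num();
        int num = omp_get_num_threads();

        my_start = (id*size)/num;
        my_size = ((id+1)*size)/num - my_start;
    }

    // same work splitting as zero(), but calling the variant under test
    my_memset_stream((int *)(buf + my_start), 0, my_size);
}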

my_memset results:

Touch:   22463.186 MB/s
Rewrite: 18797.297 MB/s

I think this shows that the problem is not in EGLIBC's memset function.

A_memset results:

Touch:   18235.732 MB/s
Rewrite: 44848.717 MB/s

my_memset_stream:

Touch:   18678.841 MB/s
Rewrite: 44627.270 MB/s

Looking at the source code of the asmlib I saw that it uses non-temporal stores when writing large chunks of memory. That's why my_memset_stream gets about the same bandwidth as Agner Fog's asmlib: non-temporal stores bypass the cache and avoid the read-for-ownership traffic that ordinary stores generate. The maximum memory bandwidth of this system is 51.2 GB/s, so A_memset and my_memset_stream reach about 85% of that maximum.
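(For reference, 51.2 GB/s is presumably the quad-channel DDR3-1600 peak: 4 channels × 1600 MT/s × 8 bytes = 51.2 GB/s, and 44.8 GB/s out of 51.2 GB/s is ≈ 87%.)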

#include <emmintrin.h> // _mm_set1_epi32, _mm_stream_si128 (SSE2)

// n is the size in bytes; both variants write n/4 ints.
void my_memset(int *s, int c, size_t n) {
    size_t i;
    for(i=0; i<n/4; i++) {
        s[i] = c;
    }
}

// Streaming variant: non-temporal 16-byte stores. Assumes s is 16-byte
// aligned and n is a multiple of 16, which holds for the buffer and the
// per-thread offsets used here.
void my_memset_stream(int *s, int c, size_t n) {
    size_t i;
    __m128i v = _mm_set1_epi32(c);

    for(i=0; i<n/4; i+=4) {
        _mm_stream_si128((__m128i*)&s[i], v);
    }
}
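Note that after a loop of non-temporal stores it is customary to issue _mm_sfence() before other threads consume the data; since only the stores themselves are timed here, that should not change the numbers meaningfully.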
Aquiver asked 25/8, 2014 at 9:58. Comments (12):
What about 4 threads without OMP_PROC_BIND? – Nightingale
@HristoIliev, I added eight runs to the end of my question without thread binding but with four threads. – Aquiver
@HristoIliev, it's stable when the threads are bound, at roughly 22 GB/s for touch and 18 GB/s for rewrite. But it's unstable when the threads are not bound (as you can see in the edit to my question). – Aquiver
I'm confused. This makes absolutely no sense given that the thread team is created in the first parallel region. It could have something to do with the timer source used by omp_get_wtime() (CLOCK_MONOTONIC in recent libgomp versions). Try running it through LIKWID or a similar profiling tool and see what memory speeds it reports, or try measuring the time in a different way. – Nightingale
Agreed. Besides thread creation, the memory pages are initialized on the first touch. There is just no reason for the same code on the same threads over the same data to be executed slower. Except perhaps some Turbo Boost effects? Otherwise it looks like a bug. – Habitant
@HristoIliev, I can't reproduce the problem on two other systems. I added some more code and tests to my question. I did reproduce the problem with a simple function I call my_memset, which I added to the question. – Aquiver
I'm still in favour of the broken-clock hypothesis. Try replacing omp_get_wtime() with clock_gettime(CLOCK_REALTIME). Also try varying the size of the memory block. – Nightingale
I tried clock_gettime. It made no difference. However, adjusting the memory block size appears to matter: 256 MB and below gives more reasonable results, but I still sometimes get touch a bit faster. I'm not sure the time is long enough for good statistics; it only takes a few milliseconds. Here are some results: Touch: 13850.581 MB/s Rewrite: 18211.078 MB/s, Touch: 16970.636 MB/s Rewrite: 17491.066 MB/s, Touch: 17114.322 MB/s Rewrite: 18293.335 MB/s, Touch: 18743.756 MB/s Rewrite: 17135.159 MB/s – Aquiver
Did you try running the program with LIKWID? I would also enable profiling of the TLB cache misses (doable with LIKWID). – Nightingale
@HristoIliev, I tried a fresh Linux install on the same hardware. It made no difference. I also tried gettimeofday; that also made no difference. I'm still playing with LIKWID. When I did a bandwidth test with LIKWID it reported around 21 GB/s. – Aquiver
What kernel does the new Linux install have? Try comparing 2.6.x versus 3.x. Also use the LIKWID marker API to divide the code into two measurement regions and then record the memory bandwidth and TLB cache misses for each region separately. – Nightingale
@HristoIliev, I have tested on Linux kernels 3.13 and 3.16. I'll do some tests with LIKWID. Thanks for pointing me to LIKWID. – Aquiver

It would appear from your numbers that your 4 bound threads are running on 2 physical cores instead of the expected 4 physical cores. Can you confirm this? It would explain the doubling of the touch rate. I'm not sure how to force a thread to a physical core when using hyper-threading on your system. (I tried adding this as a comment, but I have insufficient reputation.)
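One quick way to check where the threads actually land (a rough sketch, assuming Linux with glibc, which provides sched_getcpu()):

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>   // sched_getcpu()
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        // each thread reports the logical CPU it is currently running on
        printf("thread %d on cpu %d\n", omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}

If two of the reported CPUs map to the same core id in /proc/cpuinfo, the four threads really are sharing two physical cores.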

Copywriter answered 26/8, 2014 at 21:46. Comments (7):
The default topology for Linux with Intel processors (as far as I have seen so far) is scattered. That means in my case the first four logical processors are the physical cores and the next four are the hyper-threads. I can use GOMP_CPU_AFFINITY to set this, so GOMP_CPU_AFFINITY="0 1 2 3" should be the physical cores, or "4 6 7 8". If I want to run four threads on two cores I could do "0 4 1 5". If I do that I get rates like "Touch: 17219.149 MB/s Rewrite: 17595.210 MB/s"... let me start a new comment... – Aquiver
I have written my own binding tool which reads the APIC ID from CPUID for each thread and then binds the threads to the even values. I get the same problem. If I do `cat /proc/cpuinfo | grep "initial apicid"` it returns 0 2 4 6 1 3 5 7. The odd values are the hyper-threads, so that shows that the first four logical processors are the physical cores. – Aquiver
So I can either do OMP_PROC_BIND=true, which will bind to the physical cores, or I can do GOMP_CPU_AFFINITY="0 1 2 3". However, Windows uses a compact topology, so I would have to do GOMP_CPU_AFFINITY="0 4 6 8" to bind to each physical core on Windows. But since MSVC does not support this I do it myself by reading CPUID, so my code works on Linux and Windows. Incidentally, I don't see the doubling problem on Windows with MSVC. But then the measured bandwidth on Windows using MSVC's implementation of memset is not very good anyway. – Aquiver
To be certain, I just disabled hyper-threading in the BIOS. I still get the same problem. – Aquiver
This has nothing to do with the placement of the threads on the physical cores, as long as it is the same for both parallel regions. It simply makes no sense for the initial touch to be faster than the consecutive write to the already-mapped pages. This could only happen if part (or all) of the memory gets swapped out somewhere between the two measurements, or if the TLB misses are extremely expensive (i.e. loading a PTE into the TLB would have to be more expensive than creating the PTE). – Nightingale
@HristoIliev, I totally agree it makes no sense. I'm going to install Linux to a USB drive, boot from it, install GCC, and then run the code. If that fixes the problem then I know it's something to do with my install. If it does not, then I will try LIKWID. I have not used LIKWID before. – Aquiver
@HristoIliev, apparently, now that I have disabled hyper-threading, the error always happens (even without explicitly binding). I downloaded and compiled LIKWID. I'll let you know if I find anything. – Aquiver
