I have blocks of memory that can be quite large (larger than the L2 cache), and sometimes I have to set them all to zero. memset works well in serial code, but what about parallel code? Does anyone have experience with whether calling memset from concurrent threads actually speeds things up for large arrays? Or even with using simple OpenMP parallel for loops?
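For concreteness, the "simple OpenMP parallel for loop" the question mentions could look like the sketch below. This is my illustration, not code from the original post; compile with -fopenmp.

    #include <stddef.h>

    /* Naive variant: zero the buffer byte by byte with a parallel for loop. */
    static void zero_parallel_for(char *buf, size_t len)
    {
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < len; i++)
            buf[i] = 0;
    }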
People in HPC usually say that a single thread is usually not enough to saturate a single memory link, and the same is usually true for network links as well. Here is a quick and dirty OpenMP-enabled memsetter I wrote for you that fills 2 GiB of memory with zeros twice (a sketch is shown below). And here are the results with GCC 4.7, using different numbers of threads on different architectures (max values from several runs reported):
GCC 4.7, code compiled with -O3 -mtune=native -fopenmp:
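The original benchmark source is not reproduced in this copy of the answer; what follows is a minimal sketch of such an OpenMP memsetter under stated assumptions (2 GiB buffer, two passes, per-thread chunked memset, timing via omp_get_wtime). The exact code the author ran may differ.

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define SIZE ((size_t)2 << 30)   /* 2 GiB */

    int main(void)
    {
        char *buf = malloc(SIZE);
        if (buf == NULL) { perror("malloc"); return 1; }

        /* pass 0 = first touch (pages get faulted in), pass 1 = rewrite */
        for (int pass = 0; pass < 2; pass++) {
            double t0 = omp_get_wtime();
            #pragma omp parallel
            {
                int    nthr  = omp_get_num_threads();
                int    tid   = omp_get_thread_num();
                size_t chunk = SIZE / nthr;
                size_t start = (size_t)tid * chunk;
                size_t n     = (tid == nthr - 1) ? SIZE - start : chunk;
                memset(buf + start, 0, n);   /* each thread zeroes its own chunk */
            }
            double t1 = omp_get_wtime();
            printf("%s: %.3f MB/s\n", pass == 0 ? "Touch" : "Rewrite",
                   SIZE / (1024.0 * 1024.0) / (t1 - t0));
        }

        free(buf);
        return 0;
    }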
Quad-socket Intel Xeon X7350 - pre-Nehalem quad-core CPU with separate memory controller and Front Side Bus
single socket
threads 1st touch rewrite
1 1452.223 MB/s 3279.745 MB/s
2 1541.130 MB/s 3227.216 MB/s
3 1502.889 MB/s 3215.992 MB/s
4 1468.931 MB/s 3201.481 MB/s
(1st touch is slow as the thread team is being created from scratch and the operating system is mapping physical pages into the virtual address space reserved by malloc(3).)
One thread already saturates the memory bandwidth of a single CPU <-> NB link. (NB = North Bridge)
1 thread per socket
threads 1st touch rewrite
1 1455.603 MB/s 3273.959 MB/s
2 2824.883 MB/s 5346.416 MB/s
3 3979.515 MB/s 5301.140 MB/s
4 4128.784 MB/s 5296.082 MB/s
Two threads are necessary to saturate the full memory bandwidth of the NB <-> memory link.
Octo-socket Intel Xeon X7550 - 8-way NUMA system with octo-core CPUs (CMT disabled)
single socket
threads 1st touch rewrite
1 1469.897 MB/s 3435.087 MB/s
2 2801.953 MB/s 6527.076 MB/s
3 3805.691 MB/s 9297.412 MB/s
4 4647.067 MB/s 10816.266 MB/s
5 5159.968 MB/s 11220.991 MB/s
6 5330.690 MB/s 11227.760 MB/s
At least 5 threads are necessary in order to saturate the bandwidth of one memory link.
1 thread per socket
threads 1st touch rewrite
1 1460.012 MB/s 3436.950 MB/s
2 2928.678 MB/s 6866.857 MB/s
3 4408.359 MB/s 10301.129 MB/s
4 5859.548 MB/s 13712.755 MB/s
5 7276.209 MB/s 16940.793 MB/s
6 8760.900 MB/s 20252.937 MB/s
Bandwidth scales almost linearly with the number of threads. Based on the single-socket observations one could say that at least 40 threads distributed as 5 threads per socket would be necessary in order to saturate all of the eight memory links.
The basic problem on NUMA systems is the first-touch memory policy: memory is allocated on the NUMA node where the thread that first touches a virtual address within a given page is executing. Thread pinning (binding to specific CPU cores) is essential on such systems, as thread migration leads to remote access, which is slower. Support for pinning is available in most OpenMP runtimes: GCC with its libgomp has the GOMP_CPU_AFFINITY environment variable, Intel has the KMP_AFFINITY environment variable, etc. Also, OpenMP 4.0 introduced the vendor-neutral concept of places.
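To check that the binding actually took effect, one can print the CPU each OpenMP thread lands on. The sketch below is my own illustration and relies on the Linux/glibc-specific sched_getcpu() call:

    #define _GNU_SOURCE
    #include <omp.h>
    #include <sched.h>
    #include <stdio.h>

    /* Print which CPU each OpenMP thread is running on, e.g. to verify that
     * GOMP_CPU_AFFINITY / KMP_AFFINITY / OMP_PROC_BIND settings took effect. */
    int main(void)
    {
        #pragma omp parallel
        printf("thread %d runs on CPU %d\n", omp_get_thread_num(), sched_getcpu());
        return 0;
    }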
Edit: For completeness, here are the results of running the code with a 1 GiB array on a MacBook Air with an Intel Core i5-2557M (dual-core Sandy Bridge CPU with HT and QPI). The compiler is GCC 4.2.1 (Apple LLVM build):
threads 1st touch rewrite
1 2257.699 MB/s 7659.678 MB/s
2 3282.500 MB/s 8157.528 MB/s
3 4109.371 MB/s 8157.335 MB/s
4 4591.780 MB/s 8141.439 MB/s
Why this high speed even with a single thread? A little exploration with gdb shows that memset(buf, 0, len) gets translated by the OS X compiler to bzero(buf, len), and that an SSE4.2-enabled vectorised version by the name of bzero$VARIANT$sse42 is provided by libc.dylib and used at run time. It uses the MOVDQA instruction to zero 16 bytes of memory at once. That's why even with one thread the memory bandwidth is almost saturated. A single-threaded AVX-enabled version using VMOVDQA can zero 32 bytes at once and probably saturate the memory link.
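As an illustration of the 32-byte-wide zeroing mentioned above, the aligned 256-bit store intrinsic _mm256_store_si256 typically compiles to VMOVDQA. This sketch is my own, not part of the original answer, and assumes a 32-byte-aligned buffer, a length that is a multiple of 32, and compilation with -mavx:

    #include <immintrin.h>
    #include <stddef.h>

    /* Zero len bytes with 256-bit aligned stores; the compiler typically
     * emits VMOVDQA. Assumes buf is 32-byte aligned and len % 32 == 0. */
    static void zero_avx(char *buf, size_t len)
    {
        const __m256i zero = _mm256_setzero_si256();
        for (size_t i = 0; i < len; i += 32)
            _mm256_store_si256((__m256i *)(buf + i), zero);
    }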
The important message here is that sometimes vectorisation and multithreading are not orthogonal in bringing speed-up to the operation.
Pin the threads, e.g. with taskset and/or by setting the GOMP_CPU_AFFINITY variable. If you have hwloc installed, it provides the nifty hwloc-ls tool. Just run it like hwloc-ls --taskset and it will show you the necessary bitmask for taskset to e.g. run on a single socket. – Kyleekylen
With malloc the memory is allocated using anonymous mmap. This results in a mapping in the virtual address space of the process, but that mapping is still not backed by physical RAM frames; rather, a special kernel page of all zeros is copy-on-write mapped everywhere within the region. Therefore reading from freshly mmap-ed memory returns zeros. On first write to some address within that region a page fault occurs, and the fault handler finds a free RAM frame and maps it to the corresponding page. – Kyleekylen
One could also use mmap(2) directly to provide prefaulted memory (on Linux via MAP_POPULATE; OS X does not support prefaulting). In that case the call to mmap will be extremely slow, but there won't be any difference in memory access speed between first touch and rewrite. – Kyleekylen
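For reference, prefaulting with MAP_POPULATE as described in the comment above could look like the following sketch (Linux-specific; the buffer size is an illustrative assumption):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = (size_t)1 << 30;   /* 1 GiB, illustrative */
        /* MAP_POPULATE prefaults the pages: the mmap call itself becomes slow,
         * but the first touch later runs at rewrite speed. Linux-specific. */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }
        /* ... use buf ... */
        munmap(buf, len);
        return 0;
    }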
… mmap. I still have a lot to learn. – Furnary
With export OMP_NUM_THREADS=4 and export OMP_PROC_BIND=true I get Touch: 21191.013 MB/s, Rewrite: 18112.064 MB/s. But without binding and using eight threads I get Touch: 11830.490 MB/s, Rewrite: 17933.885 MB/s. This makes no sense to me. – Furnary
Well, there's always the L3 cache... However, it's very likely that this will be bound by main memory bandwidth already; adding more parallelism is unlikely to improve things.
memset on data out of cache will likely be bottlenecked by memory bandwidth. – Ermelindaermengarde
memset in parallel on a NUMA machine (and all MP post-Core2 Intel systems as well as all MP and even some UP AMD systems are NUMA) could be your single most hard-to-understand-why performance killer, unless later on the same threads are going to access only those parts of the array that they have personally zeroed. – Kyleekylen
memset() is SIMD-enabled in most libc implementations and is already pushing the memory bandwidth to its peak. – Kyleekylen