Scalable allocation of large (8MB) memory regions on NUMA architectures

We are currently using a TBB flow graph in which a) a parallel filter processes an array (in parallel with offsets) and puts the processed results into an intermediate vector (allocated on the heap; mostly the vector will grow up to 8MB). These vectors are then passed to nodes which postprocess the results based on their characteristics (determined in a)). Because of synchronized resources, there can only be one such node per characteristic. The prototype we wrote works well on UMA architectures (tested on single-CPU Ivy Bridge and Sandy Bridge machines). However, the application does not scale on our NUMA architecture (4-CPU Nehalem-EX). We pinned the problem down to memory allocation and created a minimal example in which a parallel pipeline just allocates memory from the heap (via malloc of an 8MB chunk, followed by a memset of the 8MB region; similar to what the initial prototype would do) up to a certain amount of memory. Our findings are:

  • On a UMA architecture the application scales up linearly with the number of threads used by the pipeline (set via task_scheduler_init)

  • On the NUMA architecture when we pin the application to one socket (using numactl) we see the same linear scale-up

  • On the NUMA architecture, when we use more than one socket, the time our application runs increases with the number of sockets (negative linear scale-"up")

For us this smells like heap contention. What we have tried so far is to substitute Intel's TBB scalable allocator for the glibc allocator. However, the initial performance on a single socket is worse than with glibc, and on multiple sockets performance does not get worse, but it does not get any better either. We got the same effect with tcmalloc, the Hoard allocator, and TBB's cache-aligned allocator.

The question is whether anyone has experienced similar issues. Stack allocation is not an option for us, as we want to keep the heap-allocated vectors even after the pipeline has run. How can one efficiently heap-allocate memory regions in the size of MBs on NUMA architectures from multiple threads? We would really like to keep a dynamic allocation approach instead of preallocating memory and managing it within the application.

I attached perf stats for the various executions with numactl. Interleaving/localalloc has no effect whatsoever (the QPI bus is not the bottleneck; we verified that with PCM, QPI link load is at 1%). I also added a chart depicting the results for glibc, tbbmalloc, and tcmalloc.

perf stat bin/prototype 598.867

Performance counter stats for 'bin/prototype':

  12965,118733 task-clock                #    7,779 CPUs utilized          
        10.973 context-switches          #    0,846 K/sec                  
         1.045 CPU-migrations            #    0,081 K/sec                  
       284.210 page-faults               #    0,022 M/sec                  
17.266.521.878 cycles                    #    1,332 GHz                     [82,84%]
15.286.104.871 stalled-cycles-frontend   #   88,53% frontend cycles idle    [82,84%]
10.719.958.132 stalled-cycles-backend    #   62,09% backend  cycles idle    [67,65%]
 3.744.397.009 instructions              #    0,22  insns per cycle        
                                         #    4,08  stalled cycles per insn [84,40%]
   745.386.453 branches                  #   57,492 M/sec                   [83,50%]
    26.058.804 branch-misses             #    3,50% of all branches         [83,33%]

   1,666595682 seconds time elapsed

perf stat numactl --cpunodebind=0 bin/prototype 272.614

Performance counter stats for 'numactl --cpunodebind=0 bin/prototype':

   3887,450198 task-clock                #    3,345 CPUs utilized          
         2.360 context-switches          #    0,607 K/sec                  
           208 CPU-migrations            #    0,054 K/sec                  
       282.794 page-faults               #    0,073 M/sec                  
 8.472.475.622 cycles                    #    2,179 GHz                     [83,66%]
 7.405.805.964 stalled-cycles-frontend   #   87,41% frontend cycles idle    [83,80%]
 6.380.684.207 stalled-cycles-backend    #   75,31% backend  cycles idle    [66,90%]
 2.170.702.546 instructions              #    0,26  insns per cycle        
                                         #    3,41  stalled cycles per insn [85,07%]
   430.561.957 branches                  #  110,757 M/sec                   [82,72%]
    16.758.653 branch-misses             #    3,89% of all branches         [83,06%]

   1,162185180 seconds time elapsed

perf stat numactl --cpunodebind=0-1 bin/prototype 356.726

Performance counter stats for 'numactl --cpunodebind=0-1 bin/prototype':

   6127,077466 task-clock                #    4,648 CPUs utilized          
         4.926 context-switches          #    0,804 K/sec                  
           469 CPU-migrations            #    0,077 K/sec                  
       283.291 page-faults               #    0,046 M/sec                  
10.217.787.787 cycles                    #    1,668 GHz                     [82,26%]
 8.944.310.671 stalled-cycles-frontend   #   87,54% frontend cycles idle    [82,54%]
 7.077.541.651 stalled-cycles-backend    #   69,27% backend  cycles idle    [68,59%]
 2.394.846.569 instructions              #    0,23  insns per cycle        
                                         #    3,73  stalled cycles per insn [84,96%]
   471.191.796 branches                  #   76,903 M/sec                   [83,73%]
    19.007.439 branch-misses             #    4,03% of all branches         [83,03%]

   1,318087487 seconds time elapsed

perf stat numactl --cpunodebind=0-2 bin/prototype 472.794

Performance counter stats for 'numactl --cpunodebind=0-2 bin/prototype':

   9671,244269 task-clock                #    6,490 CPUs utilized          
         7.698 context-switches          #    0,796 K/sec                  
           716 CPU-migrations            #    0,074 K/sec                  
       283.933 page-faults               #    0,029 M/sec                  
14.050.655.421 cycles                    #    1,453 GHz                     [83,16%]
12.498.787.039 stalled-cycles-frontend   #   88,96% frontend cycles idle    [83,08%]
 9.386.588.858 stalled-cycles-backend    #   66,81% backend  cycles idle    [66,25%]
 2.834.408.038 instructions              #    0,20  insns per cycle        
                                         #    4,41  stalled cycles per insn [83,44%]
   570.440.458 branches                  #   58,983 M/sec                   [83,72%]
    22.158.938 branch-misses             #    3,88% of all branches         [83,92%]

   1,490160954 seconds time elapsed

Minimal example: compiled with g++-4.7 -std=c++11 -O3 -march=native; executed with numactl --cpunodebind=0 ... numactl --cpunodebind=0-3. With CPU binding we have the following finding: 1 CPU (speed x), 2 CPUs (speed ~ x/2), 3 CPUs (speed ~ x/3) [speed = the higher the better]. So what we see is that performance worsens with the number of CPUs. Memory binding, interleaving (--interleave=all) and --localalloc have no effect here (we monitored all QPI links and link load was below 1% for each link).

#include <tbb/pipeline.h>
#include <tbb/task_scheduler_init.h>
#include <chrono>
#include <cstring>   // memset
#include <stdint.h>
#include <iostream>
#include <fcntl.h>
#include <sstream>
#include <sys/mman.h>
#include <tbb/scalable_allocator.h>
#include <tuple>

namespace {
// 8 MB
size_t chunkSize = 8 * 1024 * 1024;
// Number of threads (0 = automatic)
uint64_t threads=0;
}

using namespace std;
typedef chrono::duration<double, milli> milliseconds;

int main(int /* argc */, char** /* argv */)
{
   chrono::time_point<chrono::high_resolution_clock> startLoadTime = chrono::high_resolution_clock::now();
   tbb::task_scheduler_init init(threads==0?tbb::task_scheduler_init::automatic:static_cast<int>(threads));
   const uint64_t chunks=128;
   uint64_t nextChunk=0;
   // Pipeline with at most 128 tokens in flight: a serial input filter emits
   // chunk indices until `chunks` have been produced; a parallel filter then
   // allocates an 8MB buffer per chunk and touches it with memset.
   tbb::parallel_pipeline(128,tbb::make_filter<void,uint64_t>(
         tbb::filter::serial,[&](tbb::flow_control& fc)->uint64_t
   {
      uint64_t chunk=nextChunk++;
      if(chunk==chunks)
         fc.stop();

      return chunk;
   }) & tbb::make_filter<uint64_t,void>(
         tbb::filter::parallel,[&](uint64_t /* item */)->void
   {
        // Allocate and touch 8MB; the buffer is never freed in this minimal example.
        void* buffer=scalable_malloc(chunkSize);
        memset(buffer,0,chunkSize);
   }));

   chrono::time_point<chrono::high_resolution_clock> endLoadTime = chrono::high_resolution_clock::now();
   milliseconds loadTime = endLoadTime - startLoadTime;
   cout << loadTime.count()<<endl;
}

Discussion on Intel TBB forums: http://software.intel.com/en-us/forums/topic/346334

Eagleeyed answered 10/12, 2012 at 15:14 Comment(18)
What are the criteria for postprocessing? Do all the threads modify in place, or could they be given copies of the source data and the vector? On multi-socket machines it is often much faster/easier to take a shared-memory/multiprocessing approach rather than threading if there is contention. With multiple processes you could set the CPU affinity to keep contention to a minimum. – Orthorhombic
The issue here is independent of the source array (and even of the postprocessing; it's just to give the context for why we need heap allocation). I created a minimal example where 8MB chunks are allocated on the heap up to a certain size (in parallel). What we see is that with one CPU it takes x ms, with 2 CPUs it takes roughly 2*x ms, .... So it does not scale with the number of sockets on NUMA architectures. It does, however, scale on one socket with the number of threads. – Eagleeyed
Could you provide a minimal pseudocode example of how you allocate this memory, fill it up with data and then process and postprocess it? Which QPI links have you been monitoring? A 4-socket system would be fully connected with 6 bi-directional links. Also, you have many CPU migrations - bind each thread to a different CPU core and also observe the amount of remote (in the NUMA sense) memory accesses. – Quintus
I added the code to reproduce the issue. We monitored all QPI links; link load was below 1% for all of them. I haven't fully verified it, but I think thread binding has no effect here. This is more an issue of heap contention, or rather a problem with the synchronization needed to allocate big (here 8MiB) memory regions on the heap. – Eagleeyed
I don't know of any allocators which are optimized for allocating such large chunks concurrently. At this point, they might be taking them from the system on demand, in which case I'd imagine they've got a lock around an sbrk call, which is why you're not seeing any concurrency. Interesting question, looking forward to seeing the answers. +1. – Spondee
Have you tried allocating a large block of memory up front and having each NUMA node use an offset into it? It would remove any possible locking on sbrk. I'm not sure how *nix allocation works -- if it doesn't materialize those pages until you first use them, this could work. – Spondee
8 MiB, if allocated in a single malloc/new call, are anonymously mmap()-ed and not taken from the data segment (a.k.a. the sbrk heap). – Quintus
@CoryNelson: "We'd really like to keep a dynamic allocation approach instead of preallocating memory and managing it within the application." -- Just to give you some numbers: I preallocated memory using a tbb::fixed_pool and malloc'ed from there. This approach indeed scales on NUMA. However, on Linux (speaking of kernel 2.6+, haven't tried earlier) one needs to preallocate and memset to actually take hold of the memory. For me this means preallocating at least 2GB, and this takes a whole lot of time. – Eagleeyed
I am unable to reproduce your scaling behaviour on a 4+4-socket Nehalem-EX system. Sometimes I even get better performance with multiple NUMA nodes. g++ 4.7.1, OS is 64-bit Scientific Linux 6.3, kernel 2.6.32-279.14.1.el6.x86_64 (the one that comes with the distribution, I believe). – Quintus
On the contrary, with the usual malloc from glibc, I get: ~300 ms with 1 NUMA node, ~180 ms with 2 NUMA nodes, ~180 ms with 3 NUMA nodes, ~210 ms with 4 NUMA nodes, ~360 ms with 8 NUMA nodes. – Quintus
@HristoIliev: This is interesting. We also have a 4-socket Nehalem-EX system; kernel 3.5.0. More specifically, 4 X7560 CPUs, 1TB main memory (256GiB per socket) and the 5520/5500/X58 chipset. I get 178.021 ms with 1 NUMA node, 338.373 ms with 2 NUMA nodes, 455.898 ms with 3 NUMA nodes, 561.749 ms with 4 NUMA nodes. I made multiple measurements, all with similar results. This is with the TBB scalable_allocator. I'll just do another round with the glibc allocator. – Eagleeyed
Just verified; similar behavior with glibc malloc for me. – Eagleeyed
~420 ms on a 4x4-socket Nehalem-EX system (4 boards connected with a Bull Coherent Switch XQPI router). Same processors (X7550) as the 8-socket system. Note that perf is messing badly with the run times. Maybe you should run your code under some kind of a thread profiler/analyser. – Quintus
The numbers I posted were taken without perf. I have already profiled with different tools (perf, VTune Amplifier, ...). The performance/parallelization bottleneck is definitely the multithreaded malloc calls. Another theory: how much main memory does your system configuration have? – Eagleeyed
Both systems have 256 GiB RAM distributed evenly between all NUMA nodes. I can test on systems with up to 2 TiB RAM, but they are on a tight schedule and it would take time. – Quintus
It'd be great to see numbers from the machines with more RAM. I'm limited to the NUMA machine mentioned above. My hypothesis is that the amount of RAM per socket influences the parallel dynamic allocation performance. However, that is just a guess so far ... – Eagleeyed
You are finding that the amount of time it takes for your program to run ends up being dominated by the amount of memory your program allocates, and adding more threads doing the allocation does not seem to improve performance? Look at this: download.intel.com/technology/itj/2007/v11i4/5-foundations/… via #658283 -- large allocations bypass the scalable allocator in that lib. I do not see where you indicate the process affinity of the allocated memory? – Larner
@Yakk: true, all scalable allocators I know of bypass their implementations and call mmap for large allocations. Thus they are all affected by the scalability issue, which is related to the page table spin lock on NUMA architectures. – Eagleeyed

Second Update (closing the question):

Just profiled the example application again with a 3.10 kernel.

Results for parallel allocation and memsetting of 16GB of data:

small pages:

  • 1 socket: 3112.29 ms
  • 2 socket: 2965.32 ms
  • 3 socket: 3000.72 ms
  • 4 socket: 3211.54 ms

huge pages:

  • 1 socket: 3086.77 ms
  • 2 socket: 1568.43 ms
  • 3 socket: 1084.45 ms
  • 4 socket: 852.697 ms

The scalable allocation problem seems to be fixed now - at least for huge pages.
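How exactly huge pages were obtained for these runs is not shown above; as an illustration only, the sketch below shows two common ways to request them on Linux (explicit huge pages via MAP_HUGETLB, which need pages reserved in /proc/sys/vm/nr_hugepages, and transparent huge pages via madvise). The helper names are made up for the illustration; this is an assumption about the setup, not the exact code used for the measurements.

// Hedged sketch: two ways to back an allocation with 2MB pages on Linux.
#include <sys/mman.h>
#include <cstddef>

// Option 1: explicit huge pages. Requires pre-reserved pages
// (/proc/sys/vm/nr_hugepages) and a size that is a multiple of the huge page size.
void* allocateHugeExplicit(std::size_t bytes)
{
   void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
   return p == MAP_FAILED ? nullptr : p;
}

// Option 2: transparent huge pages. The kernel may back the region with 2MB
// pages once it is touched; this needs CONFIG_TRANSPARENT_HUGEPAGE.
void* allocateHugeTransparent(std::size_t bytes)
{
   void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
   if (p == MAP_FAILED)
      return nullptr;
   madvise(p, bytes, MADV_HUGEPAGE);
   return p;
}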

Eagleeyed answered 5/7, 2013 at 10:57 Comment(0)

A short update and a partial answer to the described issue: the call to malloc or scalable_malloc is not the bottleneck; the bottleneck is rather the page faults triggered by memsetting the allocated memory. There is no difference between glibc malloc and other scalable allocators such as Intel TBB's scalable_malloc: for allocations larger than a certain threshold (usually 1MB if nothing is freed; in glibc it can be tuned with mallopt), the memory will be allocated by an anonymous mmap. Initially, all pages of the mapping point to a kernel-internal page that is pre-zeroed and read-only. When we memset the memory, this triggers an exception (mind that the kernel page is read-only) and a page fault. A new page is zeroed at this point. Small pages are 4KB, so this happens 2048 times for the 8MB buffer we allocate and write. What I measured is that these page faults are not so expensive on single-socket machines but get more and more expensive on NUMA machines with multiple CPUs.
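To make the distinction concrete, here is a minimal sketch (not the benchmark used for the numbers above) that times the three phases separately with plain std::thread and plain malloc (the allocator choice did not matter in my measurements): the malloc calls, the first touch via memset, and a rewrite of the same buffers. On the machines described above one would expect the first-touch phase, not the malloc phase, to dominate and to degrade as more sockets are used. The thread and chunk counts are arbitrary assumptions.

// Compile e.g. with: g++ -std=c++11 -O3 -pthread firsttouch.cpp
#include <chrono>
#include <cstdlib>
#include <cstring>
#include <functional>
#include <iostream>
#include <thread>
#include <vector>

int main()
{
   const size_t chunkSize = 8 * 1024 * 1024;              // 8MB per chunk, as above
   const unsigned numThreads = std::thread::hardware_concurrency();
   const size_t chunksPerThread = 16;                     // arbitrary; adjust to taste

   std::vector<std::vector<void*> > buffers(numThreads);

   // Run work(threadIndex) on every thread and return the wall time in ms.
   auto timeParallel = [&](std::function<void(unsigned)> work) -> double {
      auto start = std::chrono::high_resolution_clock::now();
      std::vector<std::thread> pool;
      for (unsigned t = 0; t < numThreads; ++t)
         pool.emplace_back(work, t);
      for (size_t i = 0; i < pool.size(); ++i)
         pool[i].join();
      std::chrono::duration<double, std::milli> ms =
         std::chrono::high_resolution_clock::now() - start;
      return ms.count();
   };

   // 1) malloc only: cheap, the kernel merely sets up anonymous mappings.
   double allocMs = timeParallel([&](unsigned t) {
      for (size_t i = 0; i < chunksPerThread; ++i)
         buffers[t].push_back(malloc(chunkSize));
   });

   // 2) First touch: every 4KB page raises a page fault and is zeroed now.
   double touchMs = timeParallel([&](unsigned t) {
      for (size_t i = 0; i < buffers[t].size(); ++i)
         memset(buffers[t][i], 0, chunkSize);
   });

   // 3) Rewrite: pages are already mapped, no faults, only memory bandwidth.
   double rewriteMs = timeParallel([&](unsigned t) {
      for (size_t i = 0; i < buffers[t].size(); ++i)
         memset(buffers[t][i], 1, chunkSize);
   });

   std::cout << "malloc " << allocMs << " ms, first touch " << touchMs
             << " ms, rewrite " << rewriteMs << " ms" << std::endl;
}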

Solutions I have come up with so far:

  • Use huge pages: helps but only delays the problem

  • Use a preallocated and pre-faulted (either via memset or mmap + MAP_POPULATE) memory region (a memory pool) and allocate from there: helps, but one does not necessarily want to do that (see the sketch after this list)

  • Address this scalability issue in the Linux kernel
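For the second option, here is a minimal sketch of what such a pre-faulted pool could look like, assuming a simple bump-pointer scheme with no free(); a real pool (e.g. tbb::fixed_pool) would also track and reuse freed chunks. The class name and pool size are made up for the illustration.

// Hedged sketch of a pre-faulted memory pool based on mmap(MAP_POPULATE).
#include <atomic>
#include <cstddef>
#include <sys/mman.h>

class PrefaultedPool {
public:
   explicit PrefaultedPool(std::size_t bytes)
      : base_(nullptr), size_(bytes), offset_(0)
   {
      // MAP_POPULATE makes the kernel fault in all pages up front, so the
      // page-fault cost is paid once here instead of inside the pipeline.
      void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
      if (p != MAP_FAILED)
         base_ = static_cast<char*>(p);
   }

   ~PrefaultedPool() {
      if (base_) munmap(base_, size_);
   }

   // Thread-safe bump allocation; returns nullptr when the pool is exhausted.
   void* allocate(std::size_t bytes) {
      std::size_t old = offset_.fetch_add(bytes);
      if (!base_ || old + bytes > size_)
         return nullptr;
      return base_ + old;
   }

private:
   char* base_;
   std::size_t size_;
   std::atomic<std::size_t> offset_;
};

// Possible usage inside the parallel filter, instead of scalable_malloc(chunkSize):
//    static PrefaultedPool pool(2ull * 1024 * 1024 * 1024);  // 2GB, arbitrary size
//    void* buffer = pool.allocate(chunkSize);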

Eagleeyed answered 18/12, 2012 at 13:10 Comment(15)
Sorry, I'm still unable to test your code on our large-memory machines, as there are only three of them and a lot of users in the queue. But I wrote a simple OpenMP program that allocates chunks of 8 MiB and memsets them to zero. It does that twice and measures the ratio of the time it takes to touch the memory and to rewrite it. With a single core-bound thread the ratio is 2.1x. With 8 threads on a single socket it is 1.3x (bandwidth limited). With 16 threads (2 sockets), the ratio is 2.1x. With 32 threads (4 sockets), the ratio is 3.6x. With 64 threads (2 boards), it is 5.1x. – Quintus
Forgot to mention that the rewrite speed scales almost linearly with the number of CPU sockets. It gets worse, really, but... not that much worse. These are two separate 4-socket system boards (out of four) connected via a special QPI bus router with relatively long external cables (compared to the distances on a single system board) that run behind the systems. Remote memory accesses are really expensive and slow (hence all synchronisation primitives too), and yet the first touch with 64 threads is only 4.6x slower compared to the 8-thread case. – Quintus
Just to make sure I get your test right: you allocated memory, touched it (pre-faulting) and then wrote it; then you measured the ratio of the time it took to touch the memory and to write it. So to speak, on a single core-bound thread it took 2.1x longer to touch it the first time than it took to write it afterwards. Is this understanding correct? – Eagleeyed
Yes, the first part is malloc() + memset() run in a loop in each thread. The second part is only memset() run in a loop, with each thread only rewriting the blocks that it had previously allocated. So it actually measures the "allocate + touch" / "rewrite" time ratio, but I can easily change it to perform a separate allocation step. – Quintus
Thanks. This already helps me a lot, as you were able to reproduce what I measured. In your case performance stagnates with 2 sockets and then decreases; I already measured the decrease with just 2 sockets, but that might be due to different hardware configurations. So to conclude, page fault handling (touching for the first time) does not scale well enough for a single process (I ran experiments, and for multiple processes it does). The bottleneck seems not to be the zeroing of a page when it is touched for the first time (writing already-touched pages scales) but the synchronization on the process's VM page table. – Eagleeyed
Note that I also ran a test case with 8 threads, one per socket, and it produced a higher ratio, i.e. the first touch in this case is more expensive than in the 8-threads-on-a-single-socket case. I strongly suspect that the process page table resides on a single NUMA node only and that this further worsens things, as any TLB miss leads to remote memory fetches. Also, the first CoW page faults result in expensive inter-socket TLB shootdowns. – Quintus
Yes, indeed, I measured the same effect. I assume it's the process page table's spin_lock (not 100% sure yet). There seems to be some work going on related to hierarchical scalable spin_locks for scalable page fault handling. – Eagleeyed
Have you tried calloc instead of malloc + memset? I could easily imagine a system whereby a "page that is all zero" allocation doesn't actually require touching the memory (it would be zeroed when first touched). See: #2688966 – Larner
Yes, it makes no difference if you use calloc instead of malloc + memset --- it's a synchronization scalability problem of page exceptions (the page table spinlock) that shows its severity on NUMA systems. – Eagleeyed
That's right, initializing lots of RAM takes lots of time. Can you structure your algorithm not to require initialization? Also, be sure that different threads never try to touch the same cache line (64-byte segment), as they can slow each other down. With a bit of mechanical sympathy I think you'll get things much faster. – Truncate
@doug65536: see the post you commented on. Huge pages help but don't solve the problem. The page table spinlock is the bottleneck here. – Eagleeyed
@RandallCook: of course we could initialize memory up front and not pay the cost during the algorithm, but that does not solve the problem itself --- there is a point where you have to pay for initialization. We tried cache alignment to avoid cache thrashing. – Eagleeyed
@Eagleeyed Won't the hold time be 1/1024th of what it would be with normal pages? – Authority
@Authority Well, one would expect it to be 1/512th (4kB vs. 2MB) with huge pages, but it is not. What we see is cache thrashing, which happens when a page exception is raised (when memory is first touched, i.e. written and zeroed by the kernel). On NUMA systems, cache thrashing has orders-of-magnitude worse effects than on SMP machines. – Eagleeyed
@Authority Oh sorry, I misread your question. I hadn't realized you had already tried large pages. – Authority
