We are currently using a TBB flow graph in which (a) a parallel filter processes an array (in parallel, with offsets) and puts the processed results into an intermediate vector (allocated on the heap; mostly the vector grows to up to 8 MB), and (b) these vectors are then passed to nodes that postprocess the results based on their characteristics (determined in (a)). Because of synchronized resources, there can be only one such node per characteristic. The prototype we wrote works well on UMA architectures (tested on single-CPU Ivy Bridge and Sandy Bridge machines). However, the application does not scale on our NUMA architecture (4-CPU Nehalem-EX). We pinned the problem down to memory allocation and created a minimal example: a parallel pipeline that just allocates memory from the heap (via malloc of an 8 MB chunk, then a memset of the 8 MB region; similar to what the initial prototype would do) up to a certain amount of memory.
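For context, a rough sketch of the flow-graph structure from (a)/(b) above, assuming tbb::flow; the node names, element types, and the single postprocessing node shown here are illustrative assumptions, not our actual prototype:

    #include <tbb/flow_graph.h>
    #include <cstdint>
    #include <vector>

    int main() {
        using namespace tbb::flow;
        graph g;

        // (a) Parallel processing stage: each offset is turned into a
        // heap-allocated intermediate vector (which in the real prototype
        // grows to up to 8 MB).
        function_node<uint64_t, std::vector<char>*> process(g, unlimited,
            [](uint64_t /* offset */) {
                return new std::vector<char>(8 * 1024 * 1024);
            });

        // (b) Postprocessing stage: serial, because it touches a synchronized
        // resource. The real prototype has one such node per characteristic.
        function_node<std::vector<char>*, continue_msg> postprocess(g, serial,
            [](std::vector<char>* /* result */) {
                // ... postprocess and keep the vector around ...
                return continue_msg();
            });

        make_edge(process, postprocess);
        for (uint64_t offset = 0; offset < 128; ++offset)
            process.try_put(offset);
        g.wait_for_all();
    }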
Our findings are:
On a UMA architecture, the application scales linearly with the number of threads used by the pipeline (set via task_scheduler_init).
On the NUMA architecture, when we pin the application to one socket (using numactl), we see the same linear scale-up.
On the NUMA architecture, when we use more than one socket, the time our application runs increases with the number of sockets (negative linear scale-"up").
For us this smells like heap contention. What we have tried so far is substituting Intel's TBB scalable allocator for the glibc allocator. However, initial performance on a single socket is worse than with glibc, and on multiple sockets performance does not get worse, but it does not get any better either. We saw the same effect with tcmalloc, the Hoard allocator, and TBB's cache-aligned allocator.
The question is whether anyone has experienced similar issues. Stack allocation is not an option for us because we want to keep the heap-allocated vectors even after the pipeline has run. How can one efficiently heap-allocate memory regions on the order of megabytes on NUMA architectures from multiple threads? We would really like to keep a dynamic allocation approach instead of preallocating memory and managing it within the application.
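For concreteness, a sketch of the kind of dynamic, node-local allocation this is about, using libnuma (illustration only, not something the prototype uses; link with -lnuma):

    #include <numa.h>      // numa_available, numa_alloc_local, numa_free
    #include <cstring>
    #include <cstdlib>

    int main() {
        if (numa_available() < 0)
            return EXIT_FAILURE;               // no NUMA support on this system

        const size_t chunkSize = 8 * 1024 * 1024;
        // Allocate the 8 MB chunk on the NUMA node of the calling thread and
        // touch it there, so the pages stay local to that thread.
        void* buffer = numa_alloc_local(chunkSize);
        std::memset(buffer, 0, chunkSize);
        numa_free(buffer, chunkSize);
        return EXIT_SUCCESS;
    }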
I attached perf stats for the various executions with numactl. Interleaving/localalloc has no effect whatsoever (the QPI bus is not the bottleneck; we verified this with PCM: QPI link load is at 1%). I also added a chart depicting the results for glibc, tbbmalloc, and tcmalloc.
perf stat bin/prototype 598.867
Performance counter stats for 'bin/prototype':
12965,118733 task-clock # 7,779 CPUs utilized
10.973 context-switches # 0,846 K/sec
1.045 CPU-migrations # 0,081 K/sec
284.210 page-faults # 0,022 M/sec
17.266.521.878 cycles # 1,332 GHz [82,84%]
15.286.104.871 stalled-cycles-frontend # 88,53% frontend cycles idle [82,84%]
10.719.958.132 stalled-cycles-backend # 62,09% backend cycles idle [67,65%]
3.744.397.009 instructions # 0,22 insns per cycle
# 4,08 stalled cycles per insn [84,40%]
745.386.453 branches # 57,492 M/sec [83,50%]
26.058.804 branch-misses # 3,50% of all branches [83,33%]
1,666595682 seconds time elapsed
perf stat numactl --cpunodebind=0 bin/prototype 272.614
Performance counter stats for 'numactl --cpunodebind=0 bin/prototype':
3887,450198 task-clock # 3,345 CPUs utilized
2.360 context-switches # 0,607 K/sec
208 CPU-migrations # 0,054 K/sec
282.794 page-faults # 0,073 M/sec
8.472.475.622 cycles # 2,179 GHz [83,66%]
7.405.805.964 stalled-cycles-frontend # 87,41% frontend cycles idle [83,80%]
6.380.684.207 stalled-cycles-backend # 75,31% backend cycles idle [66,90%]
2.170.702.546 instructions # 0,26 insns per cycle
# 3,41 stalled cycles per insn [85,07%]
430.561.957 branches # 110,757 M/sec [82,72%]
16.758.653 branch-misses # 3,89% of all branches [83,06%]
1,162185180 seconds time elapsed
perf stat numactl --cpunodebind=0-1 bin/prototype 356.726
Performance counter stats for 'numactl --cpunodebind=0-1 bin/prototype':
6127,077466 task-clock # 4,648 CPUs utilized
4.926 context-switches # 0,804 K/sec
469 CPU-migrations # 0,077 K/sec
283.291 page-faults # 0,046 M/sec
10.217.787.787 cycles # 1,668 GHz [82,26%]
8.944.310.671 stalled-cycles-frontend # 87,54% frontend cycles idle [82,54%]
7.077.541.651 stalled-cycles-backend # 69,27% backend cycles idle [68,59%]
2.394.846.569 instructions # 0,23 insns per cycle
# 3,73 stalled cycles per insn [84,96%]
471.191.796 branches # 76,903 M/sec [83,73%]
19.007.439 branch-misses # 4,03% of all branches [83,03%]
1,318087487 seconds time elapsed
perf stat numactl --cpunodebind=0-2 bin/prototype 472.794
Performance counter stats for 'numactl --cpunodebind=0-2 bin/prototype':
9671,244269 task-clock # 6,490 CPUs utilized
7.698 context-switches # 0,796 K/sec
716 CPU-migrations # 0,074 K/sec
283.933 page-faults # 0,029 M/sec
14.050.655.421 cycles # 1,453 GHz [83,16%]
12.498.787.039 stalled-cycles-frontend # 88,96% frontend cycles idle [83,08%]
9.386.588.858 stalled-cycles-backend # 66,81% backend cycles idle [66,25%]
2.834.408.038 instructions # 0,20 insns per cycle
# 4,41 stalled cycles per insn [83,44%]
570.440.458 branches # 58,983 M/sec [83,72%]
22.158.938 branch-misses # 3,88% of all branches [83,92%]
1,490160954 seconds time elapsed
Minimal example: compiled with g++-4.7 -std=c++11 -O3 -march=native; executed with numactl --cpunodebind=0 ... numactl --cpunodebind=0-3. With CPU binding we have the following finding: 1 CPU (speed x), 2 CPUs (speed ~ x/2), 3 CPUs (speed ~ x/3) [speed = the higher the better]. So what we see is that performance worsens with the number of CPUs. Memory binding, interleaving (--interleave=all), and --localalloc have no effect here (we monitored all QPI links and the load on each link was below 1%).
#include <tbb/pipeline.h>
#include <tbb/task_scheduler_init.h>
#include <tbb/scalable_allocator.h>
#include <chrono>
#include <cstring>   // memset
#include <stdint.h>
#include <iostream>
#include <fcntl.h>
#include <sstream>
#include <sys/mman.h>
#include <tuple>

namespace {
// Chunk size: 8 MB
size_t chunkSize = 8 * 1024 * 1024;
// Number of threads (0 = automatic)
uint64_t threads = 0;
}

using namespace std;

typedef chrono::duration<double, milli> milliseconds;

int main(int /* argc */, char** /* argv */)
{
    chrono::time_point<chrono::high_resolution_clock> startLoadTime = chrono::high_resolution_clock::now();
    tbb::task_scheduler_init init(threads == 0 ? tbb::task_scheduler_init::automatic : threads);
    const uint64_t chunks = 128;
    uint64_t nextChunk = 0;
    tbb::parallel_pipeline(128,
        // Serial input filter: emits chunk indices until all chunks are issued.
        tbb::make_filter<void, uint64_t>(
            tbb::filter::serial, [&](tbb::flow_control& fc) -> uint64_t
            {
                uint64_t chunk = nextChunk++;
                if (chunk == chunks)
                    fc.stop();
                return chunk;
            })
        &
        // Parallel filter: allocate an 8 MB chunk and touch every page.
        tbb::make_filter<uint64_t, void>(
            tbb::filter::parallel, [&](uint64_t /* item */) -> void
            {
                void* buffer = scalable_malloc(chunkSize);
                memset(buffer, 0, chunkSize);
            }));
    chrono::time_point<chrono::high_resolution_clock> endLoadTime = chrono::high_resolution_clock::now();
    milliseconds loadTime = endLoadTime - startLoadTime;
    cout << loadTime.count() << endl;
}
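For comparison, the allocation stage could bypass the user-space allocator entirely and request each chunk from the kernel via anonymous mmap. The helper below is a hypothetical drop-in for the scalable_malloc/memset pair in the parallel filter, not part of the example above:

    #include <sys/mman.h>
    #include <cstring>
    #include <cstddef>

    // Map an anonymous 8 MB region and touch it, so the pages are faulted in
    // (and placed, under the first-touch policy) by the calling thread and no
    // user-space allocator is involved at all.
    void* allocateAndTouchChunk(std::size_t chunkSize) {
        void* buffer = mmap(nullptr, chunkSize, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buffer == MAP_FAILED)
            return nullptr;
        std::memset(buffer, 0, chunkSize);
        return buffer;
    }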
Discussion on Intel TBB forums: http://software.intel.com/en-us/forums/topic/346334
Comments:
[…] sbrk call which is why you're not seeing any concurrency. Interesting question, looking forward to seeing the answers. +1. – Spondee
[…] sbrk. I'm not sure how *nix allocation works -- if it doesn't materialize those pages until you first use them, this could work. – Spondee
[…] malloc/new call, are anonymously mmap()-ed and not taken from the data segment (a.k.a. the sbrk heap). – Quintus
g++ 4.7.1, OS is 64-bit Scientific Linux 6.3, kernel 2.6.32-279.14.1.el6.x86_64 (the one that comes with the distribution, I believe). – Quintus
malloc from glibc, I get: ~300 ms with 1 NUMA node, ~180 ms with 2 NUMA nodes, ~180 ms with 3 NUMA nodes, ~210 ms with 4 NUMA nodes, ~360 ms with 8 NUMA nodes. – Quintus
perf is messing badly with the run times. Maybe you should run your code under some kind of a thread profiler/analyser. – Quintus
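Following up on the sbrk vs. mmap discussion in the comments: with glibc, one can pin the threshold above which requests are served by anonymous mmap rather than from the sbrk heap. A minimal, glibc-specific sketch (not part of the benchmark above):

    #include <malloc.h>    // mallopt, M_MMAP_THRESHOLD (glibc-specific)
    #include <cstring>
    #include <cstdlib>

    int main() {
        // 8 MB requests are normally served via anonymous mmap anyway; pinning
        // the threshold just stops glibc from adjusting it dynamically, so the
        // allocator comparison stays deterministic.
        mallopt(M_MMAP_THRESHOLD, 1024 * 1024);

        const size_t chunkSize = 8 * 1024 * 1024;
        void* buffer = std::malloc(chunkSize);
        std::memset(buffer, 0, chunkSize);
        std::free(buffer);
        return 0;
    }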