I'm working on a legacy application that was originally developed for multicore, single-socket systems. To exploit multicore processing, OpenMP and PPL are used. A new requirement is to run the software on systems with more than one NUMA node. The target OS is Windows 7 x64.
I performed several measurements and noticed that execution time was best when the application was bound to a single NUMA node, thereby leaving an entire processor unused. Many parts of the application run data-parallel algorithms where, for example, every element of one vector is processed in parallel and the result is written to another vector, as in the following example:
std::vector<int> data;
std::vector<int> res;
// initialize data and resize res to data.size()
#pragma omp parallel for
for (int i = 0; i < (int) data.size(); ++i)
{
    res[i] = doExtremeComplexStuff(data[i]);
}
As far as I can tell, the performance drop in such algorithms is caused by non-local memory accesses from the second NUMA node. So the question is how to make the application perform better.
Are read-only accesses to non-local memory transparently accelerated somehow (e.g. by the OS copying data from one node's local memory to the other node's local memory)? Or would I have to split the problem, copy the input data to the respective NUMA node, process it there, and afterwards combine the results from all NUMA nodes to improve performance?
If the latter, are there alternatives to the standard containers, since these are not NUMA-aware when allocating memory?
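One common mitigation (a sketch under assumptions, not taken from the question) is to rely on first-touch page placement: physical pages tend to be allocated on the NUMA node of the thread that first writes them. If both the initialization loop and the compute loop use schedule(static), each thread gets the same index range in both loops, so the compute loop mostly touches node-local pages. doExtremeComplexStuff is stood in for by a trivial placeholder here.

```cpp
#include <cstddef>
#include <vector>

// Placeholder for the expensive per-element work in the question.
static int doExtremeComplexStuff(int x) { return x * x + 1; }

// First-touch sketch: schedule(static) in BOTH loops gives each thread the
// same index range, so the thread that first touched a page is the one that
// later reads and writes it.
std::vector<int> processFirstTouch(std::size_t n)
{
    // Caveat: std::vector<int>(n) value-initializes its elements, which
    // already touches every page on the constructing thread; a custom
    // allocator that only reserves address space would avoid this.
    std::vector<int> data(n);
    std::vector<int> res(n);

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < (int) n; ++i)
        data[i] = i;  // first touch: these pages land near this thread

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < (int) n; ++i)
        res[i] = doExtremeComplexStuff(data[i]);  // same thread, same pages

    return res;
}
```

This only helps if the OpenMP runtime maps iterations to threads deterministically, which is why the schedule is spelled out instead of relying on the implementation default.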
numactl strategies? numactl --interleave=all sometimes helps. – Brier
static scheduling, which may not even be the default scheduling policy. – Brier
gcc, clang, and icc do use static as default. – Agni
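For completeness, the numactl invocation the comment refers to (Linux-only, so not directly applicable to the Windows 7 target) interleaves the process's pages round-robin across all nodes, trading peak local bandwidth for uniform average latency; the binary name is a placeholder.

```shell
# Linux-only: spread the application's pages round-robin over all NUMA nodes
numactl --interleave=all ./my_app
```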