C++ NUMA Optimization
I'm working on a legacy application initially developed for multicore processor systems. To leverage multicore processing, OpenMP and PPL have been used. Now a new requirement is to run the software on systems with more than one NUMA node. The target OS is Windows 7 x64.

I've performed several measurements and noticed that execution time was best when the application was assigned to a single NUMA node, even though that leaves a complete processor unused. Many parts of the application perform data-parallel algorithms where, for example, every element of one vector is processed in parallel and the result is written to another vector, as in the following example:

std::vector<int> data;
std::vector<int> res;

// init data and res

#pragma omp parallel for
for (int i = 0; i < (int) data.size(); ++i)
{  
  res[i] = doExtremeComplexStuff(data[i]);
}

As far as I can tell, the drop in performance in such algorithms is caused by non-local memory accesses from the second NUMA node. So the question is how to make the application perform better.

Are read-only accesses to non-local memory somehow transparently accelerated (e.g., by the OS copying data from one node's local memory to another node's local memory)? Or would I have to split the problem, copy the input data to the respective NUMA node, process it there, and afterwards combine the data from all NUMA nodes again to improve performance?

If that is the case, are there alternatives to the standard containers, since these are not NUMA-aware when allocating memory?
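For reference, a minimal sketch of what per-node allocation could look like, using the documented Win32 call VirtualAllocExNuma (available since Windows Vista); the helper name allocOnNode is just for illustration:

#include <windows.h>
#include <cstddef>

// Allocate `count` ints whose physical pages prefer NUMA node `node`.
// Pages are still committed lazily; on first touch they are taken from
// the preferred node if memory is available there.
int* allocOnNode(std::size_t count, DWORD node)
{
    void* p = VirtualAllocExNuma(GetCurrentProcess(), nullptr,
                                 count * sizeof(int),
                                 MEM_RESERVE | MEM_COMMIT,
                                 PAGE_READWRITE, node);
    return static_cast<int*>(p); // nullptr on failure
}

// Release later with VirtualFree(p, 0, MEM_RELEASE);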

Hellhole answered 5/3, 2018 at 7:31 Comment(9)
I just happen to have this paper open in my browser: cs.brown.edu/~irina/papers/asplos2017-final.pdf Quarrier
Have you tried different numactl strategies? numactl --interleave=all sometimes helps.Brierroot
You are asking about OS NUMA policies without even telling us your OS (version), anything about your hardware, and only extremely little about your code. Any answer has to take wild guesses about your setup. Your kind of measurement is a good first start, but you have to dig deeper to really pinpoint the bottlenecks. Even a brilliant and detailed answer about best practices for NUMA handling may not help you in the slightest...Agni
Note that if all your loops were like you describe, then you would have no NUMA issues, because OpenMP guarantees that repeated loops, even with a different body, will distribute the indexes the same way among threads if the size is the same.Agni
@Agni Note that this is valid only for static scheduling, which may not even be the default scheduling policy.Brierroot
@DanielLangr true - the standard doesn't give a guarantee if static is not specified. In practice at least gcc, clang, and icc do use static as default.Agni
@Agni Yes, you're right that I did not provide any specifics. That is because I do not want a micro-optimized answer for a specific code snippet, but rather a general hint as to whether my assumptions are correct and, if so, which way to go to benefit from the available processing power.Hellhole
@Lukas I've recently run into NUMA issues myself, very closely related to the issues in your post (threaded access of std containers). Did you ever find a solution to the problem you were having? Was the "first touch" policy marked as the answer below the ultimate solution you went with?Wilkens
@Wilkens Unfortunately I did not come up with a better solution. In my case two multithreaded, heavy-load applications were running simultaneously, so I set each process's thread affinity to one NUMA node, so that each process's threads are scheduled exclusively on that node. This gave better overall performance without modifying anything in the existing codebase.Hellhole
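For reference, a minimal sketch of the affinity approach described in the comment above, using the documented Win32 calls GetNumaNodeProcessorMask and SetProcessAffinityMask (the helper name pinProcessToNode is just for illustration):

#include <windows.h>

// Pin the calling process to all logical processors of one NUMA node.
// Note: plain affinity masks cover at most 64 logical processors;
// larger systems need processor groups instead.
bool pinProcessToNode(UCHAR node)
{
    ULONGLONG mask = 0;
    if (!GetNumaNodeProcessorMask(node, &mask) || mask == 0)
        return false; // unknown node or no processors reported
    return SetProcessAffinityMask(GetCurrentProcess(),
                                  static_cast<DWORD_PTR>(mask)) != 0;
}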

When you allocate dynamic memory (as std::vector does), you effectively get a range of pages from the virtual address space. When the program first accesses a particular page, a page fault is triggered and a page of physical memory is mapped in. Usually, this page is taken from the physical memory local to the core that generated the page fault; this is called the first-touch policy.

In your code, if the pages of your std::vector buffers are first touched by a single (e.g., main) thread, it may happen that all elements of these vectors end up in the local memory of a single NUMA node. Then, if you split your program into threads that run on all NUMA nodes, some of the threads will access remote memory when working with these vectors.

The solution is thus to allocate "raw" memory and then "touch" it for the first time with all threads, in the same way it will later be accessed by those threads during the processing phase. Unfortunately, this is not easy to achieve with std::vector, at least with standard allocators (see the allocator sketch after the code below). Can you switch to ordinary dynamic arrays? I would try this first, to find out whether initializing them with respect to the first-touch policy helps:

int* data = new int[N];
int* res = new int[N];

// initialization with respect to first touch policy
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++) {
   data[i] = ...;
   res[i] = ...;
}

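// processing phase: the same static schedule maps iteration i to the
// same thread (and thus the same NUMA node) as in the loop above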
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++)
   res[i] = doExtremeComplexStuff(data[i]);

With static scheduling, the mapping of elements to threads is guaranteed to be the same in both loops (provided both loops have the same iteration count).
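If you would rather stay with std::vector, one common idiom (sketched below; this assumes a C++11 compiler and is not part of the code above) is an allocator that turns value-initialization into default-initialization, so that resizing does not zero out the buffer and the pages remain untouched until your parallel first-touch loop writes them:

#include <memory>
#include <type_traits>
#include <utility>
#include <vector>

// Allocator that leaves trivial element types (like int) uninitialized
// on default insertion, so pages are not touched by the constructor.
template <typename T, typename A = std::allocator<T>>
class default_init_allocator : public A {
    using a_t = std::allocator_traits<A>;
public:
    template <typename U>
    struct rebind {
        using other =
            default_init_allocator<U, typename a_t::template rebind_alloc<U>>;
    };

    using A::A;

    // Default insertion: default-initialize (no zeroing for trivial types).
    template <typename U>
    void construct(U* ptr)
        noexcept(std::is_nothrow_default_constructible<U>::value) {
        ::new (static_cast<void*>(ptr)) U;
    }
    // Any other construction: forward to the underlying allocator.
    template <typename U, typename... Args>
    void construct(U* ptr, Args&&... args) {
        a_t::construct(static_cast<A&>(*this), ptr,
                       std::forward<Args>(args)...);
    }
};

// Usage: the constructor no longer zero-fills the buffer, so the pages
// are first touched by the parallel initialization loop instead:
// std::vector<int, default_init_allocator<int>> data(N), res(N);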


However, I am not convinced that your problem is caused by NUMA effects when accessing these two vectors. Judging by the name doExtremeComplexStuff, this function seems to be very expensive at runtime. If that is true, even accessing remote NUMA memory will likely be negligible in comparison with the cost of the function invocation. The whole problem may be hidden inside this function, but we don't know what it does.

Brierroot answered 5/3, 2018 at 8:15 Comment(6)
This is a very good answer, but still you are mostly guessing about the answer. This can also only touch the surface, e.g. there is transparent NUMA balancing...Agni
@Agni You likely meant guessing about the question. I completely agree with that.Brierroot
Is it really a good idea to use a variable-length array instead of std::vector? Can a std::vector<int*> do the work?Clermontferrand
@LuoJigao There is no VLA involved here; a VLA is something different. You can use a vector as well, but the problem is that with the default allocator, it will zero out elements during resizing, which is exactly what we want to avoid here. Dynamic arrays were used here for the sake of simplicity; writing a custom allocator that skips zero-initialization of elements is a relatively complex and unrelated topic.Brierroot
@DanielLangr I was curious about your answer, since I am working on a similar thing. A simple example: I have a dynamic array with 6 elements, and a memory page has room for four. I write the data with two threads on separate NUMA nodes. Does the first thread write its 3 elements into a physical memory page on node 1? And does the second thread then write the 4th element into node 1's memory? Or do they write to separate pages, with 3 elements per page? I am using Linux.Berberine
@Berberine The virtual address space is contiguous. I don't think the system can map 3 elements to the first page and another 3 to the second page if a page has room for 4 elements.Brierroot
