C++ NUMA Optimization
I'm working on a legacy application initially developed for multicore processor systems. To leverage multicore processing, OpenMP and PPL have been used. Now a new requirement is to run the software on systems with more than one NUMA node. The target OS is Windows 7 x64.

I've performed several measurements and noticed that execution time was best when the application was assigned to a single NUMA node, even though that leaves a complete processor unused. Many parts of the application perform data-parallel algorithms where, for example, every element of one vector is processed in parallel and the result is written to another vector, as in the following example:

std::vector<int> data;
std::vector<int> res;

// init data and res

#pragma omp parallel for
for (int i = 0; i < (int) data.size(); ++i)
{  
  res[i] = doExtremeComplexStuff(data[i]);
}

As far as I can tell, the drop in performance in such algorithms is caused by non-local memory accesses from the second NUMA node. So the question is how to make the application perform better.

Are read-only accesses to non-local memory somehow transparently accelerated (e.g., by the OS copying data from one node's local memory to another node's local memory)? Or would I have to split the problem, copy the input data to the respective NUMA node, process it there, and afterwards combine the data from all NUMA nodes again to improve performance?

If that is the case, are there alternatives to the standard containers, since these are not NUMA-aware when allocating memory?
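For reference, a minimal sketch of what per-node allocation could look like, using the documented Win32 call VirtualAllocExNuma (available since Windows Vista); the helper name allocOnNode is just for illustration:

#include <windows.h>
#include <cstddef>

// Allocate `count` ints whose physical pages prefer NUMA node `node`.
// Pages are still committed lazily; on first touch they are taken from
// the preferred node if memory is available there.
int* allocOnNode(std::size_t count, DWORD node)
{
    void* p = VirtualAllocExNuma(GetCurrentProcess(), nullptr,
                                 count * sizeof(int),
                                 MEM_RESERVE | MEM_COMMIT,
                                 PAGE_READWRITE, node);
    return static_cast<int*>(p); // nullptr on failure
}

// Release later with VirtualFree(p, 0, MEM_RELEASE);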

Hellhole answered 5/3, 2018 at 7:31 Comment(9)
I just happen to have this paper open in my browser: cs.brown.edu/~irina/papers/asplos2017-final.pdf Quarrier
Have you tried different numactl strategies? numactl --interleave=all sometimes helps.Brierroot
You are asking about OS NUMA policies without even telling us your OS (version), anything about your hardware, and only extremely little about your code. Any answer has to take wild guesses about your setup. Your kind of measurement is a good first start, but you have to dig deeper to really pinpoint the bottlenecks. Even a brilliant and detailed answer about best practices for NUMA handling may not help you in the slightest...Agni
Note that if all your loops were like you describe, then you would have no NUMA issues, because OpenMP guarantees that repeated loops, even with a different body, will distribute the indexes the same way among threads if the size is the same.Agni
@Agni Note that this is valid only for static scheduling, which may not even be the default scheduling policy.Brierroot
@DanielLangr true - the standard doesn't give a guarantee if static is not specified. In practice at least gcc, clang, and icc do use static as default.Agni
@Agni Yes, you're right that I did not provide any specifics. That is because I do not want a micro-optimized answer for a specific code snippet, but rather a general hint as to whether my assumptions are correct and, if so, which way to go to benefit from the available processing power.Hellhole
@Lukas I've recently run into NUMA issues myself, very closely related to the issues in your post (threaded access of std containers). Did you ever find a solution to the problem you were having? Was the "first touch" policy marked as the answer below the ultimate solution you went with?Wilkens
@Wilkens Unfortunately I did not come up with a better solution. In my case two multithreaded, heavy-load applications were running simultaneously, so I set each process's thread affinity to one NUMA node, so that each process's threads are scheduled exclusively on that node. This gave better overall performance without modifying anything in the existing codebase.Hellhole
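For reference, a minimal sketch of the affinity approach described in the comment above, using the documented Win32 calls GetNumaNodeProcessorMask and SetProcessAffinityMask (the helper name pinProcessToNode is just for illustration):

#include <windows.h>

// Pin the calling process to all logical processors of one NUMA node.
// Note: plain affinity masks cover at most 64 logical processors;
// larger systems need processor groups instead.
bool pinProcessToNode(UCHAR node)
{
    ULONGLONG mask = 0;
    if (!GetNumaNodeProcessorMask(node, &mask) || mask == 0)
        return false; // unknown node or no processors reported
    return SetProcessAffinityMask(GetCurrentProcess(),
                                  static_cast<DWORD_PTR>(mask)) != 0;
}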

When you allocate dynamic memory (as std::vector does), you effectively get a range of pages from the virtual address space. When the program first accesses a particular page, a page fault is triggered and a page of physical memory is mapped in. Usually, this page is taken from the physical memory local to the core that generated the page fault; this is called the first-touch policy.

In your code, if the pages of your std::vector buffers are first touched by a single (e.g., main) thread, it may happen that all elements of these vectors end up in the local memory of a single NUMA node. Then, if you split your program into threads that run on all NUMA nodes, some of the threads will access remote memory when working with these vectors.

The solution is thus to allocate "raw" memory and then "touch" it for the first time with all threads, in the same way it will later be accessed by those threads during the processing phase. Unfortunately, this is not easy to achieve with std::vector, at least with standard allocators (see the allocator sketch after the code below). Can you switch to ordinary dynamic arrays? I would try this first, to find out whether initializing them with respect to the first-touch policy helps:

int* data = new int[N];
int* res = new int[N];

// initialization with respect to first touch policy
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++) {
   data[i] = ...;
   res[i] = ...;
}

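// processing phase: the same static schedule maps iteration i to the
// same thread (and thus the same NUMA node) as in the loop above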
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++)
   res[i] = doExtremeComplexStuff(data[i]);

With static scheduling, the mapping of elements to threads is guaranteed to be the same in both loops (provided both loops have the same iteration count).
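If you would rather stay with std::vector, one common idiom (sketched below; this assumes a C++11 compiler and is not part of the code above) is an allocator that turns value-initialization into default-initialization, so that resizing does not zero out the buffer and the pages remain untouched until your parallel first-touch loop writes them:

#include <memory>
#include <type_traits>
#include <utility>
#include <vector>

// Allocator that leaves trivial element types (like int) uninitialized
// on default insertion, so pages are not touched by the constructor.
template <typename T, typename A = std::allocator<T>>
class default_init_allocator : public A {
    using a_t = std::allocator_traits<A>;
public:
    template <typename U>
    struct rebind {
        using other =
            default_init_allocator<U, typename a_t::template rebind_alloc<U>>;
    };

    using A::A;

    // Default insertion: default-initialize (no zeroing for trivial types).
    template <typename U>
    void construct(U* ptr)
        noexcept(std::is_nothrow_default_constructible<U>::value) {
        ::new (static_cast<void*>(ptr)) U;
    }
    // Any other construction: forward to the underlying allocator.
    template <typename U, typename... Args>
    void construct(U* ptr, Args&&... args) {
        a_t::construct(static_cast<A&>(*this), ptr,
                       std::forward<Args>(args)...);
    }
};

// Usage: the constructor no longer zero-fills the buffer, so the pages
// are first touched by the parallel initialization loop instead:
// std::vector<int, default_init_allocator<int>> data(N), res(N);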


However, I am not convinced that your problem is caused by NUMA effects when accessing these two vectors. Judging by the name doExtremeComplexStuff, this function seems to be very expensive at runtime. If that is true, even accessing remote NUMA memory will likely be negligible in comparison with the cost of the function invocation. The whole problem may be hidden inside this function, but we don't know what it does.

Brierroot answered 5/3, 2018 at 8:15 Comment(6)
This is a very good answer, but still you are mostly guessing about the answer. This can also only touch the surface, e.g. there is transparent NUMA balancing...Agni
@Agni You likely meant guessing about the question. I completely agree with that.Brierroot
Is it really a good idea to use a variable-length array instead of std::vector? Can a std::vector<int*> do the work?Clermontferrand
@LuoJigao There is no VLA involved here; a VLA is something different. You can use a vector as well, but the problem is that with the default allocator, it will zero out elements during resizing, which is exactly what we want to avoid here. Dynamic arrays were used here for the sake of simplicity; writing a custom allocator that skips zero-initialization of elements is a relatively complex and unrelated topic.Brierroot
@DanielLangr I was curious about your answer, since I am working on a similar thing. A simple example: I have a dynamic array with 6 elements, and a memory page has room for four. I write the data with two threads on separate NUMA nodes. Does the first thread write its 3 elements into a physical memory page on node 1? And does the second thread then write the 4th element into node 1's memory? Or do they write to separate pages, with 3 elements per page? I am using Linux.Berberine
@Berberine The virtual address space is contiguous. I don't think the system can map 3 elements to the first page and another 3 to the second page if a page has room for 4 elements.Brierroot
