How to pin threads to cores with predetermined memory pool objects? (80-core Nehalem architecture, 2 TB RAM)

I've run into a minor HPC problem after running some tests on an 80-core (160 HT) Nehalem architecture with 2 TB of DRAM:

A server with more than 2 sockets starts to stall a lot (delay) as each thread starts to request information about objects on the "wrong" socket, i.e. requests go from a thread that is working on objects on one socket to pull information that actually sits in the DRAM of the other socket.

The cores appear 100% utilized, even though I know that they are just waiting for the remote socket to return the requested data.

As most of the code runs asynchronously, it is a lot easier to rewrite the code so I can just pass messages from the threads on one socket to threads on the other (no locked waiting). In addition, I want to lock each thread to its own memory pool, so I can update objects in place instead of wasting time (~30%) on the garbage collector.

Hence the question:

How to pin threads to cores with predetermined memory pool objects in Python?

A little more context:

Python has no problem running on multiple cores when you put ZeroMQ in the middle and make an art out of passing messages between the memory pools managed by each ZMQ worker. At ZMQ's 8M msg/second, the internal updates of the objects take longer than it takes to fill the pipeline. This is all described here: http://zguide.zeromq.org/page:all#Chapter-Sockets-and-Patterns

So, with a little over-simplification: I spawn 80 ZMQ worker processes and 1 ZMQ router, and load the context with a large swarm of objects (584 million objects, actually). From this "start point" the objects need to interact with each other to complete the computation.

This is the idea:

  • If "object X" needs to interact with "Object Y" and is available in the local memory pool of the python-thread, then the interaction should be done directly.
  • If "Object Y" is NOT available in the same pool, then I want it to send a message through the ZMQrouter and let the router return a response at some later point in time. My architecture is non-blocking so what goes on in the particular python thread just continues without waiting for the zmqRouters response. Even for objects on the same socket but on a different core, I would prefer NOT to interact, as I prefer having clean message exchanges instead of having 2 threads manipulating the same memory object.

To do this I need to know:

  1. How to figure out which socket a given Python process (thread) runs on.
  2. How to assign a memory pool on that particular socket to the Python process (some malloc limit or similar, so that the sum of the memory pools does not push one socket's pool onto another).
  3. Things I haven't thought of.

But I cannot find references in the Python docs on how to do this, and on Google I must be searching for the wrong thing.

Update:

Regarding the question "why use ZeroMQ on a MPI architecture?", please read the thread: Spread vs MPI vs zeromq? as the application I am working on is being designed for a distributed deployment even though it is tested on a an architecture where MPI is more suitable.

Update 2:

Regarding the question:

"How to pin threads to cores with predetermined memory pools in Python(3)" the answer is in psutils:

>>> import psutil
>>> psutil.cpu_count()
4
>>> p = psutil.Process()
>>> p.cpu_affinity()  # get
[0, 1, 2, 3]
>>> p.cpu_affinity([0])  # set; from now on, this process will run on CPU #0 only
>>> p.cpu_affinity()
[0]
>>>
>>> # reset affinity against all CPUs
>>> all_cpus = list(range(psutil.cpu_count()))
>>> p.cpu_affinity(all_cpus)
>>>

The worker can be pegged to a core, whereby NUMA locality may be exploited effectively (look up your CPU type to verify that it is a NUMA architecture!).
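
To check which socket a given core belongs to, here is a minimal sketch, assuming the Linux sysfs topology files (physical_package_id is the socket number):

import psutil
from pathlib import Path

def socket_of_cpu(cpu):
    # Physical socket (package) id of a logical CPU, read from Linux sysfs.
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/topology/physical_package_id")
    return int(path.read_text())

# Which socket(s) may the current process run on?
me = psutil.Process()
print(sorted({socket_of_cpu(c) for c in me.cpu_affinity()}))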

The second element is to determine the memory pool. That can be done with psutil as well, or with the resource library.
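
For example, a crude per-worker cap via the standard resource module (the 24 GB figure is only illustrative):

import resource

# Illustrative cap: ~24 GB per worker (e.g. 2 TB split across ~80 workers).
POOL_BYTES = 24 * 1024**3

# Limit this worker's virtual address space; allocations beyond the cap
# fail with MemoryError. Note this only bounds the pool size -- actual
# NUMA placement still needs numactl/libnuma (see the answer below).
resource.setrlimit(resource.RLIMIT_AS, (POOL_BYTES, POOL_BYTES))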

Mic answered 5/8, 2013 at 11:28 Comment(3)
Can you explain your question with a bit more context? I would naively answer that a Python process cannot run multicore, so you must be talking about 80 (or 160) independent processes here. Pinning them to specific cores can be achieved e.g. with taskset, on Linux (see man taskset).Tarry
ZeroMQ lets you spread the workload across all your hyperthreads, but that doesn't automatically mean that the memory objects that are created stay with each thread.Mic
Okay. So far so good. I have found that with Python threading I can lock the memory to threading.local(). Now I just need to pin the thread to the core. Is this a C or kernel job?Mic

You might be underestimating the issue; there is no super-easy way to accomplish what you want. As a general guideline, you need to work at the operating-system level to get things set up the way you want. You want to work with so-called "CPU affinity" and "memory affinity", and you need to think hard about your system architecture as well as your software architecture to get things right. In real HPC, these affinities are normally handled by an MPI library such as Open MPI. You might want to consider using one and letting your different processes be handled by that MPI library. The interface between the operating system, the MPI library, and Python can be provided by the mpi4py package.
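
For illustration, a minimal mpi4py program; worker.py and the mpirun flags are examples, and the binding flag spellings vary between Open MPI versions:

from mpi4py import MPI

comm = MPI.COMM_WORLD

# Launched e.g. as: mpirun --bind-to core -np 80 python worker.py
# Open MPI then pins each rank to a core before the interpreter starts,
# so this script only has to work on its rank-local slice of the objects.
print(f"rank {comm.Get_rank()}/{comm.Get_size()} on {MPI.Get_processor_name()}")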

You also need to get your concepts of threads, processes, and OS settings straight. While a thread is the task unit for the CPU time scheduler and therefore could theoretically have an individual affinity, I am only aware of affinity masks for entire processes, i.e. for all threads within one process. For controlling memory access, NUMA (non-uniform memory access) is the right keyword, and you might want to look into http://linuxmanpages.com/man8/numactl.8.php
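
For example, a sketch that starts one worker per NUMA node under numactl, binding both CPUs and memory; worker.py and the node count are placeholders, so check your layout with numactl --hardware first:

import subprocess

NUM_NODES = 8  # placeholder: number of NUMA nodes, see `numactl --hardware`

# One worker per node: --cpunodebind keeps its threads on that node's
# cores, --membind keeps its allocations in that node's DRAM.
for node in range(NUM_NODES):
    subprocess.Popen([
        "numactl", f"--cpunodebind={node}", f"--membind={node}",
        "python", "worker.py", str(node),  # worker.py is hypothetical
    ])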

In any case, you need to read articles about the affinity topic, and might want to start with the Open MPI FAQ on CPU/memory affinity: http://www.open-mpi.de/faq/?category=tuning#paffinity-defs

In case you want to achieve your goal without using an MPI library, look into the util-linux or schedutils and numactl packages of your Linux distribution in order to get useful command-line tools such as taskset, which you could, for example, call from within Python in order to set affinity masks for certain process IDs.
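
A sketch of that, with an illustrative CPU list (map it to one socket using your actual topology):

import os
import subprocess

# Illustrative: restrict the current process to logical CPUs 0-9, assumed
# here to lie on one socket (verify your topology with `lscpu`).
subprocess.run(["taskset", "-p", "-c", "0-9", str(os.getpid())], check=True)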

This article seems to vividly describe how an MPI library can be helpful with your issue:

http://blogs.cisco.com/performance/open-mpi-v1-5-processor-affinity-options/

This SO answer describes how to dissect your hardware architecture: https://mcmap.net/q/901962/-assign-two-mpi-processes-per-core

Generally, I am wondering whether the machine you are using is the right one for the task, or whether you are maybe optimizing at the wrong end. If you are messaging within one machine and hitting memory bandwidth limits, I am not sure ZMQ (through TCP/IP, right?) is the right tool to perform the messaging. Coming back to MPI, the message passing interface for HPC applications...

Chrysoprase answered 14/8, 2013 at 0:8 Comment(3)
Thanks for the detailed answer. It will take some time to follow through the links.Mic
Did you come to any decision yet?Chrysoprase
The MPI lib is the place to go.Mic

Just wondering if this might not be amenable to the use of Python Remote Objects (Pyro) - this might be worth investigating, but unfortunately I do not have access to such hardware.

As explained in the documentation, while Pyro is often used to distribute work across multiple machines on a network, it can also be used to share processing between cores on a single machine.

On a lower level, Pyro is just a form of inter-process communication. So everywhere you would otherwise have used a more primitive form of IPC (such as plain TCP/IP sockets) between Python components, you could consider using Pyro instead.

While Pyro may add some overhead, it may well speed things up and should make things more maintainable.
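
A minimal sketch with Pyro4 (the class and method names are made up for illustration): one process exposes its object pool, and other processes call it as if it were local:

import Pyro4

@Pyro4.expose
class PoolWorker:
    # Owns one memory pool; remote callers interact through proxies.
    def interact(self, object_id):
        return f"handled {object_id} in this worker's pool"

daemon = Pyro4.Daemon()              # listens on a local socket
uri = daemon.register(PoolWorker())  # PYRO uri clients connect to
print("call me via", uri)
daemon.requestLoop()

# In another process:  Pyro4.Proxy(uri).interact("object-Y")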

Seating answered 10/8, 2013 at 8:7 Comment(1)
Can you explain your idea with a little more detail?Mic
