I've run into a minor HPC problem after running some tests on an 80-core (160 HT) Nehalem architecture with 2 TB of DRAM:
A server with more than 2 sockets starts to stall a lot as each thread starts to request information about objects on the "wrong" socket, i.e. requests go from a thread working on objects attached to one socket to pull information that actually sits in the DRAM of the other socket.
The cores appear 100% utilized, even though I know they are really just waiting for the remote socket to return the requested data.
As most of the code runs asynchronously, it is a lot easier to rewrite it so that I simply pass messages from threads on one socket to threads on the other (no blocking waits). In addition I want to pin each thread to its own memory pool, so I can update objects in place instead of wasting time (~30%) on the garbage collector.
Hence the question:
How to pin threads to cores with predetermined memory pool objects in Python?
A little more context:
Python has no problem running multicore when you put ZeroMQ in the middle and make an art out of passing messages between the memory pools managed by each ZMQ worker. At ZeroMQ's ~8M msg/second, the internal update of the objects takes longer than the pipeline can be filled. This is all described here: http://zguide.zeromq.org/page:all#Chapter-Sockets-and-Patterns
So, with a little over-simplification, I spawn 80 ZMQ worker processes and 1 ZMQ router, and load the context with a large swarm of objects (584 million objects, actually). From this starting point the objects need to interact with each other to complete the computation.
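For concreteness, here is a minimal sketch of that layout, using one ROUTER socket and DEALER workers; the endpoint name, worker count and the message handling are placeholders, not my actual code:

import multiprocessing
import zmq

NUM_WORKERS = 80                      # one worker per core (illustrative)
ROUTER_ADDR = "ipc://objects.ipc"     # hypothetical endpoint

def worker(worker_id):
    """Each worker owns its own object pool and only talks to the router."""
    sock = zmq.Context.instance().socket(zmq.DEALER)
    sock.setsockopt(zmq.IDENTITY, ("worker-%d" % worker_id).encode())
    sock.connect(ROUTER_ADDR)
    local_pool = {}                   # objects owned by this worker
    while True:
        frames = sock.recv_multipart()
        # ... update local objects or request remote ones here ...

def router():
    """Single router that relays envelopes between workers."""
    sock = zmq.Context.instance().socket(zmq.ROUTER)
    sock.bind(ROUTER_ADDR)
    while True:
        frames = sock.recv_multipart()
        # ... rewrite the envelope and forward to the owning worker ...

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=router)]
    procs += [multiprocessing.Process(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
    for p in procs:
        p.start()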
This is the idea:
- If "object X" needs to interact with "Object Y" and is available in the local memory pool of the python-thread, then the interaction should be done directly.
- If "Object Y" is NOT available in the same pool, then I want it to send a message through the ZMQrouter and let the router return a response at some later point in time. My architecture is non-blocking so what goes on in the particular python thread just continues without waiting for the zmqRouters response. Even for objects on the same socket but on a different core, I would prefer NOT to interact, as I prefer having clean message exchanges instead of having 2 threads manipulating the same memory object.
To do this I need to know:
- how to figure out which socket a given Python process (thread) runs on (I sketch my own sysfs-based guess a bit further down).
- how to assign a memory pool on that particular socket to the Python process (some malloc limit or similar, so that the sum of the memory pools does not push allocations from one socket over to another).
- Things I haven't thought of.
But I cannot find references in the Python docs on how to do this, and on Google I must be searching for the wrong thing.
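The closest I have come up with for the socket question is reading the CPU topology straight from sysfs on Linux; socket_of_cpu() below is just my own sketch, and I am not sure it is the intended way:

import psutil

def socket_of_cpu(cpu_id):
    """Physical socket (package) a logical CPU belongs to, via Linux sysfs."""
    path = "/sys/devices/system/cpu/cpu%d/topology/physical_package_id" % cpu_id
    with open(path) as f:
        return int(f.read())

def socket_of_current_process():
    """Socket of this process; only meaningful once it is pinned to one CPU."""
    cpus = psutil.Process().cpu_affinity()
    return socket_of_cpu(cpus[0])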
Update:
Regarding the question "why use ZeroMQ on a MPI architecture?", please read the thread: Spread vs MPI vs zeromq? as the application I am working on is being designed for a distributed deployment even though it is tested on a an architecture where MPI is more suitable.
Update 2:
Regarding the question:
"How to pin threads to cores with predetermined memory pools in Python(3)" the answer is in psutils:
>>> import psutil
>>> psutil.cpu_count()
4
>>> p = psutil.Process()
>>> p.cpu_affinity() # get
[0, 1, 2, 3]
>>> p.cpu_affinity([0]) # set; from now on, this process will run on CPU #0 only
>>> p.cpu_affinity()
[0]
>>>
>>> # reset affinity against all CPUs
>>> all_cpus = list(range(psutil.cpu_count()))
>>> p.cpu_affinity(all_cpus)
>>>
The worker can be pegged to a core, whereby NUMA locality can be exploited effectively (look up your CPU type to verify that it is a NUMA architecture!).
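Building on that, each spawned worker can pin itself before it allocates anything, so that (with Linux's default first-touch policy) its pool ends up on the local NUMA node; the core list and worker body here are only illustrative:

import multiprocessing
import psutil

def pinned_worker(core_id):
    """Pin this worker to one logical CPU before creating its object pool."""
    psutil.Process().cpu_affinity([core_id])
    local_pool = {}   # objects created after the pin are first-touched on the local node
    # ... create the ZMQ DEALER socket and enter the worker loop here ...

if __name__ == "__main__":
    cores = list(range(psutil.cpu_count()))
    workers = [multiprocessing.Process(target=pinned_worker, args=(c,)) for c in cores]
    for w in workers:
        w.start()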
The second element is to determine the memory pool. That can be done with psutil as well, or with the resource library. (On Linux, the taskset command is also worth a look for setting CPU affinity; see man taskset. Thanks to Tarry for pointing this out in the comments.)
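For the per-worker limit, a rough sketch with the resource module is below; note that RLIMIT_AS only caps how much a worker may allocate, it does not by itself bind the pages to a particular NUMA node (for that one would have to reach for numactl or a libnuma wrapper), and the 24 GiB figure is purely illustrative:

import resource

def cap_worker_memory(max_bytes):
    """Cap this worker's address space so the sum of pools on one socket
    cannot spill over into the DRAM of another (Unix only)."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (max_bytes, hard))

cap_worker_memory(24 * 1024**3)   # e.g. 24 GiB per worker (illustrative)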