os.sched_getaffinity(0) vs os.cpu_count()

So, I know the difference between the two methods in the title, but not the practical implications.

From what I understand: if you use more NUM_WORKERS than there are cores actually available, you face big performance drops because your OS constantly switches back and forth trying to keep things in parallel. I don't know how true this is, but I read it here on SO somewhere from someone smarter than me.

And in the docs for os.cpu_count() it says:

Return the number of CPUs in the system. Returns None if undetermined. This number is not equivalent to the number of CPUs the current process can use. The number of usable CPUs can be obtained with len(os.sched_getaffinity(0))

So, I'm trying to work out what the "system" refers to if there can be more CPUs usable by a process than there are in the "system".
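
For reference, a minimal probe (a sketch; os.sched_getaffinity is only available on some platforms, e.g. Linux) shows the two values side by side:

import os

# On an unrestricted system the two numbers usually match. Under a
# restriction such as `taskset -c 0-3 python probe.py` on an 8-core box,
# cpu_count() would still report 8 while the affinity set shrinks to 4.
print("CPUs in the system:       ", os.cpu_count())
if hasattr(os, "sched_getaffinity"):
    print("CPUs this process may use:", len(os.sched_getaffinity(0)))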

I just want to safely and efficiently implement multiprocessing.Pool functionality. So here is my question summarized:

What are the practical implications of:

NUM_WORKERS = os.cpu_count() - 1
# vs.
NUM_WORKERS = len(os.sched_getaffinity(0)) - 1

The -1 is because I've found that my system is a lot less laggy when I try to work while data is being processed.

Confiscate asked 3/10, 2020 at 21:26

If you had a task that was purely 100% CPU bound, i.e. did nothing but calculations, then clearly nothing would or could be gained by having a process pool size greater than the number of CPUs available on your computer. But what if there were a mix of I/O thrown in, whereby a process relinquishes the CPU while waiting for an I/O operation to complete (or, for example, for a URL to be returned from a website, which takes a relatively long time)? To me it's not clear that you couldn't achieve improved throughput in this scenario with a process pool size that exceeds os.cpu_count().

Update

Here is code to demonstrate the point. This code, which would probably be better served by using threading, uses processes. I have 8 cores on my desktop. The program simply retrieves 54 URLs concurrently (or in parallel, in this case). The program is passed an argument: the size of the pool to use. Unfortunately, there is initial overhead just to create additional processes, so the savings begin to fall off if you create too many of them. But if the task were long-running and had a lot of I/O, then the overhead of creating the processes would be worth it in the end:

from concurrent.futures import ProcessPoolExecutor, as_completed
import requests
from timing import time_it  # custom timing decorator, not shown; a sketch follows the listing

def get_url(url):
    resp = requests.get(url, headers={'user-agent': 'my-app/0.0.1'})
    return resp.text


@time_it
def main(poolsize):
    urls = [
        'https://ibm.com',
        'https://microsoft.com',
        'https://google.com',
    ] * 18  # 54 URLs total: the same 3 sites repeated 18 times
    with ProcessPoolExecutor(poolsize) as executor:
        futures = {executor.submit(get_url, url): url for url in urls}
        for future in as_completed(futures):
            text = future.result()
            url = futures[future]
            print(url, text[0:80])
            print('-' * 100)

if __name__ == '__main__':
    import sys
    main(int(sys.argv[1]))
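
The timing module imported above is not shown; here is a minimal sketch of a time_it decorator that would produce output in the format shown below (the name and print format are assumed from that output):

import time
from functools import wraps

def time_it(func):
    # Print the wrapped function's name, its arguments and the elapsed
    # wall-clock time, matching the "func: ... took: ... sec." lines below.
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(f'func: {func.__name__} args: [{args}, {kwargs}] took: {time.time() - start} sec.')
        return result
    return wrapper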

8 processes (the number of cores I have):

func: main args: [(8,), {}] took: 2.316840410232544 sec.

16 processes:

func: main args: [(16,), {}] took: 1.7964842319488525 sec.

24 processes:

func: main args: [(24,), {}] took: 2.2560818195343018 sec.
Goldschmidt answered 4/10, 2020 at 13:18. Comments (5):
FWIW, I have code in this answer here which demonstrates your point. – Mackler
Is this increase in performance due to "virtual" cores? – Confiscate
@Confiscate I have 4 real + 4 virtual cores = 8 (== os.cpu_count()). The performance increase is due to the fact that the processes being created relinquish their core (real or virtual) while waiting for the URL to be returned, and if another process is waiting for a core to run on, it will now be given a chance. – Goldschmidt
Okay, so a process can be created but not assigned a core. Essentially what you are saying is that I can start as many processes as I want, which might make sense for lots of I/O or operations that have some required wait time. During that wait, the process can relinquish the core and let a neighbor work. So my only question is: do multiprocessing pools actually handle this "I'm not doing anything, so I'll let my neighbor have a turn" kind of thinking? – Confiscate
@Confiscate I am fairly certain that it is the underlying operating system (OS), such as Linux or Windows, that is in charge of dispatching a process when a CPU becomes available as a result of another process going into a wait. So it's done at a lower level than Python's Process classes. But remember: unlike threads, which are fairly lightweight, processes you create but cannot efficiently use (see my example) become costly. That is probably why the (reasonable) default when creating Python pools is the number of actual CPUs you have. – Goldschmidt

These two functions are very different. Note that os.sched_getaffinity(0) returns a set, so without the len() call, NUM_WORKERS = os.sched_getaffinity(0) - 1 would instantly fail with a TypeError because you would be trying to subtract an integer from a set. While os.cpu_count() tells you how many cores the system has, os.sched_getaffinity(pid) tells you on which cores a certain thread/process is allowed to run.
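
A quick REPL session (on Linux, where os.sched_getaffinity exists) shows that failure mode:

>>> import os
>>> os.sched_getaffinity(0) - 1   # a set minus an int
Traceback (most recent call last):
  ...
TypeError: unsupported operand type(s) for -: 'set' and 'int'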


os.cpu_count()

os.cpu_count() shows the number of available cores as known to the OS (virtual cores). Most likely you have half this number of physical cores. Whether it makes sense to use more processes than you have physical cores, or even more than virtual cores, depends very much on what you are doing. The tighter the computational loop (little diversity in instructions, few cache misses, ...), the more likely it is that you won't benefit from using more cores (via more worker processes) and may even experience performance degradation.

Obviously it also depends on what else your system is running, because your system tries to give every thread (as the actual execution unit of a process) a fair share of run-time on the available cores. So no generalization is possible in terms of how many workers you should use. But if, for instance, you have a tight loop and your system is otherwise idling, a good starting point for optimizing is

os.cpu_count() // 2 # same as mp.cpu_count() // 2 

...and increasing from there.
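
The standard library doesn't report the physical-core count directly; if you need it, the third-party psutil package can provide it (a sketch, assuming psutil is installed):

import os
import psutil  # third-party: pip install psutil

print(os.cpu_count())                   # logical (virtual) cores
print(psutil.cpu_count(logical=False))  # physical cores only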

As @Frank Yellin already mentioned, multiprocessing.Pool uses os.cpu_count() as the default number of workers.

os.sched_getaffinity(pid)

os.sched_getaffinity(pid)

Return the set of CPUs the process with PID pid (or the current process if zero) is restricted to.

Now core/CPU/processor affinity is about which concrete (virtual) cores your thread (within your worker process) is allowed to run on. Your OS gives every core an id, from 0 to (number-of-cores - 1), and changing affinity allows restricting ("pinning") the actual core(s) a certain thread is allowed to run on at all.

At least on Linux I found this to mean that if none of the allowed cores is currently available, the thread of a child-process won't run, even if other, non-allowed cores would be idle. So "affinity" is a bit misleading here.

The goal when fiddling with affinity is to minimize cache invalidations from context switches and core migrations. Your OS usually has better insight here and already tries to keep caches "hot" with its scheduling policy, so unless you know what you're doing, you can't expect easy gains from interfering.

By default the affinity is set to all cores, and for multiprocessing.Pool it doesn't make much sense to bother changing that, at least if your system is otherwise idle.

Note that although the docs here speak of "process", setting affinity is really a per-thread thing. So, for example, setting affinity in a "child" thread for the "current process if zero" does not change the affinity of the main thread or of other threads within the process. But child threads inherit their affinity from the main thread, and child processes (through their main thread) inherit affinity from the parent process's main thread. This affects all possible start methods ("spawn", "fork", "forkserver"). The example below demonstrates this, and how to modify affinity when using multiprocessing.Pool.

import multiprocessing as mp
import threading
import os


def _location():
    return f"{mp.current_process().name} {threading.current_thread().name}"


def thread_foo():
    print(f"{_location()}, affinity before change: {os.sched_getaffinity(0)}")
    os.sched_setaffinity(0, {4})
    print(f"{_location()}, affinity after change: {os.sched_getaffinity(0)}")


def foo(_, iterations=200e6):

    print(f"{_location()}, affinity before thread_foo:"
          f" {os.sched_getaffinity(0)}")

    for _ in range(int(iterations)):  # some dummy computation
        pass

    t = threading.Thread(target=thread_foo)
    t.start()
    t.join()

    print(f"{_location()}, affinity before exit is unchanged: "
          f"{os.sched_getaffinity(0)}")

    return _


if __name__ == '__main__':

    mp.set_start_method("spawn")  # alternatives on Unix: "fork", "forkserver"

    # for current process, exclude cores 0,1 from affinity-mask
    print(f"parent affinity before change: {os.sched_getaffinity(0)}")
    excluded_cores = {0, 1}
    os.sched_setaffinity(0, os.sched_getaffinity(0).difference(excluded_cores))
    print(f"parent affinity after change: {os.sched_getaffinity(0)}")

    with mp.Pool(2) as pool:
        pool.map(foo, range(5))

Output:

parent affinity before change: {0, 1, 2, 3, 4, 5, 6, 7}
parent affinity after change: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-1 MainThread, affinity before thread_foo: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 MainThread, affinity before thread_foo: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-1 Thread-1, affinity before change: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-1 Thread-1, affinity after change: {4}
SpawnPoolWorker-1 MainThread, affinity before exit is unchanged: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-1 MainThread, affinity before thread_foo: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 Thread-1, affinity before change: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 Thread-1, affinity after change: {4}
SpawnPoolWorker-2 MainThread, affinity before exit is unchanged: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 MainThread, affinity before thread_foo: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 Thread-2, affinity before change: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 Thread-2, affinity after change: {4}
SpawnPoolWorker-2 MainThread, affinity before exit is unchanged: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 MainThread, affinity before thread_foo: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-1 Thread-2, affinity before change: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-1 Thread-2, affinity after change: {4}
SpawnPoolWorker-1 MainThread, affinity before exit is unchanged: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 Thread-3, affinity before change: {2, 3, 4, 5, 6, 7}
SpawnPoolWorker-2 Thread-3, affinity after change: {4}
SpawnPoolWorker-2 MainThread, affinity before exit is unchanged: {2, 3, 4, 5, 6, 7}
Mackler answered 4/10, 2020 at 12:36

The implementation of multiprocessing.Pool uses

if processes is None:
    processes = os.cpu_count() or 1

Not sure if that answers your question, but at least it's a data point.
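
For the pattern the question asks about, a portable sketch might fall back to os.cpu_count() on platforms where os.sched_getaffinity is unavailable (e.g. Windows, macOS); the max(1, ...) guard and the worker count are my own assumptions, not part of the library:

import os
import multiprocessing as mp

try:
    usable = len(os.sched_getaffinity(0))  # Linux: CPUs this process may use
except AttributeError:
    usable = os.cpu_count() or 1           # fallback where unavailable

NUM_WORKERS = max(1, usable - 1)           # leave one core for interactive work

if __name__ == '__main__':
    with mp.Pool(NUM_WORKERS) as pool:
        print(pool.map(abs, range(-4, 4)))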

Yeorgi answered 3/10, 2020 at 22:10
