400 threads in 20 processes outperform 400 threads in 4 processes while performing an I/O-bound task
Asked Answered
N

1

4

Experimental Code

Here is the experimental code that can launch a specified number of worker processes and then launch a specified number of worker threads within each process and perform the task of fetching URLs:

import multiprocessing
import sys
import time
import threading
import urllib.request


def main():
    processes = int(sys.argv[1])
    threads = int(sys.argv[2])
    urls = int(sys.argv[3])

    # Start process workers.
    in_q = multiprocessing.Queue()
    process_workers = []
    for _ in range(processes):
        w = multiprocessing.Process(target=process_worker, args=(threads, in_q))
        w.start()
        process_workers.append(w)

    start_time = time.time()

    # Feed work.
    for n in range(urls):
        in_q.put('http://www.example.com/?n={}'.format(n))

    # Send sentinel for each thread worker to quit.
    for _ in range(processes * threads):
        in_q.put(None)

    # Wait for workers to terminate.
    for w in process_workers:
        w.join()

    # Print time consumed and fetch speed.
    total_time = time.time() - start_time
    fetch_speed = urls / total_time
    print('{} x {} workers => {:.3} s, {:.1f} URLs/s'
          .format(processes, threads, total_time, fetch_speed))



def process_worker(threads, in_q):
    # Start thread workers.
    thread_workers = []
    for _ in range(threads):
        w = threading.Thread(target=thread_worker, args=(in_q,))
        w.start()
        thread_workers.append(w)

    # Wait for thread workers to terminate.
    for w in thread_workers:
        w.join()


def thread_worker(in_q):
    # Each thread performs the actual work. In this case, we will assume
    # that the work is to fetch a given URL.
    while True:
        url = in_q.get()
        if url is None:
            break

        with urllib.request.urlopen(url) as u:
            pass # Do nothing
            # print('{} - {} {}'.format(url, u.getcode(), u.reason))


if __name__ == '__main__':
    main()

Here is how I run this program:

python3 foo.py <PROCESSES> <THREADS> <URLS>

For example, python3 foo.py 20 20 10000 creates 20 worker processes with 20 threads in each worker process (thus a total of 400 worker threads) and fetches 10000 URLs. In the end, this program prints how much time it took to fetch the URLs and how many URLs it fetched per second on an average.

Note that in all cases I am really hitting a URL of www.example.com domain, i.e., www.example.com is not merely a placeholder. In other words, I run the above code unmodified.

Environment

I am testing this code on a Linode virtual private server that has 8 GB RAM and 4 CPUs. It is running Debian 9.

$ cat /etc/debian_version 
9.9

$ python3
Python 3.5.3 (default, Sep 27 2018, 17:25:39) 
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7987          67        7834          10          85        7734
Swap:           511           0         511

$ nproc
4

Case 1: 20 Processes x 20 Threads

Here are a few trial runs with 400 worker threads distributed between 20 worker processes (i.e., 20 worker threads in each of the 20 worker processes). In each trial, 10,000 URLs are fetched.

Here are the results:

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.12 s, 1954.6 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.28 s, 1895.5 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.22 s, 1914.2 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.38 s, 1859.8 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.19 s, 1925.2 URLs/s

We can see that about 1900 URLs are fetched per second on an average. When I monitor the CPU usage with the top command, I see that each python3 worker process consumes about 10% to 15% CPU.

Case 2: 4 Processes x 100 Threads

Now I thought that I only have 4 CPUs. Even if I launch 20 worker processes, at most only 4 processes can run at any point in physical time. Further due to global interpreter lock (GIL), only one thread in each process (thus a total of 4 threads at most) can run at any point in physical time.

Therefore, I thought if I reduce the number of processes to 4 and increase the number of threads per process to 100, so that the total number of threads still remain 400, the performance should not deteriorate.

But the test results show that 4 processes containing 100 threads each consistently perform worse than 20 processes containing 20 threads each.

$ python3 foo.py 4 100 10000
4 x 100 workers => 9.2 s, 1086.4 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 10.9 s, 916.5 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 7.8 s, 1282.2 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 10.3 s, 972.3 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 6.37 s, 1570.9 URLs/s

The CPU usage is between 40% to 60% for each python3 worker process.

Case 3: 1 Process x 400 Threads

Just for comparison, I am recording the fact that both case 1 and case 2 outperform the case where we have all 400 threads in a single process. This is most certainly due to the global interpreter lock (GIL).

$ python3 foo.py 1 400 10000
1 x 400 workers => 13.5 s, 742.8 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 14.3 s, 697.5 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 13.1 s, 761.3 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 15.6 s, 640.4 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 13.1 s, 764.4 URLs/s

The CPU usage is between 120% and 125% for the single python3 worker process.

Case 4: 400 Processes x 1 Thread

Again, just for comparison, here is how the results look when there are 400 processes, each with a single thread.

$ python3 foo.py 400 1 10000
400 x 1 workers => 14.0 s, 715.0 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 6.1 s, 1638.9 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 7.08 s, 1413.1 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 7.23 s, 1382.9 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 11.3 s, 882.9 URLs/s

The CPU usage is between 1% to 3% for each python3 worker process.

Summary

Picking the median result from each case, we get this summary:

Case 1:  20 x  20 workers => 5.22 s, 1914.2 URLs/s ( 10% to  15% CPU/process)
Case 2:   4 x 100 workers => 9.20 s, 1086.4 URLs/s ( 40% to  60% CPU/process)
Case 3:   1 x 400 workers => 13.5 s,  742.8 URLs/s (120% to 125% CPU/process)
Case 4: 400 x   1 workers => 7.23 s, 1382.9 URLs/s (  1% to   3% CPU/process

Question

Why does 20 processes x 20 threads perform better than 4 processes x 100 threads even if I have only 4 CPUs?

Nertie answered 23/5, 2019 at 9:55 Comment(8)
you are really measuring urllib.request.urlopen(url), which will cause a lot of system calls, threads (in the os sense) moved to wait state, etc...Bufford
That method is pure IO so the number of cores isn't significant. It's doing what a web app load tester would do. Most of the time the method is doing nothing, waiting for the OS to complete network IO, or the remote server to respond. The times are affected primarily affected by that remote server, caching, your network bandwidth and your network card's performanceHarangue
@PanagiotisKanavos In that case, I would expect 4 processes x 100 threads to perform as well as 20 processes x 20 threads. Why then 4 processes x 100 threads consistently perform worse than 20 processes x 20 threads? Further, the observation that 1 process x 400 workers performs the worst does show that utilizing only one core at any point in physical time leads to the worst performance.Nertie
@SusamPal why expect that at all? What are you measuring in the first place? It's not the CPU times, or any type of local performance. Did you check your router's performance? Your bandwidth? Your code doesn't do anything CPU-related so you aren't measuring thread performance, you're measuring the network's performanceHarangue
@SusamPal in fact, process limits are exactly what would cause 4 processes with 100 threads to work worse than 20x20. If each process gets the same limits, those 100 threads are getting in each other's way.Harangue
@SusamPal if you want to compare performance use a CPU bound task like calculating prime numbers, adding million-item arrays or something else that actually uses the CPUHarangue
@PanagiotisKanavos Like I mentioned in my question, I am running this code in a Linode virtual private server which is consistently giving me a download bandwidth of about 800-1000 Mbit/s and upload bandwidth of about 600-800 Mbit/s. I know there are many variables involved here between my server and example.com. That's why I did not rely on a single test to obtain these results but in fact, performed several tests in random order to ensure that any results I share in this question are results I can observe consistently.Nertie
@PanagiotisKanavos This question is about I/O-bound task on purpose because I observed these results with an I/O-bound task which got me curious. Comparing the performance of a CPU-bound task would be a separate question. I have asked it here: https://mcmap.net/q/131285/-400-threads-in-20-processes-outperform-400-threads-in-4-processes-while-performing-a-cpu-bound-task-on-4-cpus/303363. If you think process limits cause 4 processes x 100 threads to perform worse than 20 processes x 20 threads, then that could be an answer to this question and you can post that as answer.Nertie
T
2

Your task is I/O-bound rather than CPU-bound: threads spend most of the time in sleep state waiting for network data and such rather than using the CPU.

So adding more threads than CPUs works here as long as I/O is still the bottleneck. The effect will only subside once there are so many threads that enough of them are ready at a time to start actively competing for CPU cycles (or when your network bandwidth is exhausted, whichever comes first).


As for why 20 threads per process is faster than 100 threads per process: this is most likely due to CPython's GIL. Python threads in the same process need to wait not only for I/O but for each other, too.
When dealing with I/O, Python machinery:

  1. Converts all Python objects involved into C objects (in many cases, this can be done without physically copying the data)
  2. Releases the GIL
  3. Perform the I/O in C (which involves waiting for it for arbitrary time)
  4. Reacquires the GIL
  5. Converts the result to a Python object if applicable

If there are enough threads in the same process, it becomes increasigly likely that another one is active when step 4 is reached, causing an additional random delay.


Now, when it comes to lots of processes, other factors come into play like memory swapping (since unlike threads, processes running the same code don't share memory) (I'm pretty sure there are other delays from lots of processes as opposed to threads competing for resources but can't point it from the top of my head). That's why the performance becomes unstable.

Throne answered 23/5, 2019 at 10:38 Comment(5)
I understand that adding more threads than CPUs works here because the task is I/O-bound. What I am confused about is why the distribution of threads between processes matter. Why does running 400 threads within 20 processes (i.e., 20 threads per process) consistently perform better than running 400 threads within 4 processes (i.e., 100 threads per process)?Nertie
@SusamPal Most probably, this is Python GIL's overhead.Throne
Not sure the heart of the matter has been addressed: 20 processes would by your answer require more overhead than 4 processes for the same number of threads yet the observation by the OP is the opposite.Grubstake
@javadba OP doesn't have "4 processes for the same number of threads", they have 4 processes with more threads per process. I showed how more threads per process adds overhead.Throne
yes we do have the same number of threads: 400 in all cases. But as for "more more threads per process adds overhead" that may be true - but there is the flip side: having more processes requires copying data to each of the processes. The startup cost and the overall memory required is higher for more processes/fewer threads per process. What is claimed in your answer is that the cost of GIL switching for each process is higher when there are more threads per process. There is not a single correct answer for which approach is better /more performant that applies to all scenarios.Grubstake

© 2022 - 2024 — McMap. All rights reserved.