Experimental Code
Here is the experimental code that can launch a specified number of worker processes and then launch a specified number of worker threads within each process and perform the task of fetching URLs:
import multiprocessing
import sys
import time
import threading
import urllib.request
def main():
processes = int(sys.argv[1])
threads = int(sys.argv[2])
urls = int(sys.argv[3])
# Start process workers.
in_q = multiprocessing.Queue()
process_workers = []
for _ in range(processes):
w = multiprocessing.Process(target=process_worker, args=(threads, in_q))
w.start()
process_workers.append(w)
start_time = time.time()
# Feed work.
for n in range(urls):
in_q.put('http://www.example.com/?n={}'.format(n))
# Send sentinel for each thread worker to quit.
for _ in range(processes * threads):
in_q.put(None)
# Wait for workers to terminate.
for w in process_workers:
w.join()
# Print time consumed and fetch speed.
total_time = time.time() - start_time
fetch_speed = urls / total_time
print('{} x {} workers => {:.3} s, {:.1f} URLs/s'
.format(processes, threads, total_time, fetch_speed))
def process_worker(threads, in_q):
# Start thread workers.
thread_workers = []
for _ in range(threads):
w = threading.Thread(target=thread_worker, args=(in_q,))
w.start()
thread_workers.append(w)
# Wait for thread workers to terminate.
for w in thread_workers:
w.join()
def thread_worker(in_q):
# Each thread performs the actual work. In this case, we will assume
# that the work is to fetch a given URL.
while True:
url = in_q.get()
if url is None:
break
with urllib.request.urlopen(url) as u:
pass # Do nothing
# print('{} - {} {}'.format(url, u.getcode(), u.reason))
if __name__ == '__main__':
main()
Here is how I run this program:
python3 foo.py <PROCESSES> <THREADS> <URLS>
For example, python3 foo.py 20 20 10000
creates 20 worker processes with 20 threads in each worker process (thus a total of 400 worker threads) and fetches 10000 URLs. In the end, this program prints how much time it took to fetch the URLs and how many URLs it fetched per second on an average.
Note that in all cases I am really hitting a URL of www.example.com
domain, i.e., www.example.com
is not merely a placeholder. In other words, I run the above code unmodified.
Environment
I am testing this code on a Linode virtual private server that has 8 GB RAM and 4 CPUs. It is running Debian 9.
$ cat /etc/debian_version
9.9
$ python3
Python 3.5.3 (default, Sep 27 2018, 17:25:39)
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
$ free -m
total used free shared buff/cache available
Mem: 7987 67 7834 10 85 7734
Swap: 511 0 511
$ nproc
4
Case 1: 20 Processes x 20 Threads
Here are a few trial runs with 400 worker threads distributed between 20 worker processes (i.e., 20 worker threads in each of the 20 worker processes). In each trial, 10,000 URLs are fetched.
Here are the results:
$ python3 foo.py 20 20 10000
20 x 20 workers => 5.12 s, 1954.6 URLs/s
$ python3 foo.py 20 20 10000
20 x 20 workers => 5.28 s, 1895.5 URLs/s
$ python3 foo.py 20 20 10000
20 x 20 workers => 5.22 s, 1914.2 URLs/s
$ python3 foo.py 20 20 10000
20 x 20 workers => 5.38 s, 1859.8 URLs/s
$ python3 foo.py 20 20 10000
20 x 20 workers => 5.19 s, 1925.2 URLs/s
We can see that about 1900 URLs are fetched per second on an average. When I monitor the CPU usage with the top
command, I see that each python3
worker process consumes about 10% to 15% CPU.
Case 2: 4 Processes x 100 Threads
Now I thought that I only have 4 CPUs. Even if I launch 20 worker processes, at most only 4 processes can run at any point in physical time. Further due to global interpreter lock (GIL), only one thread in each process (thus a total of 4 threads at most) can run at any point in physical time.
Therefore, I thought if I reduce the number of processes to 4 and increase the number of threads per process to 100, so that the total number of threads still remain 400, the performance should not deteriorate.
But the test results show that 4 processes containing 100 threads each consistently perform worse than 20 processes containing 20 threads each.
$ python3 foo.py 4 100 10000
4 x 100 workers => 9.2 s, 1086.4 URLs/s
$ python3 foo.py 4 100 10000
4 x 100 workers => 10.9 s, 916.5 URLs/s
$ python3 foo.py 4 100 10000
4 x 100 workers => 7.8 s, 1282.2 URLs/s
$ python3 foo.py 4 100 10000
4 x 100 workers => 10.3 s, 972.3 URLs/s
$ python3 foo.py 4 100 10000
4 x 100 workers => 6.37 s, 1570.9 URLs/s
The CPU usage is between 40% to 60% for each python3
worker process.
Case 3: 1 Process x 400 Threads
Just for comparison, I am recording the fact that both case 1 and case 2 outperform the case where we have all 400 threads in a single process. This is most certainly due to the global interpreter lock (GIL).
$ python3 foo.py 1 400 10000
1 x 400 workers => 13.5 s, 742.8 URLs/s
$ python3 foo.py 1 400 10000
1 x 400 workers => 14.3 s, 697.5 URLs/s
$ python3 foo.py 1 400 10000
1 x 400 workers => 13.1 s, 761.3 URLs/s
$ python3 foo.py 1 400 10000
1 x 400 workers => 15.6 s, 640.4 URLs/s
$ python3 foo.py 1 400 10000
1 x 400 workers => 13.1 s, 764.4 URLs/s
The CPU usage is between 120% and 125% for the single python3
worker process.
Case 4: 400 Processes x 1 Thread
Again, just for comparison, here is how the results look when there are 400 processes, each with a single thread.
$ python3 foo.py 400 1 10000
400 x 1 workers => 14.0 s, 715.0 URLs/s
$ python3 foo.py 400 1 10000
400 x 1 workers => 6.1 s, 1638.9 URLs/s
$ python3 foo.py 400 1 10000
400 x 1 workers => 7.08 s, 1413.1 URLs/s
$ python3 foo.py 400 1 10000
400 x 1 workers => 7.23 s, 1382.9 URLs/s
$ python3 foo.py 400 1 10000
400 x 1 workers => 11.3 s, 882.9 URLs/s
The CPU usage is between 1% to 3% for each python3
worker process.
Summary
Picking the median result from each case, we get this summary:
Case 1: 20 x 20 workers => 5.22 s, 1914.2 URLs/s ( 10% to 15% CPU/process)
Case 2: 4 x 100 workers => 9.20 s, 1086.4 URLs/s ( 40% to 60% CPU/process)
Case 3: 1 x 400 workers => 13.5 s, 742.8 URLs/s (120% to 125% CPU/process)
Case 4: 400 x 1 workers => 7.23 s, 1382.9 URLs/s ( 1% to 3% CPU/process
Question
Why does 20 processes x 20 threads perform better than 4 processes x 100 threads even if I have only 4 CPUs?
urllib.request.urlopen(url)
, which will cause a lot of system calls, threads (in the os sense) moved to wait state, etc... – Buffordexample.com
. That's why I did not rely on a single test to obtain these results but in fact, performed several tests in random order to ensure that any results I share in this question are results I can observe consistently. – Nertie