Jupyter notebook never finishes processing using multiprocessing (Python 3)

I am using the multiprocessing module and am still learning its capabilities. I am working through the book by Dusty Phillips, and this code comes from it.

import multiprocessing  
import random
from multiprocessing.pool import Pool

def prime_factor(value):
    factors = []
    for divisor in range(2, value-1):
        quotient, remainder = divmod(value, divisor)
        if not remainder:
            factors.extend(prime_factor(divisor))
            factors.extend(prime_factor(quotient))
            break
        else:
            factors = [value]
    return factors

if __name__ == '__main__':
    pool = Pool()
    to_factor = [ random.randint(100000, 50000000) for i in range(20)]
    results = pool.map(prime_factor, to_factor)
    for value, factors in zip(to_factor, results):
        print("The factors of {} are {}".format(value, factors))

In Windows PowerShell (not in the Jupyter notebook) I see the following:

Process SpawnPoolWorker-5:
Process SpawnPoolWorker-1:
AttributeError: Can't get attribute 'prime_factor' on <module '__main__' (built-in)>

I do not understand why the cell never finishes running.

Kishakishinev answered 15/11, 2017 at 17:26 Comment(0)

The problem seems to be a design limitation of Jupyter Notebook (and of some other IDEs): on Windows, multiprocessing starts workers with the spawn method, and each worker re-imports __main__; in a notebook, __main__ is the kernel itself, so a function defined in a cell cannot be found by the workers (hence the "Can't get attribute 'prime_factor' on <module '__main__' (built-in)>" error). The workaround is to write the function (prime_factor) into a separate file and import that module, adjusting the code accordingly. In my case, I put the function into a file called defs.py:

def prime_factor(value):
    factors = []
    for divisor in range(2, value-1):
        quotient, remainder = divmod(value, divisor)
        if not remainder:
            factors.extend(prime_factor(divisor))
            factors.extend(prime_factor(quotient))
            break
        else:
            factors = [value]
    return factors

Then, in the Jupyter notebook, I wrote the following:

import multiprocessing  
import random
from multiprocessing import Pool
import defs



if __name__ == '__main__':
    pool = Pool()
    to_factor = [ random.randint(100000, 50000000) for i in range(20)]
    results = pool.map(defs.prime_factor, to_factor)
    for value, factors in zip(to_factor, results):
        print("The factors of {} are {}".format(value, factors))

This solved my problem.
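
As a convenience (not part of the original answer), defs.py can also be created without leaving the notebook by using IPython's %%writefile cell magic; a minimal sketch:

%%writefile defs.py
# Written to disk from the notebook so that spawned workers can import it by name.
def prime_factor(value):
    factors = []
    for divisor in range(2, value - 1):
        quotient, remainder = divmod(value, divisor)
        if not remainder:
            factors.extend(prime_factor(divisor))
            factors.extend(prime_factor(quotient))
            break
        else:
            factors = [value]
    return factors

After running that cell once, the import defs cell above works unchanged.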


Kishakishinev answered 19/11, 2017 at 8:2 Comment(2)
It works using Pool but doesn't work using Process. What could be the reason?Pigeontoed
Maybe it is obvious, but for future readers: if the function given to the pool, like prime_factor() in the question, calls other functions, those must also be put in the same module as prime_factor().Wester

To execute a function without having to write it into a separate file manually:

We can dynamically write the task to process into a temporary file, import it and execute the function.

from multiprocessing import Pool
from functools import partial
import inspect

def parallel_task(func, iterable, *params):
    # Write the function's source to a temporary module so the spawned
    # workers can import it by name (as "task").
    with open('./tmp_func.py', 'w') as file:
        file.write(inspect.getsource(func).replace(func.__name__, "task"))

    from tmp_func import task

    if __name__ == '__main__':
        func = partial(task, params)
        pool = Pool(processes=8)
        res = pool.map(func, iterable)
        pool.close()
        return res
    else:
        raise RuntimeError("Not in Jupyter Notebook")

We can then simply call it in a notebook cell like this:

def long_running_task(params, id):
    # Heavy job here
    return params, id

data_list = range(8)

for res in parallel_task(long_running_task, data_list, "a", 1, "b"):
    print(res) 

Output:

('a', 1, 'b') 0
('a', 1, 'b') 1
('a', 1, 'b') 2
('a', 1, 'b') 3
('a', 1, 'b') 4
('a', 1, 'b') 5
('a', 1, 'b') 6
('a', 1, 'b') 7

Note: If you're using Anaconda and want to see the progress of the heavy task, you can use print() inside long_running_task(). The printed output will be displayed in the Anaconda Prompt console.

Luiseluiza answered 19/1, 2019 at 11:30 Comment(3)
I have a follow-up question about this post. What if I have a dictionary (say, id is a dictionary) going into long_running_task; how should I change the parallel_task function?Lamere
@H4dr1en. Good trick. I would add this slight modification : pool = Pool(processes=8) -> pool = Pool(processes=len(iterable))Millisent
Would this work with functions that depend on other functions within the same notebook?Cisterna

Strictly speaking, Python multiprocessing with functions defined in the notebook itself isn't supported in Jupyter Notebook on Windows, even if an if __name__ == "__main__" guard is added.

One workaround on Windows 10 is to run the Jupyter server in WSL and connect to it from a Windows browser.

That way you get the same experience as on Linux.

You can set it up manually or refer to the script at https://github.com/mszhanyi/gemini

Booklover answered 27/5, 2021 at 11:27 Comment(0)

Another option: use dask, which plays nicely with Jupyter. Even if you don't need any of dask's special data structures, you can use it simply to control multiple processes.
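
For illustration (my own sketch, not from this answer's author): mapping the prime_factor function from defs.py above over a few numbers with dask.distributed, using an arbitrary worker count of 4:

from dask.distributed import Client
import defs  # the module from the accepted answer

client = Client(n_workers=4)                        # starts a local multi-process cluster
to_factor = [123456, 7891011, 4567890]
futures = client.map(defs.prime_factor, to_factor)  # one task per value
print(client.gather(futures))                       # blocks until all tasks finish
client.close()

Because dask serializes tasks with cloudpickle, functions defined directly in a notebook cell generally work as well, without a separate module.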

Histogen answered 10/3, 2022 at 21:17 Comment(0)

To handle the many quirks of getting multiprocessing to play nicely in a Jupyter session, I've created a library, mpify, which allows one-time multiprocess function executions and passing things from the notebook to the subprocesses with a simple API.

The Jupyter shell process itself can participate as a worker process. The user can choose to gather results from all workers, or from just one of them.

Here it is:

https://github.com/philtrade/mpify

Under the hood, it uses multiprocess -- an actively supported fork of the standard Python multiprocessing library -- to make locally defined variables and functions in the notebook accessible in the subprocesses. It also uses the spawn start method, which is necessary if the subprocesses are to use multiple GPUs, an increasingly common use case. It uses Process(), not Pool(), from the multiprocess API.
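
As a side note, and not mpify's own API: a minimal sketch of what the multiprocess fork enables by itself in a notebook, assuming the package is installed (pip install multiprocess):

from multiprocess import Pool  # "multiprocess" (dill-based), not the stdlib "multiprocessing"

def square(x):  # defined directly in a notebook cell
    return x * x

with Pool(processes=4) as pool:
    # Works because dill can pickle the locally defined function for the workers.
    print(pool.map(square, range(10)))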

The user can supply a custom context manager to acquire resources and set up/tear down the execution environment around the function execution. I've provided a sample context manager to support PyTorch's distributed data parallel (DDP) setup, plus more examples of how to train fastai v2 in Jupyter on multiple GPUs using DDP.

Bug reports, PRs, use cases to share are all welcome.

mpify is by no means a fancy or powerful library; it only aims to support single-host, multiprocess distributed setups, with a simple spawn-execute-terminate flow. It does not support a persistent pool of processes or fancy task scheduling -- ipyparallel and dask already do that.

I hope it can be useful to folks who are struggling with Jupyter + multiprocessing, and possibly with multiple GPUs as well. Thanks.

Wesson answered 21/7, 2020 at 20:56 Comment(0)
