Tracking progress of joblib.Parallel execution

Is there a simple way to track the overall progress of a joblib.Parallel execution?

I have a long-running execution composed of thousands of jobs, which I want to track and record in a database. To do that, whenever Parallel finishes a task, I need it to execute a callback reporting how many jobs are left.

I've accomplished a similar task before with Python's stdlib multiprocessing.Pool, by launching a thread that records the number of pending jobs in Pool's job list.
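
For reference, that trick looked roughly like this (a minimal sketch; _cache is a private attribute of Pool, so it may differ between Python versions):

import multiprocessing
import threading
import time

def monitor(pool, total):
    # Poll the pool's internal job cache until everything has completed.
    # pool._cache maps job ids to pending ApplyResult objects.
    while pool._cache:
        print("remaining jobs: {}/{}".format(len(pool._cache), total))
        time.sleep(1)

def work(x):
    time.sleep(0.5)
    return x * x

if __name__ == "__main__":
    with multiprocessing.Pool(4) as pool:
        results = [pool.apply_async(work, (i,)) for i in range(20)]
        monitor_thread = threading.Thread(target=monitor, args=(pool, len(results)))
        monitor_thread.start()
        values = [r.get() for r in results]
        monitor_thread.join()
    print(values)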

Looking at the code, Parallel uses Pool under the hood, so I thought I could pull off the same trick, but it doesn't seem to use that job list, and I haven't figured out any other way to "read" its internal status.

Ayer answered 27/7, 2014 at 17:20 Comment(0)

Why can't you simply use tqdm? The following worked for me:

from joblib import Parallel, delayed
from tqdm import tqdm

def myfun(x):
    return x**2

results = Parallel(n_jobs=8)(delayed(myfun)(i) for i in tqdm(range(1000)))
100%|██████████| 1000/1000 [00:00<00:00, 10563.37it/s]
Kenward answered 20/4, 2018 at 22:59 Comment(10)
I don't think this is actually monitoring the completion of running jobs, just the queuing of jobs. If you were to insert a time.sleep(1) at the start of myfun you would find the tqdm progress finishes almost instantly, but results takes a few more seconds to populate.Clerissa
Yes, that’s partly correct. It is tracking the job starts vs the completions, but the other issue is that there is also a delay caused by overhead after all jobs are completed. Once all tasks are completed results need to be collected and this can take quite a while.Kenward
I believe this answer doesn't really answer the question. As it was mentioned, one will track queuing and not the execution itself with this approach. The approach with callback shown below seems to be more precise in relation to the question.Stacistacia
@Stacistacia yes, that was addressed in the former comment.Kenward
This answer is incorrect, as it does not answer the question. This answer should be unaccepted.Leena
The provided answer by frenzykryger below contains a great solution to the problem of this answer.Achieve
It's wrong. It only counts job starts, which happen immediately.Mcneil
This worked for me with a reasonably complex logistic regression function called on thousands of probes in parallel: stats = parallel(func(data, phenotype) for data in tqdm(meth_data, total=len(all_probes), desc='Probes') ) [meth_data is a dataframe and I'm passing each column through the function]Portfolio
Oct 2022 and this wrong answer is still the accepted answer. This will just show the progress of start of jobs. @Ayer please change the accepted answer.Ikhnaton
While this answer is indeed technically wrong, as several comments have pointed out, it's still useful: it's the simplest solution, way easier than the other answers, and when I'm using it with a large number of short jobs, completion is not long after queueing, so in some cases it can be good enough.Thagard

Yet another step ahead of dano's and Connor's answers is to wrap the whole thing as a context manager:

import contextlib
import joblib
from tqdm import tqdm

@contextlib.contextmanager
def tqdm_joblib(tqdm_object):
    """Context manager to patch joblib to report into tqdm progress bar given as argument"""
    class TqdmBatchCompletionCallback(joblib.parallel.BatchCompletionCallBack):
        def __call__(self, *args, **kwargs):
            tqdm_object.update(n=self.batch_size)
            return super().__call__(*args, **kwargs)

    old_batch_callback = joblib.parallel.BatchCompletionCallBack
    joblib.parallel.BatchCompletionCallBack = TqdmBatchCompletionCallback
    try:
        yield tqdm_object
    finally:
        joblib.parallel.BatchCompletionCallBack = old_batch_callback
        tqdm_object.close()

Then you can use it like this, without leaving monkey-patched code behind once you're done:

from math import sqrt
from joblib import Parallel, delayed

with tqdm_joblib(tqdm(desc="My calculation", total=10)) as progress_bar:
    Parallel(n_jobs=16)(delayed(sqrt)(i**2) for i in range(10))

which I think is awesome, and it looks similar to the tqdm pandas integration.

Amandaamandi answered 19/11, 2019 at 14:50 Comment(5)
Excellent solution. Tested with joblib 0.14.1 and tqdm 4.41.0 -- works great. This would be a great addition to tqdm!Neuroblast
I can't edit it, but minor typo in solution where joblib.parallel.BatchCompletionCallback is actually BatchCompletionCallBack (note the camelcase on CallBack)Whereon
I just posted this code to PyPI: github.com/louisabraham/tqdm_joblib Now you can just pip install tqdm_joblib and from tqdm_joblib import tqdm_joblibGrandmotherly
i think this is no longer workingIndeciduous
Fantastic, this works out of the box, thank you, featuredpeow and AlanSTACK! I have also successfully tested using a Parallel context within this context, so with tqdm_joblib() as progress_bar: with Parallel as parallel: <code>Counterreply

The documentation you linked to states that Parallel has an optional progress meter. It's implemented by using the callback keyword argument provided by multiprocessing.Pool.apply_async:

# This is inside a dispatch function
self._lock.acquire()
job = self._pool.apply_async(SafeFunction(func), args,
            kwargs, callback=CallBack(self.n_dispatched, self))
self._jobs.append(job)
self.n_dispatched += 1

...

class CallBack(object):
    """ Callback used by parallel: it is used for progress reporting, and
        to add data to be processed
    """
    def __init__(self, index, parallel):
        self.parallel = parallel
        self.index = index

    def __call__(self, out):
        self.parallel.print_progress(self.index)
        if self.parallel._original_iterable:
            self.parallel.dispatch_next()

And here's print_progress:

def print_progress(self, index):
    elapsed_time = time.time() - self._start_time

    # This is heuristic code to print only 'verbose' times a messages
    # The challenge is that we may not know the queue length
    if self._original_iterable:
        if _verbosity_filter(index, self.verbose):
            return
        self._print('Done %3i jobs       | elapsed: %s',
                    (index + 1,
                     short_format_time(elapsed_time),
                    ))
    else:
        # We are finished dispatching
        queue_length = self.n_dispatched
        # We always display the first loop
        if not index == 0:
            # Display depending on the number of remaining items
            # A message as soon as we finish dispatching, cursor is 0
            cursor = (queue_length - index + 1
                      - self._pre_dispatch_amount)
            frequency = (queue_length // self.verbose) + 1
            is_last_item = (index + 1 == queue_length)
            if (is_last_item or cursor % frequency):
                return
        remaining_time = (elapsed_time / (index + 1) *
                    (self.n_dispatched - index - 1.))
        self._print('Done %3i out of %3i | elapsed: %s remaining: %s',
                    (index + 1,
                     queue_length,
                     short_format_time(elapsed_time),
                     short_format_time(remaining_time),
                    ))

The way they implement this is kind of weird, to be honest - it seems to assume that tasks will always be completed in the order that they're started. The index variable that goes to print_progress is just the self.n_dispatched variable at the time the job was actually started. So the first job launched will always finish with an index of 0, even if say, the third job finished first. It also means they don't actually keep track of the number of completed jobs. So there's no instance variable for you to monitor.

I think your best bet is to make your own CallBack class, and monkey patch Parallel:

from math import sqrt
from collections import defaultdict
from joblib import Parallel, delayed

class CallBack(object):
    completed = defaultdict(int)

    def __init__(self, index, parallel):
        self.index = index
        self.parallel = parallel

    def __call__(self, index):
        CallBack.completed[self.parallel] += 1
        print("done with {}".format(CallBack.completed[self.parallel]))
        if self.parallel._original_iterable:
            self.parallel.dispatch_next()

import joblib.parallel
joblib.parallel.CallBack = CallBack

if __name__ == "__main__":
    print(Parallel(n_jobs=2)(delayed(sqrt)(i**2) for i in range(10)))

Output:

done with 1
done with 2
done with 3
done with 4
done with 5
done with 6
done with 7
done with 8
done with 9
done with 10
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

That way, your callback gets called whenever a job completes, rather than the default one.
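
Since the original goal was to record progress in a database, the patched callback is also a natural place to hook that in. A minimal sketch along the same lines (targeting the same old joblib internals as above; record_progress is a hypothetical stand-in for your own database write):

from collections import defaultdict
import joblib.parallel

def record_progress(done, total):
    # hypothetical helper: replace with your actual database write
    print("recorded {}/{} jobs completed".format(done, total))

class DBCallBack(object):
    completed = defaultdict(int)
    total = 0  # set to the number of jobs before calling Parallel

    def __init__(self, index, parallel):
        self.index = index
        self.parallel = parallel

    def __call__(self, out):
        DBCallBack.completed[self.parallel] += 1
        record_progress(DBCallBack.completed[self.parallel], DBCallBack.total)
        if self.parallel._original_iterable:
            self.parallel.dispatch_next()

joblib.parallel.CallBack = DBCallBack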

Questionless answered 27/7, 2014 at 18:24 Comment(2)
Great research, thanks. I didn't notice the callback attribute.Ayer
I found the documentation of joblib is very limited. I have to dig into the source code for this CallBack class. My question: can I customize the arguments when __call__ is called? (sub-classing the whole Parallel class may be one way but it's heavy for me).Inconsiderate

Expanding on dano's answer for newer versions of the joblib library, which made a couple of changes to the internal implementation: the callback class is now called BatchCompletionCallBack, it receives a dispatch timestamp and a batch size, and it is invoked once per completed batch rather than once per job.

from joblib import Parallel, delayed
from collections import defaultdict

# patch joblib progress callback
class BatchCompletionCallBack(object):
    completed = defaultdict(int)

    # joblib instantiates this as (dispatch_timestamp, batch_size, parallel)
    def __init__(self, dispatch_timestamp, batch_size, parallel):
        self.batch_size = batch_size
        self.parallel = parallel

    # invoked once per completed batch, with the batch result
    def __call__(self, out):
        BatchCompletionCallBack.completed[self.parallel] += 1
        print("done with {}".format(BatchCompletionCallBack.completed[self.parallel]))
        if self.parallel._original_iterator is not None:
            self.parallel.dispatch_next()

import joblib.parallel
joblib.parallel.BatchCompletionCallBack = BatchCompletionCallBack
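
With the patch applied, any subsequent Parallel call reports completions. Note that the count is per completed batch; with joblib's default auto-batching this starts at one job per batch but can grow for fast tasks:

from math import sqrt
print(Parallel(n_jobs=2)(delayed(sqrt)(i**2) for i in range(10)))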
Dropkick answered 23/1, 2017 at 20:28 Comment(0)

As of joblib v1.3.0, released in June 2023, there's an easier way to wrap joblib.Parallel with the tqdm progress bar (inspired by this comment).

This progress bar will track job completion, not job enqueueing. Previously this required a special context manager. Here's an example:

from joblib import Parallel, delayed
from tqdm import tqdm

import time
import random

# Our example worker will sleep for a certain number of seconds.

inputs = list(range(10))
random.shuffle(inputs)

def worker(n_seconds):
    time.sleep(n_seconds)
    return n_seconds

# Run the worker jobs in parallel, with a tqdm progress bar.
# We configure Parallel to return a generator.
# Then we wrap the generator in tqdm.
# Finally, we execute everything by converting the tqdm generator to a list.

outputs = list(
    tqdm(
        # Note the new return_as argument here, which requires joblib >= 1.3:
        Parallel(return_as="generator", n_jobs=3)(
            delayed(worker)(n_seconds) for n_seconds in inputs
        ),
        total=len(inputs),
    )
)
print(outputs)
Cloutier answered 20/7, 2023 at 1:19 Comment(2)
Excellent solution! I think now that joblib v1.3.0 is out, this should probably be the accepted answer. It works great and is much simpler than the other solutions.Jallier
This is neat, but it should be noted that return_as="generator" only works with a few selected backends (including loky and threading)Thinia

TLDR solution:

Works with joblib 0.14.0 and tqdm 4.46.0 using python 3.5. Credits to frenzykryger for contextlib suggestions, dano and Connor for monkey patching idea.

import contextlib
import joblib
from tqdm import tqdm
from joblib import Parallel, delayed

@contextlib.contextmanager
def tqdm_joblib(tqdm_object):
    """Context manager to patch joblib to report into tqdm progress bar given as argument"""

    def tqdm_print_progress(self):
        if self.n_completed_tasks > tqdm_object.n:
            n_completed = self.n_completed_tasks - tqdm_object.n
            tqdm_object.update(n=n_completed)

    original_print_progress = joblib.parallel.Parallel.print_progress
    joblib.parallel.Parallel.print_progress = tqdm_print_progress

    try:
        yield tqdm_object
    finally:
        joblib.parallel.Parallel.print_progress = original_print_progress
        tqdm_object.close()

You can use this the same way as described by frenzykryger

import time
def some_method(wait_time):
    time.sleep(wait_time)

with tqdm_joblib(tqdm(desc="My method", total=10)) as progress_bar:
    Parallel(n_jobs=2)(delayed(some_method)(0.2) for i in range(10))

Longer explanation:

The solution by Jon is simple to implement, but it only measures dispatched tasks. If a task takes a long time, the bar will be stuck at 100% while waiting for the last dispatched task to finish executing.

The context manager approach by frenzykryger, building on dano and Connor, is better, but the BatchCompletionCallBack can also be called with ImmediateResult before the task completes (see Intermediate results from joblib). This gets us a count that goes over 100%.

Instead of monkey patching the BatchCompletionCallBack, we can just patch the print_progress function of Parallel, which the BatchCompletionCallBack calls anyway. If verbose is set (i.e. Parallel(n_jobs=2, verbose=100)), print_progress will print out completed tasks, though not as nicely as tqdm. Looking at the code, print_progress is a method on Parallel, so self.n_completed_tasks already logs the number we want. All we have to do is compare it with the current state of the tqdm bar and update only if there is a difference.

This was tested in joblib 0.14.0 and tqdm 4.46.0 using python 3.5.

Vincenza answered 8/5, 2020 at 22:32 Comment(0)

Text progress bar

One more variant for those who want a text progress bar without additional modules like tqdm. Tested with joblib 0.11 and Python 3.5.2 on Linux on 16.04.2018; it shows progress upon subtask completion.

Redefine native class:

import time
import joblib.parallel

class BatchCompletionCallBack(object):
    # Added code - start
    # relies on the module-level total_n_jobs constant defined below
    # Added code - end
    def __init__(self, dispatch_timestamp, batch_size, parallel):
        self.dispatch_timestamp = dispatch_timestamp
        self.batch_size = batch_size
        self.parallel = parallel

    def __call__(self, out):
        self.parallel.n_completed_tasks += self.batch_size
        this_batch_duration = time.time() - self.dispatch_timestamp

        self.parallel._backend.batch_completed(self.batch_size,
                                               this_batch_duration)
        self.parallel.print_progress()
        # Added code - start
        progress = self.parallel.n_completed_tasks / total_n_jobs
        print(
            "\rProgress: [{0:50s}] {1:.1f}%".format('#' * int(progress * 50), progress * 100),
            end="", flush=True)
        if self.parallel.n_completed_tasks == total_n_jobs:
            print('\n')
        # Added code - end
        if self.parallel._original_iterator is not None:
            self.parallel.dispatch_next()

joblib.parallel.BatchCompletionCallBack = BatchCompletionCallBack

Define global constant before usage with total number of jobs:

total_n_jobs = 10

This will result in something like this:

Progress: [########################################          ] 80.0%
Uprising answered 16/4, 2018 at 13:13 Comment(1)
Works great. If you want to print a time estimate too, you can adapt __call__ with time_remaining = (this_batch_duration / self.batch_size) * (total_n_jobs - self.parallel.n_completed_tasks) and then add something like "est {:.1f} mins left".format(time_remaining / 60) to the progress print.Hugmetight

Here's another answer to your question with the following syntax:

aprun = ParallelExecutor(n_jobs=5)

a1 = aprun(total=25)(delayed(func)(i ** 2 + j) for i in range(5) for j in range(5))
a2 = aprun(total=16)(delayed(func)(i ** 2 + j) for i in range(4) for j in range(4))
a2 = aprun(bar='txt')(delayed(func)(i ** 2 + j) for i in range(4) for j in range(4))
a2 = aprun(bar=None)(delayed(func)(i ** 2 + j) for i in range(4) for j in range(4))

https://mcmap.net/q/269483/-how-can-we-use-tqdm-in-a-parallel-execution-with-joblib
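
The linked answer implements ParallelExecutor as a small factory around Parallel and tqdm. A minimal sketch of the idea (the full version at the link also supports the txt bar shown above):

import joblib
from tqdm import tqdm

def ParallelExecutor(**joblib_args):
    def aprun(bar='tqdm', **tqdm_args):
        def run(op_iter):
            if bar == 'tqdm':
                # wrap the task iterator in tqdm; note this tracks
                # dispatch, not completion, like the plain tqdm answer
                op_iter = tqdm(op_iter, **tqdm_args)
            # bar=None falls through and runs without a progress bar
            return joblib.Parallel(**joblib_args)(op_iter)
        return run
    return aprun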

Sabu answered 4/11, 2016 at 5:35 Comment(0)

In Jupyter, plain tqdm starts a new output line each time it updates, so for Jupyter Notebook use tqdm.notebook instead.

Without sleeps:

from joblib import Parallel, delayed
from tqdm import notebook

def myfun(x):
    return x**2

results = Parallel(n_jobs=8)(delayed(myfun)(i) for i in notebook.tqdm(range(1000)))  

100% 1000/1000 [00:06<00:00, 143.70it/s]

With time.sleep:

from joblib import Parallel, delayed
from tqdm import notebook
from random import randint
import time

def myfun(x):
    time.sleep(randint(1, 5))
    return x**2

results = Parallel(n_jobs=7)(delayed(myfun)(i) for i in notebook.tqdm(range(100)))

What I'm currently using instead of joblib.Parallel:

import concurrent.futures
from tqdm import notebook
from random import randint
import time

iterable = [i for i in range(50)]

def myfun(x):
    time.sleep(randint(1, 5))
    return x**2

def run(func, iterable, max_workers=8):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(notebook.tqdm(executor.map(func, iterable), total=len(iterable)))
    return results

run(myfun, iterable)
Blakely answered 31/5, 2019 at 10:21 Comment(4)
Wrong, this only counts the job start times which will be immediate no matter what function you are wrapping.Mcneil
How can it be wrong if it's from the official documentation? joblib.readthedocs.io/en/latest Ctrl+F for "Parallel(n_jobs=1)" And my answer was about running tqdm in Jupyter notebook. It is almost the same as the accepted one. The only difference is that it is intended for use in Jupyter notebook.Delphadelphi
I think I got it. Looks like you're right.Delphadelphi
However, it is not instant in Jupyter notebook. For example, 14% 14/100 [00:05<00:31, 2.77it/s] It takes time to complete with random time sleeps.Delphadelphi

Setting verbose=13 was enough for me: https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html
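
For example:

from joblib import Parallel, delayed
results = Parallel(n_jobs=16, verbose=13)(delayed(pow)(i, 2) for i in range(1000))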

I get a line on stderr that says something like:

[Parallel(n_jobs=16)]: Done 134 tasks      | elapsed:  7.7min
Bes answered 1/12, 2022 at 15:22 Comment(0)

import joblib
from tqdm import tqdm

class ProgressParallel(joblib.Parallel):
    def __init__(self, n_total_tasks=None, **kwargs):
        super().__init__(**kwargs)
        self.n_total_tasks = n_total_tasks

    def __call__(self, *args, **kwargs):
        # create the progress bar for the duration of the parallel run
        with tqdm() as self._pbar:
            return joblib.Parallel.__call__(self, *args, **kwargs)

    def print_progress(self):
        # joblib calls print_progress after each completed batch;
        # override it to refresh the tqdm bar instead
        if self.n_total_tasks:
            self._pbar.total = self.n_total_tasks
        else:
            # total unknown up front: show the number dispatched so far
            self._pbar.total = self.n_dispatched_tasks
        self._pbar.n = self.n_completed_tasks
        self._pbar.refresh()
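
A short usage sketch (n_total_tasks is this subclass's own argument, not joblib's; everything else is passed through to Parallel):

from math import sqrt
from joblib import delayed

results = ProgressParallel(n_jobs=2, n_total_tasks=10)(
    delayed(sqrt)(i**2) for i in range(10)
)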
Lorola answered 16/3, 2023 at 23:17 Comment(1)
Would you mind adding a bit of explanation to your code?Shuster
