Execute multiple notebooks in parallel in pyspark databricks

Question is simple:

master_dim.py calls dim_1.py and dim_2.py to execute in parallel. Is this possible in Databricks PySpark?

The image below shows what I am trying to do; it errors for some reason. Am I missing something here?

[screenshot of the attempted ThreadPool code, not reproduced here]

Heather answered 26/8, 2021 at 11:31 Comment(0)

Just for others, in case they want to know how it worked:

from multiprocessing.pool import ThreadPool

# Run each notebook on its own thread; dbutils.notebook.run blocks until the child notebook finishes
pool = ThreadPool(5)
notebooks = ['dim_1', 'dim_2']
pool.map(lambda path: dbutils.notebook.run("/Test/Threading/" + path, timeout_seconds=60, arguments={"input-data": path}), notebooks)
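
If per-notebook error handling is wanted, a similar sketch with concurrent.futures could look like the following. This is only an illustration under the same assumptions as above (notebooks living under /Test/Threading/ and taking an "input-data" argument); it is not from the original answer.

from concurrent.futures import ThreadPoolExecutor, as_completed

notebooks = ['dim_1', 'dim_2']

def run_notebook(path):
    # dbutils.notebook.run returns whatever string the child notebook passes to dbutils.notebook.exit
    return dbutils.notebook.run("/Test/Threading/" + path,
                                timeout_seconds=60,
                                arguments={"input-data": path})

with ThreadPoolExecutor(max_workers=5) as executor:
    futures = {executor.submit(run_notebook, nb): nb for nb in notebooks}
    for future in as_completed(futures):
        nb = futures[future]
        try:
            print(nb, "returned:", future.result())
        except Exception as exc:
            # A failed or timed-out child notebook surfaces here instead of killing the whole map
            print(nb, "failed:", exc)
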
Heather answered 26/8, 2021 at 23:44 Comment(3)
You can just use path – in this case it's easier to move the project into a new folder, etc. If the path isn't absolute, it's treated as relative to the current notebook. – Gallo
The limitation with this approach is that you can't share dependencies with the parallel jobs. I hope Databricks improves this so we can pass more than just strings to the called notebook. – Papal
I will create a level-2 list and run it after the level-1 list has completed. That gives control. – Heather
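
A rough sketch of that level-by-level idea, under the same assumptions as the answer above (the level-2 notebook name fact_1 here is hypothetical): the second pool.map only starts once every level-1 notebook has returned.

from multiprocessing.pool import ThreadPool

pool = ThreadPool(5)
run = lambda path: dbutils.notebook.run("/Test/Threading/" + path,
                                        timeout_seconds=60,
                                        arguments={"input-data": path})

level_1 = ['dim_1', 'dim_2']   # first wave, run in parallel
level_2 = ['fact_1']           # hypothetical second wave that depends on level 1

pool.map(run, level_1)         # blocks until all level-1 notebooks finish
pool.map(run, level_2)         # then run level 2
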

Your problem is that you're passing only Test/ as the first argument to dbutils.notebook.run (the name of the notebook to execute), but you don't have a notebook with that name.

You either need to modify the list of paths from ['Threading/dim_1', 'Threading/dim_2'] to ['dim_1', 'dim_2'] and replace dbutils.notebook.run('Test/', ...) with dbutils.notebook.run(path, ...),

Or keep the original list and change dbutils.notebook.run('Test/', ...) to dbutils.notebook.run('/Test/' + path, ...)
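
For example, the second option might look like this (a sketch, assuming the same ThreadPool setup and "input-data" argument as in the question):

from multiprocessing.pool import ThreadPool

pool = ThreadPool(5)
notebooks = ['Threading/dim_1', 'Threading/dim_2']
# Prepend '/Test/' so each entry becomes an absolute notebook path
pool.map(lambda path: dbutils.notebook.run('/Test/' + path,
                                           timeout_seconds=60,
                                           arguments={"input-data": path}),
         notebooks)
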

Gallo answered 26/8, 2021 at 12:12 Comment(0)

Databricks now has Workflows (multi-task jobs). Your master_dim task can trigger other tasks to execute in parallel after it finishes, passing task values as parameters to dim_1, dim_2, etc.
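
As a hedged illustration of the task-values part (the task name master_dim comes from the question; the key run_date is an assumption), the dbutils.jobs.taskValues utilities can pass small values between tasks of the same job run:

# In the master_dim task of a multi-task job:
dbutils.jobs.taskValues.set(key="run_date", value="2022-10-02")   # hypothetical key/value

# In a downstream task such as dim_1, within the same job run:
run_date = dbutils.jobs.taskValues.get(taskKey="master_dim",
                                       key="run_date",
                                       default="",
                                       debugValue="2022-10-02")   # debugValue is used when run interactively
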

Sabotage answered 2/10, 2022 at 3:57 Comment(0)
