The question is simple:
master_dim.py calls dim_1.py and dim_2.py to execute in parallel. Is this possible in Databricks PySpark?
The image below explains what I am trying to do; it errors for some reason. Am I missing something here?
Just for others, in case they want to see how it worked:
from multiprocessing.pool import ThreadPool
pool = ThreadPool(5)
notebooks = ['dim_1', 'dim_2']
pool.map(lambda path: dbutils.notebook.run("/Test/Threading/" + path, timeout_seconds=60, arguments={"input-data": path}), notebooks)
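As a complement, here is a minimal sketch of what the child notebooks (dim_1, dim_2) could look like on their side; this is an assumption about their contents, not the asker's actual code. The widget name "input-data" matches the key in the arguments dict above, and the dbutils.notebook.exit call is optional but lets pool.map collect the return values:

# Inside dim_1 / dim_2 (runs on Databricks, so dbutils is available as a builtin).
# The arguments dict passed to dbutils.notebook.run arrives as widget values.
input_data = dbutils.widgets.get("input-data")

# ... build the dimension table here ...

# Optionally return a value to the caller; the parent's pool.map
# collects these return values into a list.
dbutils.notebook.exit("finished " + input_data)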
Your problem is that you're passing only Test/ as the first argument to dbutils.notebook.run (the name of the notebook to execute), but you don't have a notebook with that name. You need to either change the list of paths from ['Threading/dim_1', 'Threading/dim_2'] to ['dim_1', 'dim_2'] and replace dbutils.notebook.run('Test/', ...) with dbutils.notebook.run(path, ...), or change dbutils.notebook.run('Test/', ...) to dbutils.notebook.run('/Test/' + path, ...).
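Concretely, the two fixes would look roughly like this (a sketch assuming the layout implied by the accepted solution, with the notebooks under /Test/Threading/ and the code running inside a Databricks notebook where dbutils is available):

from multiprocessing.pool import ThreadPool

pool = ThreadPool(5)

# Option 1: plain notebook names, passed as relative paths
# (resolved against the folder of the calling notebook).
notebooks = ['dim_1', 'dim_2']
pool.map(lambda path: dbutils.notebook.run(path, timeout_seconds=60,
                                           arguments={"input-data": path}),
         notebooks)

# Option 2: keep the Threading/ prefix and build an absolute path.
notebooks = ['Threading/dim_1', 'Threading/dim_2']
pool.map(lambda path: dbutils.notebook.run('/Test/' + path, timeout_seconds=60,
                                           arguments={"input-data": path}),
         notebooks)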
Databricks now has workflows (multi-task jobs). Your master_dim can call the other notebooks as tasks that execute in parallel after it finishes, passing task-value parameters to dim_1, dim_2, etc.
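If you go that route, here is a quick sketch of the task-values API that tasks can use to hand parameters to each other; the key and task names ("load_date", "master_dim") are purely illustrative:

# In the master_dim task: publish a value for downstream tasks.
dbutils.jobs.taskValues.set(key="load_date", value="2022-01-31")

# In the dim_1 / dim_2 tasks: read it back.
# taskKey must match the name of the upstream task that set the value;
# debugValue is what you get when running the notebook interactively.
load_date = dbutils.jobs.taskValues.get(taskKey="master_dim",
                                        key="load_date",
                                        default="1970-01-01",
                                        debugValue="2022-01-31")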
Better to use a relative path; in this case it's easier to move the project into a new folder, etc. If the path isn't absolute, then it's treated as relative to the current notebook. – Gallo