I want to use Dask on Databricks. It should be possible (I can't see why not). If I import it, one of two things happens: either I get an ImportError, or, when I install distributed to solve that, Databricks just says "Cancelled" without throwing any errors.
There is now a dask-databricks package from the Dask community which makes running Dask clusters alongside Spark/Photon on multi-node Databricks quick to set up. This way you can run one cluster and then use either framework on the same infrastructure.
You create an init script that installs dask-databricks and uses a Dask CLI command to start the Dask cluster components.
#!/bin/bash
# Install Dask + Dask Databricks
/databricks/python/bin/pip install --upgrade dask[complete] dask-databricks
# Start Dask cluster components
dask databricks run
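For this to take effect, you would attach the script as a cluster-scoped init script under the cluster's advanced options, so it runs on every node; dask databricks run then starts the scheduler on the driver and workers on the other nodes (the exact UI steps may vary between Databricks versions).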
Then, in your Databricks notebook, you can get a Dask client object using the dask_databricks.get_client() utility.
import dask_databricks
client = dask_databricks.get_client()
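Once the client is connected, ordinary Dask work runs on the cluster. As a minimal sketch (the array computation here is just an illustration, not part of dask-databricks itself):

import dask.array as da

# Build a lazy, chunked random array; dask[complete] includes dask.array
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# With the client above active, .compute() dispatches work to the Dask cluster
print(x.mean().compute())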
It also sets up access to the Dask dashboard via the Databricks web proxy.
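If you just want the URL, the client object reports a dashboard address; whether it is the proxied URL in this setup is an assumption on my part.

# Prints the address of the Dask dashboard
print(client.dashboard_link)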
I don't think we have heard of anyone using Dask under Databricks, but so long as it's just Python, it may well be possible.
The default scheduler for Dask is threads, and this is the most likely thing to work. In this case you don't even need to install distributed.
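For example, plain delayed work computes on the threaded scheduler with nothing but dask installed; a minimal sketch:

import dask

@dask.delayed
def inc(x):
    return x + 1

# No distributed client anywhere: this runs on the default threaded scheduler
total = dask.delayed(sum)([inc(i) for i in range(10)]).compute()
print(total)  # 55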
For the Cancelled error, it sounds like you are using distributed and, at a guess, the system is not allowing you to start extra processes (you could test this with the subprocess module).
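That test could be as simple as this sketch: if process creation is blocked, the call should raise rather than print.

import subprocess

# Try to spawn a trivial child process
print(subprocess.run(["echo", "ok"], capture_output=True, text=True).stdout)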
To work around, you could do

from dask.distributed import Client

client = Client(processes=False)
Of course, if it is indeed the processes that you need, this would not be great. Also, I have no idea how you might expose the dashboard's port.
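If the in-process client does start, a quick smoke test (the task here is just an illustration) confirms it is usable, and client.dashboard_link at least tells you where the dashboard is listening locally:

# Submit a trivial task to the threads-backed client
future = client.submit(lambda x: x + 1, 10)
print(future.result())  # 11
print(client.dashboard_link)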