How to use Dask on Databricks

I want to use Dask on Databricks. It should be possible (I cannot see why not). If I import it, one of two things happens: either I get an ImportError, or, when I install distributed to solve that, Databricks just says Cancelled without throwing any errors.

Desorb asked 4/6, 2019 at 12:53

There is now a dask-databricks package from the Dask community, which makes it quick to set up Dask clusters alongside Spark/Photon on multi-node Databricks. This way you can run one cluster and then use either framework on the same infrastructure.

You create an init script that installs dask-databricks and uses a Dask CLI command to start the Dask cluster components.

#!/bin/bash

# Install Dask + Dask Databricks
/databricks/python/bin/pip install --upgrade dask[complete] dask-databricks

# Start Dask cluster components
dask databricks run

Then in your Databricks Notebook you can get a Dask client object using the dask_databricks.get_client() utility.

import dask_databricks

client = dask_databricks.get_client()
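As a quick smoke test (a minimal sketch using dask.array; the array shape and chunking here are arbitrary), you can confirm that work is being scheduled on the cluster:

import dask.array as da

# Build a chunked random array and compute its mean on the cluster
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
print(x.mean().compute())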

The dask-databricks package also sets up access to the Dask dashboard via the Databricks web proxy.
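For example, assuming the returned client behaves like a standard distributed.Client, you can print the proxied dashboard address directly (the exact URL depends on your workspace):

# Print the (proxied) dashboard URL for this cluster
print(client.dashboard_link)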

Gariepy answered 14/12, 2023 at 12:01

I don't think we have heard of anyone using Dask under Databricks, but so long as it's just Python, it may well be possible.

The default scheduler for Dask is the threaded scheduler, which is the most likely to work here. In this case you don't even need to install distributed.
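For instance (a minimal sketch; the file path and column names are hypothetical), a dask.dataframe computation runs on the threaded scheduler by default:

import dask.dataframe as dd

# Reads lazily and computes with the default threaded scheduler;
# no distributed installation is required.
df = dd.read_csv("/dbfs/tmp/data-*.csv")  # hypothetical path
print(df.groupby("key").value.mean().compute())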

For the Cancelled error, it sounds like you are using distributed and, at a guess, the system is not allowing you to start extra processes (you could test this with the subprocess module; see the sketch below). To work around this, you could do

import dask.distributed

client = dask.distributed.Client(processes=False)
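And to check whether spawning processes really is blocked (a minimal sketch using only the standard library), you could run something like:

import subprocess

# If the environment forbids new processes, this raises an error
# instead of printing the command's output.
print(subprocess.run(["echo", "hello"], capture_output=True, text=True).stdout)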

Of course, if it is indeed the processes that you need, this would not be great. Also, I have no idea how you might expose the dashboard's port.

Choli answered 4/6, 2019 at 14:31
This sadly still didn't work. However, this is starting to appear as a genuine limitation of Databricks itself, which is sad because I actually think Dask is the future of distributed computing in Python. – Desorb
Don't tell Databricks that! :) – Choli
Hi SARose, I'm curious as to WHY you want to use Dask on Databricks? i.e. what's the driver here? – Curvilinear
1. For data folks (RS, DS, MLEs), Spark errors and the interplay between Spark and ML libraries are substandard at best. 2. Even with Koalas (pandas_on_spark), the underlying code is Scala/Spark; there are very opaque tasks that the actual user usually has no transparency into when debugging. – Dumont
Coiled's page at coiled.io/blog/spark-vs-dask gives a good explanation of why one might choose Dask over Spark, or vice versa. For me: my team comes from a heavy Python/C++ background; we have compute graphs similar to the one shown on the Coiled page but much more complex again (very far from Databricks and Spark's more data-centric home turf); we want deep control over what we run; and we value that Dask is open-source and "plays well with others". – Yuzik
