What is the correct way to install the delta module in Python?

In the example they import the module

from delta.tables import *

but I did not find the correct way to install the module in my virtual env.

Currently I am using this Spark param:

"spark.jars.packages": "io.delta:delta-core_2.11:0.5.0"

Querist answered 17/12, 2019 at 11:37 Comment(1)
See my answer on how to do this with Delta 1.2 & PySpark 3.2. The other answers are outdated.Woofer

Because Delta's Python code is stored inside a jar and loaded by Spark, the delta module cannot be imported until the SparkSession/SparkContext has been created.
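
For illustration, a minimal sketch of that ordering in a plain Python script, assuming the script is launched with the Delta jar on the classpath (e.g. spark-submit --packages io.delta:delta-core_2.11:0.5.0 script.py; the coordinates are just an example and must match your Spark/Scala build):

from pyspark.sql import SparkSession

# The jar must already be on the classpath, supplied via --packages at launch time
spark = SparkSession.builder.appName("delta-demo").getOrCreate()

# Only after the session (and its JVM) exists can the module shipped inside the jar be imported
from delta.tables import *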

Usm answered 19/12, 2019 at 11:37 Comment(9)
I created a SparkSession, but still get that error. Do you have code that works?Leannleanna
I am not 100% sure, but I don't think from delta.tables import * will work outside of a Databricks Runtime. You can however use delta tables, just not specific delta table utilities.Smidgen
How did you start pyspark? If you run a command like pyspark --packages io.delta:delta-core_2.11:0.5.0 ..., it should work.Usm
started python then SparkSession.builder.config("spark.jars.packages",'io.delta:delta-core_2.11:0.6.1').config("spark.delta.logStore.class","org.apache.spark.sql.delta.storage.S3SingleDriverLogStore").config("spark.sql.extensions","io.delta.sql.DeltaSparkSessionExtension").config("spark.sql.catalog.spark_catalog","org.apache.spark.sql.delta.catalog.DeltaCatalog").getOrCreate() Reading and writing Delta Tables works, from delta.tables import * does not. However, it does when I start the pyspark REPL as you do. - I'll have to figure this out.Smidgen
Now from delta.tables import * is working both from a SparkSession started inside plain python, and from spark-submit --properties-file /path/to/my/spark-defaults.conf with spark.jars.packages io.delta:delta-core_2.11:0.6.1 in the .conf file. I have no idea what the issue was before.Smidgen
spark.jars.packages is handled by org.apache.spark.deploy.SparkSubmitArguments/SparkSubmit, so it must be passed as an argument of spark-submit. By the time SparkSession.builder.config is called, SparkSubmit has already done its job, so spark.jars.packages is a no-op at that point.Usm
issues.apache.org/jira/browse/SPARK-21752 looks like there is already a ticket for this.Usm
@Querist This works for me with delta.io, outside of Databricks. spark = SparkSession \ .builder \ .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0") \ .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \ .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \ .enableHiveSupport() \ .getOrCreate()Twinned
Just follow the official documentation, docs.delta.io/latest/quick-start.html#python, there is a nice example of how to run it in Python. You need to import after Spark initialization, that is all!Consort

As the correct answer is hidden in the comments of the accepted solution, I thought I'd add it here.

You need to create your Spark session with some extra settings, and then you can import delta:

from pyspark.sql import SparkSession

spark_session = SparkSession.builder \
    .master("local") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.8.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

from delta.tables import *
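
As a quick sanity check that the import works, you can round-trip a tiny table through the DeltaTable API (a sketch; the /tmp path is just a placeholder):

# Write a small Delta table, then load it back through the DeltaTable helper
spark_session.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-check")

delta_table = DeltaTable.forPath(spark_session, "/tmp/delta-check")
delta_table.toDF().show()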

Annoyingly, your IDE will of course complain about this, since the package isn't installed, and you will also be operating without autocomplete and type hints. I'm sure there's a workaround and I will update if I come across it.

The package itself is in their GitHub repo (github.com/delta-io/delta) and the readme suggests you can pip install it, but that doesn't work. In theory you could clone it and install it manually.

Chole answered 19/5, 2021 at 13:58 Comment(3)
Did you find the workaround for autocomplete?Mythicize
No, I was just hacking and had to put this down. Theoretically you can go grab the package from their github (link in answer) and then install it but there's not a setup.py so that's not a simple task. An alternative (and hacky) solution may be to just pull the tables code (github.com/delta-io/delta/blob/master/python/delta/tables.py) and put it in your app.Chole
Is there any way of installing the module with a package manager like Poetry?Basilicata

To run Delta locally with PySpark, you need to follow the official documentation.

This works for me, but only when executing the script directly (python <script_file>), not with pytest or unittest.

To solve this problem, you need to add this environment variable:

PYSPARK_SUBMIT_ARGS='--packages io.delta:delta-core_2.12:1.0.0 pyspark-shell'

Use the Scala and Delta versions that match your case. With this environment variable set, I can run pytest or unittest from the CLI without any problem:

from unittest import TestCase

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession


class TestClass(TestCase):
    
    builder = SparkSession.builder.appName("MyApp") \
        .master("local[*]") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    def test_create_delta_table(self):
        self.spark.sql("""CREATE TABLE IF NOT EXISTS <tableName> (
                          <field1> <type1>)
                          USING DELTA""")

The configure_spark_with_delta_pip function appends a config option to the builder object, roughly equivalent to:

.config("spark.jars.packages", "io.delta:delta-core_<scala_version>:<delta_version>")
Sparing answered 3/8, 2021 at 9:9 Comment(0)

Here's how you can install Delta Lake & PySpark with conda.

  • Make sure you have Java installed (I use SDKMAN to manage multiple Java versions)
  • Install Miniconda
  • Pick Delta Lake & PySpark versions that are compatible. For example, Delta Lake 1.2 is compatible with PySpark 3.2.
  • Create a YAML file with the required dependencies; here is an example from the delta-examples repo I created.
  • Create the environment with a command like conda env create -f envs/mr-delta.yml
  • Activate the conda environment with conda activate mr-delta
  • Here is an example notebook. Note that it starts with the following code:
import pyspark
from delta import *

builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()
Woofer answered 1/6, 2022 at 2:16 Comment(0)

Just install the libraries:

!pip install pyspark
!pip install delta-spark

And then use them as you want:

from pyspark.sql import SparkSession
import os

# Must be set before getOrCreate() launches the JVM
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.12:3.4.1,io.delta:delta-core_2.12:2.4.0 pyspark-shell'

# spark.jars.packages is not set here: configure_spark_with_delta_pip below
# fills it in with a Delta version matching the installed delta-spark package
builder = SparkSession.builder.appName("Basics").master("local") \
            .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
            .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
            .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
            .config("spark.hadoop.fs.s3a.path.style.access", "true") \
            .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false") \
            .config("spark.databricks.delta.retentionDurationCheck.enabled", "false") \
            .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

my_packages = ["org.apache.hadoop:hadoop-aws:3.3.4",
               "org.apache.hadoop:hadoop-client-runtime:3.3.4",
               "org.apache.hadoop:hadoop-client-api:3.3.4",
               "io.delta:delta-contribs_2.12:3.0.0",
               "io.delta:delta-hive_2.12:3.0.0",
               "com.amazonaws:aws-java-sdk-bundle:1.12.603",
               ]

from delta import *
# Create a Spark instance with the builder
# As a result, you now can read and write Delta tables
spark = configure_spark_with_delta_pip(builder, extra_packages=my_packages).getOrCreate()
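
Once that session is up, reading and writing Delta tables works as usual. For example (a sketch; the paths are placeholders, and an s3a:// path additionally requires valid AWS credentials for the S3A settings above):

# Write a small Delta table locally and read it back; swap in an
# s3a://bucket/path if your AWS credentials are configured.
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-demo")
spark.read.format("delta").load("/tmp/delta-demo").show()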
Ingres answered 28/7 at 1:25 Comment(0)

If you are facing issues in a Jupyter notebook, add the environment variable below:

from pyspark.sql import SparkSession
import os
from delta import *

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.12:3.4.1,io.delta:delta-core_2.12:2.4.0 pyspark-shell'
# Equivalent shell invocations, for reference:
#   spark-shell --packages org.apache.spark:spark-avro_2.12:3.4.1
#   spark-shell --packages io.delta:delta-core_2.12:2.4.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

builder = SparkSession.builder.appName("SampleSpark") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = builder.getOrCreate()
Ventilate answered 3/7, 2023 at 11:6 Comment(0)

I was trying to pip install delta-spark in an environment created with python -m venv, and Pylance wasn't able to find the delta package when trying to import from delta.tables import *.

Changing from venv to virtualenv solved my problem. So just pip install virtualenv, create a new environment, and then run pip install delta-spark.

Civism answered 11/6 at 10:11 Comment(0)

In my case, the issue was that I had a cluster running on a Databricks Runtime lower than 6.1:

https://docs.databricks.com/delta/delta-update.html

The Python API is available in Databricks Runtime 6.1 and above.

After changing the Databricks Runtime to 6.4, the problem disappeared.

To do that: click Clusters -> pick the one you are using -> Edit -> select Databricks Runtime 6.1 or above.

Oman answered 20/3, 2020 at 11:13 Comment(3)
Thank you for the answer, but I guess the question was related to "pure" Python without DatabricksConsort
@Consort No it wasn't, it has a "databricks" tagOman
I guess the databricks tag is there by mistake and should be removed; Delta Lake is configured in Databricks out of the box - docs.databricks.com/delta/intro-notebooks.html. You need to tamper with spark.jars.packages when you are setting up Spark on your local machine, for instance.Consort
