Can one use Delta Lake without being dependent on the Databricks Runtime? (I mean, is it possible to use Delta Lake with HDFS and Spark on-prem only?) If not, could you elaborate on why that is, from a technical point of view?
Yes, Delta Lake has been open-sourced by Databricks (https://delta.io/). I am using Delta Lake (0.6.1) along with Apache Spark (2.4.5) and S3. Many other integrations are also available to fit an existing tech stack, e.g. Hive, Presto, and Athena.
Connectors: https://github.com/delta-io/connectors
Integrations: https://docs.delta.io/latest/presto-integration.html and https://docs.delta.io/latest/integrations.html
According to this talk (https://vimeo.com/338100834), it is possible to use Delta Lake without the Databricks Runtime. Delta Lake is just a library that "knows" how to write to and read from a table (a collection of parquet files) transactionally, by maintaining a special transaction log next to each table. Of course, a dedicated connector is needed for external applications (e.g. Hive) to work with such tables; otherwise, the transactional and consistency guarantees cannot be enforced.
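To make the "just a library plus a transaction log" point concrete, here is a small sketch (the table path is hypothetical and assumes a Delta table already written by any of the methods below) that lists what actually sits on disk: plain parquet data files and a _delta_log directory of JSON commit files. No server or Databricks service is involved.

import json
import os

# Hypothetical path to an existing Delta table on a local or mounted filesystem.
table_path = "path/to/delta-tables/table1"

# The data is ordinary parquet files...
print([f for f in os.listdir(table_path) if f.endswith(".parquet")])

# ...and the transactional state is a folder of newline-delimited JSON commits.
log_path = os.path.join(table_path, "_delta_log")
for commit in sorted(f for f in os.listdir(log_path) if f.endswith(".json")):
    with open(os.path.join(log_path, commit)) as fh:
        for line in fh:
            # Each line is one action: commitInfo, protocol, metaData, add, remove, ...
            print(commit, list(json.loads(line).keys()))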
Now you can use delta-rs to read and write to Delta Lake directly.
It supports Rust and Python. Here is a Python example.
You can install it with pip install deltalake or conda install -c conda-forge deltalake.
import pandas as pd
from deltalake.writer import write_deltalake

# Write a pandas DataFrame as a Delta table on the local filesystem.
df = pd.DataFrame({"x": [1, 2, 3]})
write_deltalake("path/to/delta-tables/table1", df)
Writing to S3
storage_options = {
    "AWS_DEFAULT_REGION": "us-west-2",
    "AWS_ACCESS_KEY_ID": "xxx",
    "AWS_SECRET_ACCESS_KEY": "xxx",
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
}
write_deltalake(
    "s3a://my-bucket/delta-tables/table1",
    df,
    mode="append",
    storage_options=storage_options,
)
To drop AWS_S3_ALLOW_UNSAFE_RENAME and still write concurrently, a DynamoDB lock is needed. Follow this GitHub ticket for updates on how to set it up correctly.
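As a rough sketch of what the locked setup looks like, the storage options below swap the unsafe-rename flag for a DynamoDB-based lock. The option names (AWS_S3_LOCKING_PROVIDER, DELTA_DYNAMO_TABLE_NAME) and the lock table name are assumptions based on the delta-rs documentation and may differ between versions, so verify them against the ticket above.

# Assumed option names; check the delta-rs docs for your version.
storage_options = {
    "AWS_DEFAULT_REGION": "us-west-2",
    "AWS_ACCESS_KEY_ID": "xxx",
    "AWS_SECRET_ACCESS_KEY": "xxx",
    "AWS_S3_LOCKING_PROVIDER": "dynamodb",       # use a DynamoDB lock instead of unsafe rename
    "DELTA_DYNAMO_TABLE_NAME": "delta_rs_lock",  # DynamoDB table backing the lock (hypothetical name)
}
write_deltalake(
    "s3a://my-bucket/delta-tables/table1",
    df,
    mode="append",
    storage_options=storage_options,
)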
According to the documentation (https://docs.delta.io/latest/quick-start.html#set-up-apache-spark-with-delta-lake), Delta Lake has been open-sourced for use with Apache Spark. The integration is done by adding the Delta Lake jar to your application or by adding the library to the Spark installation path. Hive integration is available via https://github.com/delta-io/connectors.
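For illustration, here is a minimal PySpark sketch of that setup on a plain, on-prem Spark installation. The delta-core coordinates are placeholders (pick the build matching your Spark and Scala versions), and the two SQL-extension settings apply to Delta 0.7.0+ on Spark 3.x.

from pyspark.sql import SparkSession

# Plain Apache Spark plus the open-source Delta Lake package; no Databricks Runtime.
spark = (
    SparkSession.builder
    .appName("delta-on-prem")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.1")  # placeholder version
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Write and read a Delta table on HDFS exactly as you would on Databricks.
spark.range(5).write.format("delta").mode("overwrite").save("hdfs:///tmp/delta/table1")
spark.read.format("delta").load("hdfs:///tmp/delta/table1").show()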