Can one use Delta Lake without being dependent on the Databricks Runtime? (I mean, is it possible to use Delta Lake with HDFS and Spark on-prem only?) If not, could you elaborate on why that is, from a technical point of view?
Yes, Delta Lake has been open-sourced by Databricks (https://delta.io/). I am using Delta Lake (0.6.1) along with Apache Spark (2.4.5) and S3. Many other integrations are also available to fit an existing tech stack, e.g. Hive, Presto, and Athena.
Connectors: https://github.com/delta-io/connectors
Integrations: https://docs.delta.io/latest/presto-integration.html and https://docs.delta.io/latest/integrations.html
According to this talk (https://vimeo.com/338100834), it is possible to use Delta Lake without the Databricks Runtime. Delta Lake is just a library that "knows" how to write to and read from a table (a collection of parquet files) transactionally, by maintaining a special transaction log next to each table. Of course, a dedicated connector is needed for external applications (e.g. Hive) to work with such tables; otherwise, the transactional and consistency guarantees cannot be enforced.
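To make the "just a library plus a transaction log" point concrete, here is a small sketch (the table path is hypothetical and assumes a Delta table already written by any of the methods below) that lists what actually sits on disk: plain parquet data files and a _delta_log directory of JSON commit files. No server or Databricks service is involved.

import json
import os

# Hypothetical path to an existing Delta table on a local or mounted filesystem.
table_path = "path/to/delta-tables/table1"

# The data is ordinary parquet files...
print([f for f in os.listdir(table_path) if f.endswith(".parquet")])

# ...and the transactional state is a folder of newline-delimited JSON commits.
log_path = os.path.join(table_path, "_delta_log")
for commit in sorted(f for f in os.listdir(log_path) if f.endswith(".json")):
    with open(os.path.join(log_path, commit)) as fh:
        for line in fh:
            # Each line is one action: commitInfo, protocol, metaData, add, remove, ...
            print(commit, list(json.loads(line).keys()))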
Now you can use delta-rs to read and write to Delta Lake directly.
It supports Rust and Python. Here is a Python example.
You can install it with pip install deltalake or conda install -c conda-forge deltalake.
import pandas as pd
from deltalake.writer import write_deltalake

# Write a pandas DataFrame as a Delta table on the local filesystem.
df = pd.DataFrame({"x": [1, 2, 3]})
write_deltalake("path/to/delta-tables/table1", df)
Writing to S3
storage_options = {
    "AWS_DEFAULT_REGION": "us-west-2",
    "AWS_ACCESS_KEY_ID": "xxx",
    "AWS_SECRET_ACCESS_KEY": "xxx",
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
}
write_deltalake(
    "s3a://my-bucket/delta-tables/table1",
    df,
    mode="append",
    storage_options=storage_options,
)
To drop AWS_S3_ALLOW_UNSAFE_RENAME and still write concurrently, a DynamoDB lock is needed. Follow this GitHub ticket for updates on how to set it up correctly.
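As a rough sketch of what the locked setup looks like, the storage options below swap the unsafe-rename flag for a DynamoDB-based lock. The option names (AWS_S3_LOCKING_PROVIDER, DELTA_DYNAMO_TABLE_NAME) and the lock table name are assumptions based on the delta-rs documentation and may differ between versions, so verify them against the ticket above.

# Assumed option names; check the delta-rs docs for your version.
storage_options = {
    "AWS_DEFAULT_REGION": "us-west-2",
    "AWS_ACCESS_KEY_ID": "xxx",
    "AWS_SECRET_ACCESS_KEY": "xxx",
    "AWS_S3_LOCKING_PROVIDER": "dynamodb",       # use a DynamoDB lock instead of unsafe rename
    "DELTA_DYNAMO_TABLE_NAME": "delta_rs_lock",  # DynamoDB table backing the lock (hypothetical name)
}
write_deltalake(
    "s3a://my-bucket/delta-tables/table1",
    df,
    mode="append",
    storage_options=storage_options,
)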
According to the documentation (https://docs.delta.io/latest/quick-start.html#set-up-apache-spark-with-delta-lake), Delta Lake has been open-sourced for use with Apache Spark. The integration is done by adding the Delta Lake jar to your application or by adding the library to the Spark installation path. Hive integration is available via https://github.com/delta-io/connectors.
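For illustration, here is a minimal PySpark sketch of that setup on a plain, on-prem Spark installation. The delta-core coordinates are placeholders (pick the build matching your Spark and Scala versions), and the two SQL-extension settings apply to Delta 0.7.0+ on Spark 3.x.

from pyspark.sql import SparkSession

# Plain Apache Spark plus the open-source Delta Lake package; no Databricks Runtime.
spark = (
    SparkSession.builder
    .appName("delta-on-prem")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.1")  # placeholder version
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Write and read a Delta table on HDFS exactly as you would on Databricks.
spark.range(5).write.format("delta").mode("overwrite").save("hdfs:///tmp/delta/table1")
spark.read.format("delta").load("hdfs:///tmp/delta/table1").show()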