I have been exploring the data lakehouse concept and Delta Lake. Some of its features seem really interesting. Right on the project home page https://delta.io/ there is a diagram showing Delta Lake running on "your existing data lake" without any mention of Spark, yet elsewhere it suggests that Delta Lake indeed runs on top of Spark. So my question is: can it be run independently of Spark? Can I, for example, set up Delta Lake with S3 buckets for storage in Parquet format, schema validation, etc., without using Spark in my architecture?
You might keep an eye on this: https://github.com/delta-io/delta-rs
It's early and currently read-only, but worth watching as the project evolves.
Currently, you can use delta-rs to read from and write to Delta Lake directly.
It supports Rust and Python. Here is an example using Python.
You can install it with pip install deltalake
or conda install -c conda-forge deltalake.
import pandas as pd
from deltalake.writer import write_deltalake

# Create a small pandas DataFrame and write it out as a local Delta table.
df = pd.DataFrame({"x": [1, 2, 3]})
write_deltalake("path/to/delta-tables/table1", df)
Writing to S3
# Credentials and region for the target bucket; AWS_S3_ALLOW_UNSAFE_RENAME is
# required unless a locking provider is configured (see below).
storage_options = {
    "AWS_DEFAULT_REGION": "us-west-2",
    "AWS_ACCESS_KEY_ID": "xxx",
    "AWS_SECRET_ACCESS_KEY": "xxx",
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
}

write_deltalake(
    "s3a://my-bucket/delta-tables/table1",
    df,
    mode="append",
    storage_options=storage_options,
)
To remove AWS_S3_ALLOW_UNSAFE_RENAME and write concurrently, you need a DynamoDB-based lock. Follow the corresponding GitHub issue in the delta-rs repository for updates on how to set it up correctly.
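As a rough sketch, the locking provider is enabled through extra storage options. The exact option names (in particular the DynamoDB table name key) have changed between delta-rs releases, so treat the values below as illustrative and check the issue/docs for your version; the table name delta_log_lock is a hypothetical placeholder.

# Illustrative only: option names vary across delta-rs versions.
storage_options = {
    "AWS_DEFAULT_REGION": "us-west-2",
    "AWS_ACCESS_KEY_ID": "xxx",
    "AWS_SECRET_ACCESS_KEY": "xxx",
    # Use a DynamoDB table as the commit lock instead of unsafe renames.
    "AWS_S3_LOCKING_PROVIDER": "dynamodb",
    "DELTA_DYNAMO_TABLE_NAME": "delta_log_lock",  # hypothetical table name
}

write_deltalake(
    "s3a://my-bucket/delta-tables/table1",
    df,
    mode="append",
    storage_options=storage_options,
)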
Yes, this is absolutely possible. We built a scalable data backend using this approach with Delta Lake, the AWS Glue Data Catalog, Amazon S3, and Amazon Athena. Amazon Athena can be used to query the data instead of Apache Spark.
Please refer to this blog, which explains the setup in detail.
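For illustration, here is a minimal sketch of querying such a table from Python with boto3's Athena client; the database analytics, table events, and results bucket are hypothetical placeholders, and the table is assumed to already be registered in the Glue Data Catalog.

import time
import boto3

athena = boto3.client("athena", region_name="us-west-2")

# Submit a query against the Glue-catalogued table.
query = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM events",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
execution_id = query["QueryExecutionId"]

# Poll until the query finishes, then fetch the result set.
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    print(rows)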
tl;dr No.
Delta Lake up to and including 0.8.0 is tightly integrated with Apache Spark, so it is not possible to use Delta Lake without Spark.