Apache Spark + Delta Lake concepts

I have several questions related to Spark + Delta.

1) Databricks proposes 3 layers (bronze, silver, gold), but which layer is recommended for machine learning, and why? I suppose they propose having the data clean and ready in the gold layer.

2) If we abstract the concepts of these 3 layers, can we think of the bronze layer as a data lake, the silver layer as databases, and the gold layer as a data warehouse? I mean in terms of functionality.

3) Is the Delta architecture a commercial term, an evolution of the Kappa architecture, or a new trending architecture like the Lambda and Kappa architectures? What are the differences between (Delta + Lambda architecture) and the Kappa architecture?

4) In many cases Delta + Spark scales far beyond most databases, usually at much lower cost, and with the right tuning we can get almost 2x faster query results. I know it is pretty complicated to compare the current trending data warehouses against a feature/aggregate data store, but I would like to know how I can make this comparison.

5) I used to use Kafka, Kinesis, or Event Hub for stream processing, and my question is: what kinds of problems can happen if we replace these tools with a Delta Lake table? (I already know that everything depends on many things, but I would like a general view of it.)

Jew answered 19/5, 2019 at 19:20 Comment(0)

1) Leave it up to your data scientists. They should be comfortable working in the silver and gold regions; some more advanced data scientists will want to go back to the raw data and parse out additional information that may not have been included in the silver/gold tables.

2) Bronze = raw data in native format/Delta Lake format. Silver = sanitized and cleaned data in Delta Lake. Gold = data that is accessed via the Delta lake or pushed to a data warehouse, depending on business requirements.
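
To make that flow concrete, here is a minimal PySpark sketch of a medallion pipeline. All paths and column names (/mnt/raw/events, event_id, amount, customer_id) are hypothetical placeholders, not a prescribed layout:

    # A minimal bronze -> silver -> gold sketch (PySpark + Delta Lake).
    # All paths and column names below are hypothetical placeholders.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Bronze: land the raw data as-is, in Delta format.
    raw = spark.read.json("/mnt/raw/events/")
    raw.write.format("delta").mode("append").save("/mnt/bronze/events")

    # Silver: sanitized and cleaned, still one row per event.
    bronze = spark.read.format("delta").load("/mnt/bronze/events")
    silver = (bronze
              .dropDuplicates(["event_id"])
              .filter(F.col("amount").isNotNull()))
    silver.write.format("delta").mode("overwrite").save("/mnt/silver/events")

    # Gold: business-level aggregates, served from the lake or pushed to a DW.
    gold = silver.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))
    gold.write.format("delta").mode("overwrite").save("/mnt/gold/customer_totals")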

3) The Delta architecture is a simplified version of the Lambda architecture. Delta architecture is a commercial term at this point; we'll see if that changes in the future.

4) Delta Lake + Spark is the most scalable data storage mechanism at a reasonable price. You're welcome to test the performance based on your business requirements. Delta Lake will be far cheaper than any data warehouse for storage; your requirements around data access and latency will be the larger question.

5) Kafka, Kinesis, or Event Hub are sources for getting data from the edge into the data lake. Delta Lake can act as both a source and a sink for a streaming application. There are actually very few problems using Delta as a source. The Delta Lake source lives on blob storage, so we get around many of the infrastructure problems, but we add the consistency issues of blob storage. Delta Lake as a source for streaming jobs is far more scalable than Kafka/Kinesis/Event Hub, but you still need those tools to get data from the edge into the Delta lake.
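
As an illustration, here is a Structured Streaming sketch that ingests from Kafka into a bronze Delta table and then reads that table back as a streaming source. The broker address, topic, paths, and checkpoint location are hypothetical:

    # Sketch: Kafka -> Delta (bronze), then Delta itself as a streaming source.
    # Broker address, topic, and paths are hypothetical placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Ingest events from the edge (Kafka) into a Delta table.
    kafka_stream = (spark.readStream
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "broker:9092")
                    .option("subscribe", "events")
                    .load())

    (kafka_stream.selectExpr("CAST(value AS STRING) AS json")
     .writeStream
     .format("delta")
     .option("checkpointLocation", "/mnt/checkpoints/bronze_events")
     .start("/mnt/bronze/events"))

    # Downstream jobs can then read the Delta table as a stream.
    delta_stream = spark.readStream.format("delta").load("/mnt/bronze/events")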

Armpit answered 19/5, 2019 at 23:17 Comment(7)
What are the differences between the Kappa and Delta architectures? Do you have an idea of which requirements around data access and latency I could investigate to make a comparison? Why do we still need tools like Kafka/Kinesis/Event Hub?Jew
I haven't used the Kappa architecture, so I'm no authority to give an opinion. The Delta architecture allows you to do streaming, batch, or both. The reason for Kafka/Kinesis/Event Hub is that you typically want some flexible message queue to push data from the data producers (like your cell phone) to some sort of event bus/hub before ingesting.Armpit
In 5) you talked about consistency issues, but the Delta Lake documentation says it offers ACID guarantees (consistency). Doesn't that contradict what you said?Ruhr
These are separate things. There is eventual consistency on blob storage, and there is consistency when reading/writing data. Delta Lake is currently only fully supported on HDFS. See the requirements for the underlying storage systems here for more info: github.com/delta-io/deltaArmpit
Delta Lake released 0.2.0, which adds support for the cloud storage services Amazon S3 and Azure Blob Storage, with improved concurrency.Lithium
How do you add data from a gold table to Azure SQL? Inserting new records can be done with bulkCopyToSqlDB, but how do you deal with updates?Batrachian
I would like to know how fast the Databricks and open-source versions will run with the OPTIMIZE functionality?Ulick
  1. The medallion tables are a recommendation based on how our customers are using Delta Lake. You do not have to follow it exactly; however, it does align nicely with how people design EDWs. As for machine learning and which table to use: that is going to be a choice made by the folks doing the machine learning. Some may want to access the Bronze tables because that is the raw data and nothing has been done to it. Others may want the Silver tables because they are presumed to be clean, albeit augmented. Usually the Gold tables are highly refined and specific to answering well-defined business questions.

  2. Not exactly. The Bronze tables are the raw event data, e.g. one row per event or measurement, etc. The Silver tables are also at the event/measurement level, but they are highly refined and ready for queries, reporting, dashboards, etc. Gold tables can be fact and dimension tables, aggregate tables, or curated data sets. It is important to remember that Delta is not meant to be used as a transactional, OLTP system. It is really meant for OLAP workloads.

  3. Delta architecture is the name we gave a particular implementation of Delta Lake. It is not a commercial term per se but hopefully it becomes one. There is enough information out there to compare and contrast Kappa and Lambda architectures. The Delta architecture is well-defined throughout Delta documentation and Databricks blogs, tech talks, YouTube videos, etc.

  4. I would ask exactly what it is you want to compare. Speed, features, products, ...?

  5. Delta Lake is not trying to replace any messaging pub/sub systems; they have different use cases. Delta Lake can connect to each of the products you mention, both as a subscriber and as a publisher. Don't forget that Delta Lake is an open storage layer that brings ACID-compliant transactions, high performance, and high reliability to data lakes (see the upsert sketch after this list).
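
To illustrate those ACID transactions, here is a minimal sketch of an upsert (MERGE) into a gold Delta table using the delta-spark Python API. Table paths and column names are hypothetical:

    # Sketch: transactional upsert (MERGE) into a Delta table.
    # Paths and column names are hypothetical placeholders.
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    gold = DeltaTable.forPath(spark, "/mnt/gold/customer_totals")
    updates = spark.read.format("delta").load("/mnt/silver/daily_totals")

    (gold.alias("t")
     .merge(updates.alias("u"), "t.customer_id = u.customer_id")
     .whenMatchedUpdate(set={"total_amount": "t.total_amount + u.total_amount"})
     .whenNotMatchedInsertAll()
     .execute())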

Louis.

Changeling answered 14/7, 2020 at 12:49 Comment(4)
I would like to know how fast the Databricks versions will run with the OPTIMIZE functionality?Ulick
What do you mean by "... deploy, with the optimize functionalities?"Changeling
Big Lou, docs.databricks.com/delta/optimizations/file-mgmt.htmlUlick
Cristian, the time it takes for the optimization (compaction) process to run is dependent upon a few factors: 1. Overall size of data being optimized, 2. The number of Delta files being compacted, and 3. The size and makeup of the cluster running the optimization.Changeling
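
For reference, a minimal sketch of triggering that compaction from a notebook. The table name events and the Z-ORDER column event_date are hypothetical; OPTIMIZE was a Databricks feature when this thread was written, and open-source Delta Lake added it later:

    # Compact small files in a Delta table, optionally co-locating related data.
    # "events" and "event_date" are hypothetical names; assumes a notebook or
    # session where `spark` is already defined.
    spark.sql("OPTIMIZE events ZORDER BY (event_date)")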
