How to choose between Azure Data Lake Analytics and Azure Databricks

2

22

Azure Data Lake Analytics and Azure Databricks can both be used for batch processing. Could anyone please help me understand when to choose one over the other?

Synthesis answered 22/5, 2018 at 11:48 Comment(0)
33

In my humble opinion, a lot of it comes down to existing skillsets. If you have a team experienced in Spark, Java, Python, R or Scala, then Databricks is a natural fit. If, on the other hand, you have a team with existing SQL and C# skills, then the learning curve for them with U-SQL will be less steep.

That aside, there are other questions which can drive out differences:

  • Do you require real-time interaction (Databricks) or batch-mode analytics (both)? There is a feedback item for real-time interactivity in U-SQL; please vote for it.
  • Do you want a pay-as-you-go model (U-SQL) or clusters with auto-terminate after a certain period (Databricks)?
  • Do you prefer working in a notebook (Databricks) or with Visual Studio / VS Code / PowerShell / the .NET SDK (U-SQL)?
  • Do you want to use Spark libraries like GraphX (Databricks)?
  • Do you want the ability to run and scale any runtime (U-SQL)? See here for more details.
  • Do you want a local development emulator (U-SQL)? The U-SQL emulator in Visual Studio is seamless, i.e. you develop your code against your local drives in the same structure as your lake (for free), then simply click the drop-down in Visual Studio to run in the cloud. Although I think you can have a local Spark environment, I'm not sure what the local (and disconnected) development experience is for Databricks.
  • Are you using ADLS Gen 2 (only Databricks)? See here.
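
For anyone who wants to work through the questions above systematically, here is a rough sketch of them as a checklist in Python. It is only an illustration: the class, field and function names are made up for this example, every question is given equal weight (an assumption, as in practice some will matter far more to you than others), and ADLS Gen 2 is treated as a hard constraint because only Databricks supports it.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical checklist mirroring the questions in the list above."""
    team_knows_spark_stack: bool = False   # Spark, Java, Python, R or Scala
    team_knows_sql_csharp: bool = False    # existing SQL and C# skills
    needs_realtime_interaction: bool = False
    wants_pay_as_you_go: bool = False
    prefers_notebooks: bool = False
    needs_spark_libraries: bool = False    # e.g. GraphX
    wants_local_emulator: bool = False
    uses_adls_gen2: bool = False


def tally(req: Requirements) -> dict:
    """Count how many answers point to each service (equal weights: an assumption)."""
    databricks = sum([
        req.team_knows_spark_stack,
        req.needs_realtime_interaction,
        req.prefers_notebooks,
        req.needs_spark_libraries,
    ])
    usql = sum([
        req.team_knows_sql_csharp,
        req.wants_pay_as_you_go,
        req.wants_local_emulator,
    ])
    return {"Databricks": databricks, "U-SQL": usql}


def suggest(req: Requirements) -> str:
    """Return 'Databricks', 'U-SQL' or 'either' for a set of answers."""
    # Hard constraint from the list: ADLS Gen 2 works only with Databricks.
    if req.uses_adls_gen2:
        return "Databricks"
    scores = tally(req)
    if scores["Databricks"] == scores["U-SQL"]:
        return "either"
    return max(scores, key=scores.get)


if __name__ == "__main__":
    team = Requirements(team_knows_sql_csharp=True, wants_local_emulator=True)
    print(tally(team), suggest(team))
```

Treat the result as a prompt for discussion rather than a verdict; the point is simply to make explicit which requirement is pushing you towards which service.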

UPDATE October 2018: As far as I am aware, U-SQL does not currently support ADLS Gen 2, which would count against it (happy to be corrected). I will update the post if and when that support is added.

UPDATE January 2019: U-SQL has not had any meaningful updates since Spring 2018.

HTH

Niemann answered 22/5, 2018 at 13:43 Comment(6)
+1 for a detailed answer. All of these points make sense, but what are the differences in terms of architecture, performance and capability? Synthesis
Excellent answer. @Niemann Where do you think HDInsight fits into the mix here? In what scenario would I want to use one over the other? Colorcast
Hi, nice summary. There's a UserVoice ticket for ADLS Gen 2 support if you wish to vote: feedback.azure.com/forums/327234-data-lake/suggestions/… Rimose
@wBob: Do you have any news about U-SQL and ADLS Gen 2? Valvular
@Niemann: Is there any limitation to use ADLS and ADF Gen ? Valvular
Azure Data Lake Gen 2 does not support U-SQL. Niemann
5

Databricks has more language options, which allows professionals with different skills to work on the data. Also, with Databricks you can run jobs on high-performance, in-memory clusters.

In one project, we use the data lake mainly as storage and run all the jobs (ETL, analytics) through Databricks notebooks. Storing data in the data lake is cheaper.

Back to your question: if the batch jobs are complex and professionals with different skills will work on the data, you may choose an Azure Data Lake + Databricks architecture. Otherwise, Azure Data Lake would satisfy your needs.

Taking a look at these two articles would help: https://databricks.com/glossary/data-lake https://visualbi.com/blogs/microsoft/azure/etl-azure-databricks-vs-data-lake-analytics/

Oilcup answered 11/3, 2019 at 15:51 Comment(0)
