Azure Data Lake Analytics and Azure Databricks can both be used for batch processing. Could anyone please help me understand when to choose one over the other?
In my humble opinion, a lot of it comes down to existing skillsets. If you have a team experienced in Spark, Java, Python, R or Scala, then Databricks is a natural fit. If, on the other hand, you have a team with existing SQL and C# skills, then the learning curve for them with U-SQL will be less steep.
That aside, there are other questions which can drive out differences:
- Do you require real-time interaction (Databricks) or batch-mode analytics (both)? There is a feedback item requesting real-time interactivity for U-SQL; please vote for it if you want it.
- Do you want a pay-as-you-go model (U-SQL) or clusters with auto-terminate after a certain period (Databricks)?
- Do you prefer working in a notebook (Databricks) or with Visual Studio / VS Code / PowerShell / the .NET SDK (U-SQL)?
- Do you want to use Spark libraries like GraphX (Databricks)?
- Do you want the ability to run and scale any runtime (U-SQL)? See here for more details.
- Do you want a local development emulator (U-SQL)? The U-SQL emulator in Visual Studio is seamless, i.e. you develop your code against your local drives in the same structure as your lake (for free), then simply click the drop-down in Visual Studio to run in the cloud. Although I think you can have a local Spark environment, I'm not sure what the local (and disconnected) development experience is for Databricks.
- Are you using ADLS Gen 2 (only Databricks)? See here.
UPDATE October 2018: As far as I am aware, U-SQL does not currently support ADLS Gen 2, which would count against it (happy to be corrected). I will update the post if and when that support is added.
UPDATE January 2019: U-SQL has not had any meaningful updates since Spring 2018.
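To make the checklist above a little more concrete, here is a rough sketch of how the questions could be turned into a tiny decision helper. Everything in it (the `Requirements` fields, the `suggest_engine` function, the simple vote counting) is my own hypothetical illustration, not an official tool or API; it treats the items I marked as Databricks-only as hard constraints and counts the rest as preferences.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    # Hypothetical field names, chosen only to mirror the checklist above.
    needs_interactive: bool = False      # real-time interaction
    needs_spark_libraries: bool = False  # e.g. GraphX
    uses_adls_gen2: bool = False         # ADLS Gen 2 storage
    prefers_notebooks: bool = False
    team_knows_spark: bool = False       # Spark, Java, Python, R or Scala
    wants_pay_per_job: bool = False      # pay-as-you-go model
    needs_local_emulator: bool = False
    needs_custom_runtimes: bool = False  # run and scale any runtime
    team_knows_sql_csharp: bool = False


def suggest_engine(req: Requirements) -> str:
    """Return 'Databricks', 'U-SQL' or 'either' for a set of requirements."""
    # Items listed as Databricks-only are treated as hard constraints.
    if req.needs_interactive or req.needs_spark_libraries or req.uses_adls_gen2:
        return "Databricks"

    # Everything else is a preference; count one vote per item (an assumption,
    # real projects would weight these differently).
    databricks_votes = sum([req.prefers_notebooks, req.team_knows_spark])
    usql_votes = sum([
        req.wants_pay_per_job,
        req.needs_local_emulator,
        req.needs_custom_runtimes,
        req.team_knows_sql_csharp,
    ])

    if databricks_votes > usql_votes:
        return "Databricks"
    if usql_votes > databricks_votes:
        return "U-SQL"
    return "either"
```

For example, `suggest_engine(Requirements(team_knows_sql_csharp=True, needs_local_emulator=True))` leans towards U-SQL, while setting `uses_adls_gen2=True` tips it to Databricks regardless of the other answers. The equal weighting is only a starting point; adjust it to whatever matters most for your team.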
HTH
Databricks has more language options, which allows professionals with different skills to work on the data. Also, with Databricks you can run jobs on high-performance, in-memory clusters.
In one project, we use the data lake mainly as storage and run all the jobs (ETL, analytics) via Databricks notebooks. Storing data in the data lake is cheaper.
Back to your question: if the batch jobs are complex and different types of professionals will work on the data, you may choose an Azure Data Lake + Databricks architecture. Otherwise, Azure Data Lake alone would satisfy your needs.
These two articles may help: https://databricks.com/glossary/data-lake https://visualbi.com/blogs/microsoft/azure/etl-azure-databricks-vs-data-lake-analytics/