Which one to choose: Apache Oozie or Apache Airflow? Need a comparison

I am new to job schedulers and am looking for one to run jobs on a big data cluster. I am quite confused by the available choices. I found Oozie to have many limitations compared to established schedulers such as TWS, Autosys, etc.

I need some comparison points on Oozie vs. Airflow.

I would appreciate your help.

Cisco answered 21/12, 2017 at 16:25 Comment(0)

In my experience, Airflow is the best data pipeline tool right now. It's best suited for managing complex, long-running workflows. Its UI and modularity are excellent.

Airflow

  • + Python code for DAGs
  • + Has connectors for every major service/cloud provider
  • + More versatile
  • + Advanced metrics
  • + Better UI and API
  • + Capable of creating extremely complex workflows
  • + Jinja templating
  • + Can be used as an orchestrator for the TensorFlow Extended ecosystem
  • = Can be parallelized
  • = Native connections to HDFS, Hive, Pig, etc.
  • = Graph as DAG

Oozie

  • --- Java or XML for DAGs
  • - Hard to build complex pipelines
  • - Smaller, less active community
  • - Worse web GUI
  • - Java API
  • = Can be parallelized
  • = Native connections to HDFS, Hive, Pig, etc.
  • = Graph as DAG
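
Both lists share "Graph as DAG" and "Can be parallelized". As a rough illustration of what that means in practice, here is a minimal sketch that uses only Python's standard library (`graphlib`, Python 3.9+). It is not Airflow or Oozie code, and the task names and the `execution_waves` helper are hypothetical, made up for this example. It groups tasks into "waves": every task in a wave has all of its dependencies finished, so a scheduler would be free to run them in parallel.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_tables": {"extract_orders", "extract_customers"},
    "load_report": {"join_tables"},
}


def execution_waves(dag):
    """Group tasks into waves. Tasks in the same wave do not depend on
    each other, so a scheduler could run them at the same time."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


waves = execution_waves(pipeline)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

In this toy pipeline the two extract tasks land in the first wave, and the join and load steps follow one after the other. Real schedulers add retries, scheduling intervals, and distributed workers on top of this basic idea.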

As you can see, Airflow is easier to use (especially in a large, heterogeneous team), more versatile, and more powerful than Oozie.

As I said: go with Airflow.

Article you may find interesting

Gallaway answered 21/12, 2017 at 17:12 Comment(5)
Another point for Airflow: Google now offers a fully managed version of Airflow, distributed using Kubernetes, via its new product, Composer. - Orobanchaceous
This looks like an advertising response to me. Is Java really a '-'? What about Groovy, JRuby, Jython, and other JVM-based languages? To me that looks better than Python only, although Python is a nice language. I can agree that it looks a little outdated, but I see no point in that, since for business it should not matter. - Manyplies
If any other cloud provider steps up and offers something similar, I will update the comment; not having to manage your distributed clusters simplifies things by a long shot. Python is unequivocally easier for people to pick up, easier to read, and less verbose to write, but its real strength is direct access to the most widely used data science libraries. I am not saying that Java is inferior to Python; however, in this specific use case Python does make things easier. - Orobanchaceous
I use Oozie more for data engineering/data science projects on Hadoop/Spark. For Python, we can use a bash script as a shell action in Oozie and then let bash do all the Python stuff. :) - Beal
I'm not that familiar with Airflow, but I can add a few more things to consider:
  • Have you seen the Fluent API of Oozie? It can be used to build complex pipelines.
  • You can use HUE as a web UI: github.com/cloudera/hue
  • Do you need to handle timezones?
  • How do you create Oozie-like bundles?
  • How do you implement HA for the Airflow scheduler? Is it a single point of failure (SPoF)?
  • Oozie is used by many companies for large-scale data processing.
  • Oozie was designed for Hadoop. What about delegation tokens in Airflow?
  • What about SLAs for coordinators and workflows?
- Suellensuelo
