Can anyone suggest which scheduler is best suited for Hadoop? If it is Oozie, how is Oozie different from cron jobs?
Oozie is the best option.
Oozie Coordinator allows triggering actions when files arrive in HDFS. This would be challenging to implement anywhere else.
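To make the file-arrival trigger concrete, here is a minimal sketch of a coordinator that waits for a daily dataset before running. The dataset name `logs`, the `/data/logs/...` path, and the `inputDir` property are hypothetical placeholders, and `${start}`, `${end}`, `${nameNode}` and `${workflowAppUri}` are assumed to come from a properties file.

[xml]
<coordinator-app name="run-when-data-arrives"
                 frequency="${coord:days(1)}"
                 start="${start}" end="${end}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
  <datasets>
    <!-- Hypothetical dataset: one directory per day under /data/logs -->
    <dataset name="logs" frequency="${coord:days(1)}"
             initial-instance="${start}" timezone="UTC">
      <uri-template>${nameNode}/data/logs/${YEAR}/${MONTH}/${DAY}</uri-template>
      <!-- The instance counts as available once this flag file exists -->
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <!-- The action is held back until the current day's instance is available -->
    <data-in name="input" dataset="logs">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${workflowAppUri}</app-path>
      <configuration>
        <property>
          <!-- Hypothetical property: passes the resolved input path to the workflow -->
          <name>inputDir</name>
          <value>${coord:dataIn('input')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
[/xml]

Whatever produces the data still has to write the `_SUCCESS` flag (MapReduce output does this by default), so the trigger depends on the upstream job cooperating.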
Oozie gets callbacks from MapReduce jobs, so it knows when they finish and whether they hang, without expensive polling. As far as I know, no other workflow manager does this.
Oozie has some benefits over crontab and other schedulers:
Oozie can start jobs on data availability, although this is not free, since something has to signal when the data is available. Oozie lets you build complex workflows graphically, using the mouse. Oozie lets you schedule workflow execution using the coordinator. Oozie lets you bundle one or more coordinators.
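As a rough illustration of the bundling mentioned above, a bundle definition simply lists the coordinators it groups. This is a sketch only: the coordinator names `ingest` and `report` and the `${ingestCoordinatorUri}` / `${reportCoordinatorUri}` variables are hypothetical and would be defined in your own properties file.

[xml]
<bundle-app name="daily-pipelines" xmlns="uri:oozie:bundle:0.1">
  <!-- Each entry points at the HDFS path of a coordinator application -->
  <coordinator name="ingest">
    <app-path>${ingestCoordinatorUri}</app-path>
  </coordinator>
  <coordinator name="report">
    <app-path>${reportCoordinatorUri}</app-path>
  </coordinator>
</bundle-app>
[/xml]

Submitting the bundle lets you start, suspend, or kill all of its coordinators as one unit.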
Using cron on Hadoop is a bad idea, but it is still fast, reliable, and well known. Most of the work that comes for free with Oozie has to be coded yourself if you use cron.
Using Oozie without Java means (as of this writing) dealing with a long list of dependency problems. If you are a Java programmer, Oozie is a must.
Cron is still a good choice when you are in the test/verify stage.
Oozie separates specifications for workflow and schedule into a workflow specification and a coordinator specification, respectively. Coordinator specifications are optional; they are only required if you want to run a job repeatedly on a schedule. By convention, you usually see the workflow specification in a file called workflow.xml and the coordinator specification in a file called coordinator.xml. The new cron-like scheduling affects these coordinator specifications. Let’s take a look at a coordinator specification that will cause a workflow to be run every weekday at 2 AM.
[xml]
<coordinator-app name="weekdays-at-two-am"
                 frequency="0 2 * * 2-6"
                 start="${start}" end="${end}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.2">
  <action>
    <workflow>
      <app-path>${workflowAppUri}</app-path>
      <configuration>
        <property>
          <name>jobTracker</name>
          <value>${jobTracker}</value>
        </property>
        <property>
          <name>nameNode</name>
          <value>${nameNode}</value>
        </property>
        <property>
          <name>queueName</name>
          <value>${queueName}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
[/xml]
The key thing here is the frequency attribute in the coordinator-app element: it holds a cron-like specification that tells Oozie when to run the workflow. The values for the ${...} variables, such as ${start} and ${end}, are specified in a separate properties file. The specification is “cron-like”, and you might notice one important difference: days of the week are numbered 1-7 (1 being Sunday), as opposed to the 0-6 numbering used in standard cron. That is why the weekday range here is written as 2-6 (Monday to Friday).
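For completeness, the workflow.xml that such a coordinator points to via ${workflowAppUri} could look roughly like the sketch below. It is not from the linked post: the workflow name, the single `fs` action, and the `/tmp/example-output` path are hypothetical, chosen only to show the start/action/kill/end skeleton.

[xml]
<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="cleanup"/>
  <!-- Hypothetical action: delete an HDFS directory before the real work starts -->
  <action name="cleanup">
    <fs>
      <delete path="${nameNode}/tmp/example-output"/>
    </fs>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
[/xml]

The ${nameNode} value used here is the one the coordinator passes down in its configuration block.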
For more information, visit: http://hortonworks.com/blog/new-in-hdp-2-more-powerful-scheduling-options-in-oozie/
Apache Oozie is built to work with YARN and HDFS.
Oozie provides many features, such as data dependencies, coordinators, and workflow actions. See the Oozie documentation.
I think Oozie is the best option.
Sure, you can use cron, but it will take a lot of effort to make it work with Hadoop.