How to execute parallel jobs in Oozie
I have a shell script in HDFS, and I have scheduled it in Oozie with the following workflow.

Workflow:

<workflow-app name="Shell_test" xmlns="uri:oozie:workflow:0.5">
<start to="shell-8f63"/>
<kill name="Kill">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="shell-8f63">
    <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>shell.sh</exec>
        <argument>${input_file}</argument>
        <env-var>HADOOP_USER_NAME=${wf:user()}</env-var>
        <file>/user/xxxx/shell_script/lib/shell.sh#shell.sh</file>
        <file>/user/xxxx/args/${input_file}#${input_file}</file>
    </shell>
    <ok to="End"/>
    <error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>

job properties

nameNode=xxxxxxxxxxxxxxxxxxxx
jobTracker=xxxxxxxxxxxxxxxxxxxxxxxx
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/xxxxxxx/xxxxxx 
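
The workflow is submitted with the standard Oozie CLI using the properties file above (the server URL below is just a placeholder):

# submit the workflow using the job.properties file above
oozie job -oozie http://oozie-server:11000/oozie -config job.properties -run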

args file

tableA
tableB
tablec
tableD

The shell script currently runs for a single job name from the args file. How can I schedule it to run in parallel?

I want the script to run for 10 jobs at the same time.

What are the steps needed to do so, and what changes should I make to the workflow?

Should I create 10 workflows to run 10 parallel jobs, or is there a better way to deal with this?

My shell script:

#!/bin/bash

[ $# -ne 1 ] && { echo "Usage : $0 table ";exit 1; }

table=$1

job_name=${table}

sqoop job  --exec ${job_name}

My sqoop job script:

sqoop job --create ${table} -- import --connect ${domain}:${port}/${database} --username ${username} --password ${password} --query "SELECT * from ${database}.${table} WHERE \$CONDITIONS" -m 1 --hive-import --hive-database ${hivedatabase} --hive-table ${table} --as-parquetfile --incremental append --check-column id --last-value "${last_val}"  --target-dir /user/xxxxx/hive/${hivedatabase}.db/${table} --outdir /home/$USER/logs/outdir
Calvados answered 17/4, 2017 at 17:44 Comment(2)
Are they always 10? Why not use fork and join? – Sixtyfour
Then you should handle that in the script itself. What is this script about? – Sixtyfour

To run the jobs in parallel, you can create a workflow.xml with forks in it. See the example below, which should help you.

If you look at the XML below, you will see that I am using the same script and passing a different config file to each action; in your case you would pass the different table names you want, either from a config file or directly in your workflow.xml.

Taking a Sqoop job as an example, your sqoop command should live in the .sh script, as below:

sqoop job --create ${table} -- import --connect ${domain}:${port}/${database} --username ${username} --password ${password} --query "SELECT * from "${database}"."${table}" WHERE \$CONDITIONS" -m 1 --hive-import --hive-database "${hivedatabase}" --hive-table "${hivetable}" --as-parquetfile --incremental append --check-column id --last-value "${last_val}"  --target-dir /user/xxxxx/hive/${hivedatabase}.db/${table} --outdir /home/$USER/logs/outdir

So basically you should write your sqoop job as generically as you can: it should receive the hive table, hive database, source table, and source database names from the workflow.xml. That way you call the same script from every action, and only the env-var entries in each workflow action change. See the changes I made to the first action in the workflow further below.
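
As a rough sketch only, a generic shell.sh could look like this. The variable names are assumptions matching the env-var entries used in the workflow below, and the connection details (domain, port, username, password, last_val) would have to be supplied the same way or read from a secured config file:

#!/bin/bash
# Rough sketch of a generic wrapper script: every value is expected to arrive
# as an environment variable set through <env-var> in the Oozie shell action.
# domain, port, username, password and last_val are assumed to be provided
# the same way; adjust to however you manage credentials.

for v in database table hivedatabase hivetable; do
  [ -z "${!v}" ] && { echo "missing env var: $v"; exit 1; }
done

# Create the sqoop job if it does not exist yet, then run it
# (skip --create on subsequent runs).
sqoop job --create "${table}" -- import \
  --connect "${domain}:${port}/${database}" \
  --username "${username}" --password "${password}" \
  --query "SELECT * FROM ${database}.${table} WHERE \$CONDITIONS" -m 1 \
  --hive-import --hive-database "${hivedatabase}" --hive-table "${hivetable}" \
  --as-parquetfile --incremental append --check-column id --last-value "${last_val}" \
  --target-dir "/user/xxxxx/hive/${hivedatabase}.db/${table}" \
  --outdir "/home/${USER}/logs/outdir"

sqoop job --exec "${table}"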

 <workflow-app xmlns='uri:oozie:workflow:0.5' name='Workflow_Name'>
    <start to="forking"/>
     
     <fork name="forking">
      <path start="shell-8f63"/>
      <path start="shell-8f64"/>
      <path start="SCRIPT3CONFIG3"/>
      <path start="SCRIPT4CONFIG4"/>
      <path start="SCRIPT5CONFIG5"/>
      <path start="script6config6"/>
    </fork>

    <action name="shell-8f63">
    <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>shell.sh</exec>
        <argument>${input_file}</argument>
        <env-var>database=sourcedatabase</env-var>
        <env-var>table=sourcetablename</env-var>
        <env-var>hivedatabase=yourhivedatabasename</env-var>
        <env-var>hivetable=yourhivetablename</env-var>
        <!-- You can pass as many variables as you want, one per env-var element.
             Read them in the shell script in double quotes, e.g. "${table}",
             which is the syntax that works through shell actions. -->
        <env-var>HADOOP_USER_NAME=${wf:user()}</env-var>
        <file>/user/xxxx/shell_script/lib/shell.sh#shell.sh</file>
        <file>/user/xxxx/args/${input_file}#${input_file}</file>
    </shell>	 
     <ok to="joining"/>
     <error to="sendEmail"/>
     </action>

    <action name="shell-8f64">
   <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>shell.sh</exec>
        <argument>${input_file}</argument>
        <env-var>database=sourcedatabase1</env-var>
        <env-var>table=sourcetablename1</env-var>
        <env-var>hivedatabase=yourhivedatabasename1</env-var>
        <env-var>hivetable=yourhivetablename1</env-var>
        <!-- Pass as many variables as you want, one per env-var element,
             and read them in double quotes inside the script. -->
        <env-var>HADOOP_USER_NAME=${wf:user()}</env-var>
        <file>/user/xxxx/shell_script/lib/shell.sh#shell.sh</file>
        <file>/user/xxxx/args/${input_file}#${input_file}</file>
    </shell>
    <ok to="joining"/>
    <error to="sendEmail"/>
    </action>

    <action name="SCRIPT3CONFIG3">
    <shell xmlns="uri:oozie:shell-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
    <property>
    <name>mapred.job.queue.name</name>
    <value>${queueName}</value>
    </property>
    </configuration>
    <exec>COMMON_SCRIPT_YOU_WANT_TO_USE.sh</exec>
    <argument>SQOOP_2</argument>
    <env-var>UserName</env-var>
    <file>${nameNode}/${projectPath}/COMMON_SCRIPT_YOU_WANT_TO_USE.sh#COMMON_SCRIPT_YOU_WANT_TO_USE.sh</file>
    <file>${nameNode}/${projectPath}/THIRD_CONFIG</file>

    </shell>	 
    <ok to="joining"/>
    <error to="sendEmail"/>
    </action>

    <action name="SCRIPT4CONFIG4">
    <shell xmlns="uri:oozie:shell-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
    <property>
    <name>mapred.job.queue.name</name>
    <value>${queueName}</value>
    </property>
    </configuration>
    <exec>COMMON_SCRIPT_YOU_WANT_TO_USE.sh</exec>
    <argument>SQOOP_2</argument>
    <env-var>UserName</env-var>
    <file>${nameNode}/${projectPath}/COMMON_SCRIPT_YOU_WANT_TO_USE.sh#COMMON_SCRIPT_YOU_WANT_TO_USE.sh</file>
    <file>${nameNode}/${projectPath}/FOURTH_CONFIG</file>

    </shell>	 
    <ok to="joining"/>
    <error to="sendEmail"/>
    </action>

    <action name="SCRIPT5CONFIG5">
    <shell xmlns="uri:oozie:shell-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
    <property>
    <name>mapred.job.queue.name</name>
    <value>${queueName}</value>
    </property>
    </configuration>
    <exec>COMMON_SCRIPT_YOU_WANT_TO_USE.sh</exec>
    <argument>SQOOP_2</argument>
    <env-var>UserName</env-var>
    <file>${nameNode}/${projectPath}/COMMON_SCRIPT_YOU_WANT_TO_USE.sh#COMMON_SCRIPT_YOU_WANT_TO_USE.sh</file>
    <file>${nameNode}/${projectPath}/FIFTH_CONFIG</file>

    </shell>	 
    <ok to="joining"/>
    <error to="sendEmail"/>
    </action>

    <action name="script6config6">
    <shell xmlns="uri:oozie:shell-action:0.1">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
    <property>
    <name>mapred.job.queue.name</name>
    <value>${queueName}</value>
    </property>
    </configuration>
    <exec>COMMON_SCRIPT_YOU_WANT_TO_USE.sh</exec>
    <argument>SQOOP_2</argument>
    <env-var>UserName</env-var>
    <file>${nameNode}/${projectPath}/COMMON_SCRIPT_YOU_WANT_TO_USE.sh#COMMON_SCRIPT_YOU_WANT_TO_USE.sh</file>
    <file>${nameNode}/${projectPath}/SIXTH_CONFIG</file>

    </shell>	 
    <ok to="joining"/>
    <error to="sendEmail"/>
    </action>

    <join name="joining" to="end"/>

    <action name="sendEmail">
    <email xmlns="uri:oozie:email-action:0.1">
    <to>youremail.com</to>
    <subject>your subject</subject>
    <body>your email body</body>
    </email>
    <ok to="kill"/>
    <error to="kill"/>
    </action>
     
    <kill name="kill">
    <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
    </workflow-app>

I have shown a 6-way parallel example above; if you want to run more parallel actions, add more paths to the fork at the beginning and write the corresponding actions in the workflow.

Here is how it looks from Hue:

[screenshot of the fork/join workflow graph in Hue]

Wether answered 19/4, 2017 at 14:18 Comment(5)
Thanks Rob! Btw, the Oozie docs for fork/join are here. – Homeroom
You can split into different argument files, or the easy way is to add it as an env-var within the action, something like <env-var>TableName=sometablename</env-var>; your actual script should then read that parameter as "${TableName}" (make sure you pass it in double quotes, that's the syntax for shell actions). – Wether
So I understand that you have a common sqoop job script which should be used for grabbing different tables? Am I correct? – Wether
@Active_user Can you show us your sqoop job? I was wondering whether you are selecting specific columns to import or the whole table from the source. – Wether
Please see my edited answer @Active_user. If it helps you, please upvote my answer, and if it works in your case, accept it. – Wether

From what I understand, you have a requirement to run 'x' number of jobs in parallel in Oozie, and this 'x' may vary each time. What you can do is:

Have a workflow with 2 actions.

  1. Shell Action
  2. SubWorkflow Action

  1. Shell Action - This runs a shell script which, based on your 'x', dynamically decides which tables to pick and generates an .xml file that will serve as the workflow XML for the sub-workflow action that comes next. That generated workflow uses a 'fork' over shell actions so they run in parallel. Note that you will need to put this xml in HDFS as well in order for it to be available to your sub-workflow (a rough sketch of such a generator script is shown after this list).

  2. SubWorkflow Action - It merely executes the workflow created in the previous action.
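
For illustration only, here is a rough sketch of what the generating script in step 1 could look like; the table list, paths and action contents are assumptions, not a tested implementation:

#!/bin/bash
# Sketch: build a fork/join sub-workflow with one shell action per table,
# then upload it to HDFS where the sub-workflow action expects to find it.

TABLES="tableA tableB tableC tableD"            # would normally come from your args file
WF=/tmp/dynamic_workflow.xml
HDFS_WF=/user/xxxx/dynamic_wf/workflow.xml      # path referenced by the sub-workflow action

{
  echo '<workflow-app name="dynamic_fork" xmlns="uri:oozie:workflow:0.5">'
  echo '<start to="forking"/>'
  echo '<fork name="forking">'
  for t in $TABLES; do echo "  <path start=\"shell-$t\"/>"; done
  echo '</fork>'
  for t in $TABLES; do
    cat <<EOF
<action name="shell-$t">
  <shell xmlns="uri:oozie:shell-action:0.1">
    <job-tracker>\${jobTracker}</job-tracker>
    <name-node>\${nameNode}</name-node>
    <exec>shell.sh</exec>
    <argument>$t</argument>
    <file>/user/xxxx/shell_script/lib/shell.sh#shell.sh</file>
  </shell>
  <ok to="joining"/>
  <error to="kill"/>
</action>
EOF
  done
  echo '<join name="joining" to="end"/>'
  echo '<kill name="kill"><message>action failed</message></kill>'
  echo '<end name="end"/>'
  echo '</workflow-app>'
} > "$WF"

hdfs dfs -put -f "$WF" "$HDFS_WF"

The sub-workflow action in the main workflow then simply points its app-path at the HDFS directory containing this generated workflow.xml.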

Orthotropous answered 18/4, 2017 at 5:8 Comment(0)
