Oozie fork kills all actions when one is killed
I use fork/join in Oozie in order to run some sub-workflow actions in parallel. My workflow.xml looks like this:

<workflow-app name="myName" xmlns="uri:oozie:workflow:0.5">
<start to="fork1"/>
<kill name="Kill">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>

<fork name="fork1">
    <path start="subworkflow1"/>
    <path start="subworkflow2"/>
</fork>
<join name="Completed" to="End"/>

<action name="subworkflow1">
    <sub-workflow>
        <app-path>....</app-path>
        <propagate-configuration/>
        <configuration>
            <property>
                <name>....</name>
                <value>....</value>
            </property>
        </configuration>
    </sub-workflow>
    <ok to="Completed"/>
    <error to="Completed"/>
</action>

<action name="subworkflow2">
    <sub-workflow>
        <app-path>....</app-path>
        <propagate-configuration/>
        <configuration>
            <property>
                <name>....</name>
                <value>....</value>
            </property>
        </configuration>
    </sub-workflow>
    <ok to="Completed"/>
    <error to="Completed"/>
</action>

<end name="End"/>
</workflow-app>

When subworkflow1 is killed (it failed for some reason), Oozie kills subworkflow2 as well. I want those two actions to run in parallel but be independent of each other.

In my workflow, when subworkflow1 is killed, I see that subworkflow2 is also killed, yet my app succeeded (I checked on the Oozie dashboard -> workflows in HUE).

In this case I want subworkflow1 to be killed, subworkflow2 to succeed, and I don't really care what status my entire app reports.

  • In my case, subworkflow1 takes longer than subworkflow2, so when I checked my app after it ended, it said that both subworkflow1 and subworkflow2 were killed while the app itself succeeded. What really happened is that subworkflow2 finished its work and was only killed afterwards (an action keeps 'running' until all paths of the fork finish). So subworkflow2 completed its part and was then marked killed because subworkflow1 was killed.

What should I do to make each path get its own status and keep running even when another path in the same fork is killed?

Howbeit answered 8/7, 2015 at 12:13 Comment(0)
I recently ran into this issue as well and found a way to get Oozie to behave the way I want.

Give your forked actions an error transition that points to your join node. This skips any subsequent action on that particular forked execution path. Then have the join's "to" value send control to a decision node. That decision node should check the value of wf:lastErrorNode(). If the value is an empty string, continue processing the workflow as needed. If the value is not an empty string, an error occurred and you can send control to a kill node.

Here's an example:

<start to="forkMe"/>
<fork name="forkMe">
    <path start="action1"/>
    <path start="action2"/>
</fork>
<action name="action1">
    ...
    <ok to="joinMe"/>
    <error to="joinMe"/>
</action>
<action name="action2">
    ...
    <ok to="joinMe"/>
    <error to="joinMe"/>
</action>
<join name="joinMe" to="decisionMe"/>
<decision name="decisionMe">
    <switch>
        <case to="end">
            ${wf:lastErrorNode() eq ""}
        </case>
        <default to="error-mail"/>
    </switch>
</decision>
<action name="error-mail">
    ...
    <ok to="fail"/>
    <error to="fail"/>
</action>
<kill name="fail">
    <message>Job failed:
        message[${wf:errorMessage(wf:lastErrorNode())}]
    </message>
</kill>
<end name="end"/>
Eatage answered 29/1, 2016 at 18:42 Comment(1)
Using a decision node like this means you won't be able to re-run your workflow with -Doozie.wf.rerun.failnodes=true if it fails: the decision node already ran in the first attempt and is marked as succeeded, so it will not be re-evaluated, the job status after re-running will still be KILLED, and any nodes after the decision node that haven't yet run will not be executed. You'd have to use skip nodes when re-running to get around this, which is cumbersome when you have a lot of nodes to skip because you have to specify every single one. (Clinton)
A few ways to handle it.

1) You can submit these two sub-workflows independently instead of wrapping them in one big workflow.

2) Add retries to sub-workflow 1; sub-workflow 2 won't be killed until sub-workflow 1 fails its final retry. If you set a long retry interval, sub-workflow 2 will already be finished by the time sub-workflow 1 fails for the last time, and its status will remain OK. A kill does not affect actions whose status is already OK.
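The retry idea in (2) can be sketched with Oozie's user-retry attributes on the action node (`retry-max` and `retry-interval`, where the interval is in minutes). The values below are illustrative, and note that which error codes are actually retried also depends on the server's user-retry configuration:

```xml
<action name="subworkflow1" retry-max="3" retry-interval="10">
    <sub-workflow>
        <app-path>...</app-path>
        <propagate-configuration/>
    </sub-workflow>
    <ok to="Completed"/>
    <error to="Completed"/>
</action>
```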

As for this part of the question: "In my workflow, when workflow1 is killed, I see that workflow2 is also killed, but my app succeeded (I check it on Oozie dashboard -> workflows in HUE)."

A: That is caused by <error to="Completed"/>. As long as the Completed node never leads to a kill node, this setting makes Oozie consider the workflow to have finished successfully even when an error occurs in the action.
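Conversely, if you want the overall app to report a failure when a sub-workflow errors out, route the error transition to the kill node instead. A sketch using the node names from the question's workflow:

```xml
<action name="subworkflow1">
    <sub-workflow>
        <app-path>...</app-path>
    </sub-workflow>
    <ok to="Completed"/>
    <!-- surfaces the failure instead of hiding it -->
    <error to="Kill"/>
</action>
```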

Lochia answered 9/7, 2015 at 8:19 Comment(0)
I handled this issue by setting the forked actions' error transition to the join node. The join node is then set to continue to an ssh node:

<action name="ssh-50c1">
    <ssh xmlns="uri:oozie:ssh-action:0.1">
        <host>${SSH_USER_HOST}</host>
        <command>${wf:lastErrorNode() eq null}</command>
        <capture-output/>
    </ssh>
    <ok to="End"/>
    <error to="Kill"/>
</action>

An alternative with a shell node might be possible and more appropriate, but it did not work for me. You could also accomplish the same thing with a decision node (checking wf:lastErrorNode()), but then you would run into the re-run trouble described above, because decision nodes are marked as succeeded even after a failure.
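The trick here appears to be that Oozie resolves the EL expression before the ssh action runs, so the remote host executes the literal command `true` or `false`, whose exit codes (0 and 1) drive the ok/error transitions. A minimal local sketch of that behaviour (plain bash, no Oozie involved):

```shell
#!/usr/bin/env bash
# Simulate what the ssh host executes after Oozie resolves the EL expression.

cmd="true"    # what ${wf:lastErrorNode() eq null} becomes when no node failed
if $cmd; then
  echo "exit 0: the <ok> transition fires"
fi

cmd="false"   # what the same expression becomes when some node failed
if $cmd; then
  :
else
  echo "exit 1: the <error> transition fires"
fi
```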

Regurgitation answered 11/6, 2019 at 13:20 Comment(1)
So, what's the question? (Exuberant)
In addition to Jeffrey B's answer, there is a way to do a failNode rerun.

You can use a shell action instead of a decision node.

workflow.xml

<start to="forkMe"/>
<fork name="forkMe">
    <path start="action1"/>
    <path start="action2"/>
</fork>
<action name="action1">
    ...
    <ok to="joinMe"/>
    <error to="joinMe"/>
</action>
<action name="action2">
    ...
    <ok to="joinMe"/>
    <error to="joinMe"/>
</action>
<join name="joinMe" to="decisionMe"/>
<action name="decisionMe">
    <shell xmlns="uri:oozie:shell-action:0.3">
        <exec>error_check.sh</exec>
        <argument>${wf:lastErrorNode()}</argument>
        <file>/path/to/script/error_check.sh</file>
    </shell>
    <ok to="end"/>
    <error to="error-mail"/>
</action>
<action name="error-mail">
    ...
    <ok to="fail"/>
    <error to="fail"/>
</action>
<kill name="fail">
    <message>Job failed:
        message[${wf:errorMessage(wf:lastErrorNode())}]
    </message>
</kill>
<end name="end"/>

error_check.sh

#!/usr/bin/env bash

# Name of the failed node, passed in as ${wf:lastErrorNode()};
# empty when no node in the workflow has errored.
errorNode=$1

if [[ -z "$errorNode" ]]; then
  exit 0   # no failure: the shell action succeeds and <ok> fires
else
  echo "Error Node : $errorNode !"
  exit 1   # failure detected: the shell action fails and <error> fires
fi
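The script can be exercised locally before wiring it into the workflow. This sketch recreates `error_check.sh` in the current directory and checks both branches (a local test only, not part of the workflow itself):

```shell
#!/usr/bin/env bash
# Recreate error_check.sh locally and run it with and without a failed node.
cat > error_check.sh <<'EOF'
#!/usr/bin/env bash
errorNode=$1
if [[ -z "$errorNode" ]]; then
  exit 0
else
  echo "Error Node : $errorNode !"
  exit 1
fi
EOF
chmod +x error_check.sh

./error_check.sh "" && echo "no failed node -> exit 0 -> ok-to fires"
./error_check.sh "action1" || echo "failed node -> exit 1 -> error-to fires"
```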
Kinelski answered 11/10, 2022 at 3:36 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.