Wait for a Kubernetes Job to complete on either failure or success using the command line

What is the best way to wait for a Kubernetes Job to complete? I noticed a lot of suggestions to use:

kubectl wait --for=condition=complete job/myjob

but I think that only works if the job is successful. If it fails, I have to do something like:

kubectl wait --for=condition=failed job/myjob

Is there a way to wait for both conditions using wait? If not, what is the best way to wait for a job to either succeed or fail?

Adria answered 9/3, 2019 at 2:31 Comment(0)

kubectl wait --for=condition=<condition name> waits for one specific condition, so as far as I know it cannot watch multiple conditions at the moment.

My workaround is to use oc get --wait; with --wait, the command exits once the target resource is updated. I monitor the status section of the Job with oc get --wait until the status is updated; an update to the status section means the Job has finished with some status conditions.

If the Job completes successfully, status.conditions[].type is updated to Complete right away. If the Job fails, the Job's pod is restarted automatically regardless of whether restartPolicy is OnFailure or Never, but we can treat the Job as Failed if the first status update is not Complete.

My test evidence follows.

  • Job yaml for testing successful complete
    # vim job.yml
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: pi
    spec:
      parallelism: 1
      completions: 1
      template:
        metadata:
          name: pi
        spec:
          containers:
          - name: pi
            image: perl
            command: ["perl",  "-wle", "exit 0"]
          restartPolicy: Never
  • It shows Complete if the Job completes successfully.
    # oc create -f job.yml &&
      oc get job/pi -o=jsonpath='{.status}' -w &&
      oc get job/pi -o=jsonpath='{.status.conditions[*].type}' | grep -i -E 'failed|complete' || echo "Failed" 

    job.batch/pi created
    map[startTime:2019-03-09T12:30:16Z active:1]Complete
  • Job yaml for testing failed complete
    # vim job.yml
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: pi
    spec:
      parallelism: 1
      completions: 1
      template:
        metadata:
          name: pi
        spec:
          containers:
          - name: pi
            image: perl
            command: ["perl",  "-wle", "exit 1"]
          restartPolicy: Never
  • It shows Failed if the first status update is not Complete. Test it after deleting the existing Job resource.
    # oc delete job pi
    job.batch "pi" deleted

    # oc create -f job.yml &&
      oc get job/pi -o=jsonpath='{.status}' -w &&
      oc get job/pi -o=jsonpath='{.status.conditions[*].type}' | grep -i -E 'failed|complete' || echo "Failed" 

    job.batch/pi created
    map[active:1 startTime:2019-03-09T12:31:05Z]Failed
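
The same idea works with plain kubectl by polling the Job's status conditions; here is a minimal sketch, where the job name pi and the 300-second budget are assumptions:

end=$((SECONDS + 300))
# poll until the Job reports a Failed or Complete condition, or the budget runs out
until [[ $SECONDS -gt $end ]] \
  || [[ $(kubectl get job pi -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}') == "True" ]] \
  || [[ $(kubectl get job pi -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}') == "True" ]]; do
  sleep 5
done

if [[ $(kubectl get job pi -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}') == "True" ]]; then
  echo "Complete"
else
  echo "Failed (or timed out)"
fi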

I hope this helps. :)

Cyst answered 9/3, 2019 at 13:39 Comment(3)
I ended up just making a simple script to check the status, as you had shown: until [[ $SECONDS -gt $end ]] || [[ $(kubectl get jobs $job_name -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}') == "True" ]] || [[ $(kubectl get jobs $job_name -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}') == "True" ]]; do – Adria
That's great, and I'm sorry for showing the OpenShift CLI example, but you can adapt it to the Kubernetes CLI. It's great! – Cyst
Actually there is no --wait, and -w stands for --watch. – Reneareneau

Run the first wait condition as a subprocess and capture its PID. If the condition is met, this process will exit with an exit code of 0.

kubectl wait --for=condition=complete job/myjob &
completion_pid=$!

Do the same for the failure wait condition. The trick here is to add && exit 1 so that the subprocess returns a non-zero exit code when the job fails.

kubectl wait --for=condition=failed job/myjob && exit 1 &
failure_pid=$!

Then use the Bash builtin wait -n $PID1 $PID2 to wait for one of the conditions to succeed. The command will capture the exit code of the first process to exit:

MAC USERS! Note that wait -n [...PID] requires Bash 4.3 or higher. macOS is forever stuck on version 3.2 due to licensing issues; see this Stack Overflow post on how to install a newer version.

wait -n $completion_pid $failure_pid

Finally, you can check the actual exit code of wait -n to see whether the job failed or not:

exit_code=$?

if (( $exit_code == 0 )); then
  echo "Job completed"
else
  echo "Job failed with exit code ${exit_code}, exiting..."
fi

exit $exit_code

Complete example:

# wait for completion as background process - capture PID
kubectl wait --for=condition=complete job/myjob &
completion_pid=$!

# wait for failure as background process - capture PID
kubectl wait --for=condition=failed job/myjob && exit 1 &
failure_pid=$! 

# capture exit code of the first subprocess to exit
wait -n $completion_pid $failure_pid

# store exit code in variable
exit_code=$?

if (( $exit_code == 0 )); then
  echo "Job completed"
else
  echo "Job failed with exit code ${exit_code}, exiting..."
fi

exit $exit_code
Chaoan answered 18/2, 2020 at 17:35 Comment(4)
You can use if wait ... instead of storing the exit code in a variable. – Glover
@Glover You're right; in the original script I was using trap, so I was using the exit code elsewhere. – Chaoan
wait -n is not available on macOS :( – Passant
Make sure you don't have set -e on! Otherwise the wait -n command will exit straight away if the failure_pid wins, and you won't get your nice if-statement logging. Apart from that, this approach worked perfectly. – Snazzy

You can leverage the behaviour of kubectl wait when --timeout=0 is used.

In this case the command returns immediately with exit code 0 or 1. Here's an example:

retval_complete=1
retval_failed=1
while [[ $retval_complete -ne 0 ]] && [[ $retval_failed -ne 0 ]]; do
  sleep 5
  output=$(kubectl wait --for=condition=failed job/job-name --timeout=0 2>&1)
  retval_failed=$?
  output=$(kubectl wait --for=condition=complete job/job-name --timeout=0 2>&1)
  retval_complete=$?
done

if [ $retval_failed -eq 0 ]; then
    echo "Job failed. Please check logs."
    exit 1
fi

So when either condition=failed or condition=complete is true, execution will exit the while loop (retval_complete or retval_failed will be 0).

Next, you only need to check and act on the condition you want. In my case, I want to fail fast and stop execution when the job fails.

Chapland answered 28/8, 2020 at 11:24 Comment(0)

The wait -n approach does not work for me, as I need the script to work on both Linux and macOS.

I improved a little on the --timeout=0 answer above, because that script would not work with set -e -E enabled. The following works even in that case.

while true; do
  if kubectl wait --for=condition=complete --timeout=0 job/name 2>/dev/null; then
    job_result=0
    break
  fi

  if kubectl wait --for=condition=failed --timeout=0 job/name 2>/dev/null; then
    job_result=1
    break
  fi

  sleep 3
done

if [[ $job_result -eq 1 ]]; then
    echo "Job failed!"
    exit 1
fi

echo "Job succeeded"

You might want to add a timeout to avoid an infinite loop, depending on your situation.
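
For example, here is a minimal sketch of the same loop with a time budget; the 600-second limit and job/name are assumptions:

end=$((SECONDS + 600))
job_result=""
while (( SECONDS < end )); do
  if kubectl wait --for=condition=complete --timeout=0 job/name 2>/dev/null; then
    job_result=0
    break
  fi

  if kubectl wait --for=condition=failed --timeout=0 job/name 2>/dev/null; then
    job_result=1
    break
  fi

  sleep 3
done

# an empty job_result means neither condition was met within the time budget
if [[ -z "$job_result" || "$job_result" -eq 1 ]]; then
  echo "Job failed or timed out!"
  exit 1
fi

echo "Job succeeded"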

Passant answered 17/3, 2021 at 15:39 Comment(4)
I would ask why it would not be sufficient to just use set -e, which would catch the failing command. Then I wouldn't need to check for the failed condition? @Martin Melka – Phaedra
When you call kubectl wait --for=condition=failed --timeout=0 job/name and the status of the pod is not failed, that command exits with a nonzero exit code. With set -e enabled, that would cause the whole script to terminate. The logic here is "while kubectl wait exits with a nonzero code, keep polling it". We only want the script to exit when kubectl wait exits with a zero exit code, because that means the pod is either completed or failed. – Passant
But the pod status generally lands on "error", like 0/1 (error), so possibly on the first failed check set -e would exit the script, correct? – Phaedra
Sorry, I don't follow what you mean. kubectl wait doesn't exit with the exit code of the pod status. kubectl wait --for=condition=complete --timeout=0 job/name exits with 0 (success) if the pod is currently in a completed (successful) state, and 1 (error) otherwise (that is, if the pod is still running/pending/failed/whatever). Similarly, kubectl wait --for=condition=failed --timeout=0 job/name exits with 0 (success) if the pod is currently in a failed state. It's done this way because there is no kubectl command that means "exit when the pod is success or error". – Passant

You can use the following workaround using kubectl logs --follow:

kubectl wait --for=condition=ready pod --selector=job-name=YOUR_JOB_NAME --timeout=-1s
kubectl logs --follow job/YOUR_JOB_NAME

It will terminate when your job terminates, with any status.
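
If you also need the Job's result after the logs end (the status can lag slightly behind the log stream), a possible follow-up is a short extra wait; this is a sketch where the 60-second grace period is an assumption:

kubectl wait --for=condition=ready pod --selector=job-name=YOUR_JOB_NAME --timeout=-1s
kubectl logs --follow job/YOUR_JOB_NAME

# the Job's status may take a moment to flip to Complete or Failed after the logs stop
if kubectl wait --for=condition=complete --timeout=60s job/YOUR_JOB_NAME 2>/dev/null; then
  echo "Job succeeded"
else
  echo "Job failed (or did not report Complete within 60s)"
  exit 1
fi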

Farhi answered 7/7, 2022 at 15:51 Comment(6)
It doesn't work for jobs with multiple containers. – Photothermic
Found a solution for the "multiple containers in a single job" case: add the --all-containers=true flag to the kubectl logs command. – Photothermic
Tested: it exits after the first container completes, so it is not applicable for initContainer + container; --pod-running-timeout=40s --ignore-errors=true does not help. – Nazarite
Your tip also works for standalone pods outside a job. – Cupulate
kubectl logs has an option --pod-running-timeout which seems to replace the first command, and it doesn't hang if the pod never comes up. 😉 – Lithosphere
It also doesn't seem to work, never mind. :/ Unfortunately, logs seems to terminate before the job switches its status to either succeeded or failed, so we have to wait again before being able to access the result. Square 1! – Lithosphere

Here is what I did. I label my jobs so that I can use the labels to find them, then I wait for the labeled job(s) to have status.ready=0 (k here is an alias for kubectl):

k wait -l label=value --for=jsonpath='{.status.ready}'=0 job

You can then use the following to find out whether the first Job matched by the label selector failed:

failed=$(k get jobs -l label=value -o jsonpath='{.items[0].status.failed}')
# .status.failed is unset when no pods failed, so default to 0
exit "${failed:-0}"
Jemison answered 24/1 at 17:7 Comment(0)
