How to cause gitlab to retry on purpose?
Asked Answered

From this link, https://docs.gitlab.com/ee/ci/yaml/#retry, it appears that GitLab can be made to retry a job under certain circumstances, which are listed in the 'when' section. How do we make a script trigger one of those retry conditions?

Do we return a number? How do we find what number?

For some reason, a service we're using is sometimes never recognized as ready. What I want to do is check for readiness for about 10 minutes and, if it still isn't ready, fail the script with a reason of "stuck_or_timeout_failure" and then have:

retry:
  max: 5
  when:
    - stuck_or_timeout_failure

How do I get there?

Monty answered 16/4, 2020 at 4:38 Comment(0)

GitLab 14.6 (December 2021) offers to tell you why the job has failed:

Job failure reason returned in API response

It can be hard to use the API to gather data about why a job failed.
For example, you might want exact failure reasons to make better use of the retry:when keyword.

Now, the failure_reason is exposed in responses from the Jobs API, and it is much easier to gather job failure data.
Thanks to @albert.vacacintora for this contribution!

See Documentation and Issue.
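
As a quick sketch (the project and job IDs are placeholders, and GITLAB_API_TOKEN is assumed to hold a token with read_api scope), the failure reason of a job could be read from the Jobs API like this:

# Prints the failure reason of a failed job, e.g. "script_failure"
curl --header "PRIVATE-TOKEN: $GITLAB_API_TOKEN" \
  "https://gitlab.com/api/v4/projects/12345/jobs/67890" \
  | jq -r '.failure_reason'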

However, as noted in the comments by Ben Farmer, that does not address how to tell GitLab why a job failed.

gitlab-org/gitlab issue 262674 shows that this is still an unimplemented feature:

[gitlab-ci] new "when" values for "retry" attribute acting on job's script output log with regex and/or exit code

As a dev/devops, I want my pipeline jobs to retry automatically on functional/technical script errors, so that I don't have to do it myself :)

Currently the "retry" attribute in GitLab CI allows us to use several "when" values corresponding to GitLab or GitLab Runner errors. We would also like to be able to decide on a retry based on the exit code of our script or on a regex search of the job output log.


stuck_or_timeout_failure is a valid retry:when value, but there is no supported way for your script to report it (or any custom failure reason) to GitLab; it only applies when GitLab itself marks the job as stuck or timed out. As a workaround, when a job should retry on a specific failure condition such as a service not being ready, you can script that behavior in your CI job.
The retry keyword in GitLab CI lets you specify the conditions under which the job should be retried.

1. Write a script that checks whether the service is ready. If the service is not ready within your specified time frame (e.g., 10 minutes), the script should exit with a non-zero status code.
2. Use retry in .gitlab-ci.yml: configure the retry keyword to respond to that failure. You can simulate a custom failure reason like stuck_or_timeout_failure by having the script exit with a unique exit code for that specific failure.

Script (check_service.sh):

#!/bin/bash

# Function to check service readiness.
# Placeholder logic: replace with your own check; SERVICE_URL is assumed to
# point at the service (e.g. a health endpoint).
check_service() {
    # Return 0 if ready, non-zero if not ready
    curl --silent --fail "${SERVICE_URL}/health" > /dev/null
}

# Try for 10 minutes (60 attempts, 10 seconds apart)
for i in {1..60}; do
    if check_service; then
        echo "Service is ready."
        exit 0
    fi
    sleep 10
done

# If the service is not ready after 10 minutes, exit with a unique code (e.g. 123)
echo "Service not ready after 10 minutes."
exit 123

.gitlab-ci.yml:

job_name:
  script:
    - bash check_service.sh
  retry:
    max: 5
    when:
      - script_failure
      - runner_system_failure
      - stuck_or_timeout_failure

Since your script cannot set stuck_or_timeout_failure itself, that entry only matches when GitLab marks the job as stuck or timed out. A non-zero exit from your script is classified as script_failure instead, which is why it is listed above: the job then retries on the script's exit status, which is how you simulate the behavior you want.

This workaround should allow you to have a retry mechanism based on the readiness of a service. The job retries up to 5 times if the service is not ready within 10 minutes.

"script_failure" is way too broad in a lot of cases, one would want to be able to set a more specific failure reason from within the script if some expected sort of failure occurs. I don't want to retry my scripts for just any reason, only specific expected reasons.

Retrying jobs based on specific exit codes or patterns in the output log would be a solution to this limitation, but it is not yet available in GitLab CI/CD.

A more convoluted workaround is to record the failure exit code and use it in subsequent executions. However, implementing this directly within GitLab CI/CD's current framework is challenging because of the stateless nature of CI/CD jobs: each job execution is isolated and does not retain state or data from previous runs.

To approximate this behavior, you could use an external system or a workaround such as:

  • Store exit code externally: After a job fails with a specific exit code, store this code in an external system such as a database, a file on persistent storage, or an artifact that can be passed between jobs.

  • Read exit code in subsequent runs: At the beginning of each job, check the stored exit code from the external system. If it matches the specific failure code you are interested in, proceed with the retry logic. If it is a different code, you could either abort the job or exit with a success status.

A conceptual implementation using GitLab CI/CD artifacts to pass an exit code between jobs would be:

  1. First Job (Script Execution):

    • Execute your script.
    • Store the exit code in a file.
    • Set this file as an artifact to be passed to the next job.
  2. Second Job (Conditional Execution):

    • Retrieve the exit code from the artifact.
    • Execute only if the exit code matches a specific value.

That would allow subsequent jobs to check the previous job's exit code and conditionally execute based on that code.

As an example, your .gitlab-ci.yml would be:

stages:
  - test
  - conditional_execution

check_service:
  stage: test
  script:
    - ./check_service.sh && echo 0 > exit_code.txt || echo $? > exit_code.txt
  artifacts:
    paths:
      - exit_code.txt
    expire_in: 1 hour

conditional_job:
  stage: conditional_execution
  script:
    - |
      exit_code=$(cat exit_code.txt)
      echo "Previous job exit code: $exit_code"
      if [ "$exit_code" -eq "YOUR_DESIRED_EXIT_CODE" ]; then
        echo "Executing conditional job based on exit code."
        # Place your job execution logic here
      else
        echo "Skipping execution as exit code does not match."
      fi
  needs:
    - job: check_service
      artifacts: true
  • check_service: That job runs your script, captures its exit code in exit_code.txt, and makes this file available as an artifact.
  • conditional_job: That job retrieves the exit code from the artifact. If the exit code matches the desired value (YOUR_DESIRED_EXIT_CODE), it proceeds with its execution logic. Otherwise, it skips execution.

That would "approximate" the behavior of remembering and acting upon a specific failure code.

But yes, GitLab should provide a native feature rendering the above workaround obsolete.

Citrin answered 22/12, 2021 at 22:16 Comment(8)
I do not think this is what the OP is talking about. They aren't asking about querying GitLab as to why a job failed, they want to know how to tell GitLab why a job failed, e.g. from inside their script, so that they can then trigger a retry matching that failure reason. The retry docs (docs.gitlab.com/ee/ci/yaml/#retry) give a list of strings corresponding to failure reasons that can be used to trigger retries, but it says nothing about whether you can manually set these failure reasons yourself.Kisner
I.e. "script_failure" is way too broad in a lot of cases, one would want to be able to set a more specific failure reason from within the script if some expected sort of failure occurs. I don't want to retry my scripts for just any reason, only specific expected reasons.Kisner
It may not yet be possible. I see some issues requesting this feature, e.g.: gitlab.com/gitlab-org/gitlab/-/issues/262674Kisner
@BenFarmer Good point, thank you for the feedback. I have edited the answer to address your comment, and to propose a workaround.Citrin
Thanks for updating the answer! I don't really understand the workaround though. You say e.g. "By configuring the GitLab CI job to retry on any non-zero exit code, we make sure the job only retries for this specific failure scenario. That prevents retries for other types of script failures which might not be relevant or desirable." but how does that make sense? If GitLab is retrying for all non-zero exit codes then that means it is retrying for all script failures, not just specific ones. So isn't this exactly the same as just setting when: script_failure?Kisner
@BenFarmer True, and I have rewritten the last part of the answer. There is no direct implementation of such a feature, only some workaround.Citrin
Ok thanks. The idea of doing something with artifacts sounds interesting, but yeah getting complicated. Let's hope GitLab get around to implementing this sometime because it would be extremely useful...Kisner
@BenFarmer complicated, but not infeasible. I have edited the answer to include a possible implementation. Not ideal, but a possible workaround, pending GitLab's implementation of that feature.Citrin

You can do this using an appropriate timeout: for the job and retry:when: with the job_execution_timeout condition:

  • job_execution_timeout: Retry if the script exceeded the maximum execution time set for the job.

In order for this condition to be met, there's no specific thing you must do in your script, except have it run longer than the configured timeout for the job.

In this example, we set a timeout of 10 minutes and configure up to 5 retries. If the job runs longer than 10 minutes, it fails and is retried, up to 5 times. If the job still times out after the 5 allowed retries are used up, the job and the pipeline ultimately fail with a job timeout failure.

check_for_readiness:
  timeout: 10 minutes
  retry:
    max: 5
    when:
      - stuck_or_timeout_failure
      - job_execution_timeout
  script:
    # assumes `check_for_readiness.sh` checks for readiness just once 
    # and exits with a nonzero code when the service is not ready
    - |
      while true; do
        echo "checking"
        check_for_readiness.sh && exit 0 || sleep 1
      done

A simpler approach, though, would be to not use retry: at all and instead rely on an infinite loop plus a total job timeout of 50 minutes (or whatever you want). That should be more or less functionally equivalent to the previous example (10 minutes * 5 retries = 50 minutes).

check_for_readiness:
  timeout: 50 minutes
  script:
    - |
      while true; do
        echo "checking"
        check_for_readiness.sh && exit 0 || sleep 1
      done

Alternatively, you can handle the timeout, retry, and any custom condition logic entirely within the script itself, similar to what VonC suggested.

Beyond the conditions available in retry:when, you can implement your own retry logic in the script, for example something like this:

#!/usr/bin/env bash
# Don't abort on non-zero exit codes; they are handled explicitly below.
set +e

START_TIME=$(date +%s)

while true; do
  ./check_for_readiness
  exit_code=$?

  # any special handling based on exit code
  if [[ $exit_code == 0 ]]; then
    exit 0
  fi
  
  # For example, don't retry on specific exit code
  if [[ $exit_code == 111 ]]; then
    exit 111
  fi

  # Handle timeout
  ELAPSED=$(($(date +%s) - START_TIME))
  if [[ $ELAPSED -gt 600 ]]; then
    echo "TIMEOUT" > /dev/stderr
    exit 1
  fi
  sleep 1
done

Kennel answered 21/11, 2023 at 21:2 Comment(1)
Interesting alternative approach. Upvoted.Citrin

Since GitLab 16.11, it is possible by default to use retry:exit_codes: https://docs.gitlab.com/ee/ci/yaml/#retryexit_codes

Example:

readiness:
  script:
    - echo "checking..."
    - check_for_readiness.sh && exit 0 || exit 2
  retry:
    max: 2
    exit_codes:
      - 2
      - 137 # OOM kill on K8s

This opens up a lot of possibilities, but it gives you less control.

It is also worth mentioning that retrying inside the job would still be more efficient, because a job-level retry recreates the job, which means a new environment and another clone/fetch.

For exit codes your script cannot catch, however, this approach has clear benefits (see the OOM kill example above).
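
For the catchable cases, a minimal in-job retry sketch (assuming a hypothetical check_for_readiness.sh that exits with code 2 when the service is not ready) could look like this, leaving retry:exit_codes to cover only the codes the script cannot catch:

#!/usr/bin/env bash
# Sketch only: retry "catchable" failures (exit code 2) inside the job so the
# environment and checkout are reused; anything else is surfaced to GitLab,
# where retry:exit_codes can handle uncatchable codes such as 137 (OOM kill).
max_attempts=3
for attempt in $(seq 1 "$max_attempts"); do
  ./check_for_readiness.sh
  code=$?
  if [ "$code" -eq 0 ]; then
    echo "Ready on attempt $attempt."
    exit 0
  elif [ "$code" -ne 2 ]; then
    # Unexpected failure: do not retry in-job; pass the code through.
    exit "$code"
  fi
  echo "Attempt $attempt failed with exit code 2; retrying in-job..."
  sleep 5
done
# Still not ready after in-job retries; exit 2 so a job-level retry can kick in.
exit 2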

Exclaim answered 11/5 at 16:26 Comment(0)
