GitLab 14.6 (December 2021) can now tell you why a job failed:
Job failure reason returned in API response
It can be hard to use the API to gather data about why a job failed. For example, you might want exact failure reasons to make better use of the retry:when keyword.
Now, the failure_reason is exposed in responses from the Jobs API, and it is much easier to gather job failure data.
Thanks to @albert.vacacintora for this contribution!
See Documentation and Issue.
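For instance, you could list a project's failed jobs and print each job's failure_reason with the Jobs API and jq. This is a minimal sketch: the host, project ID, and access token below are placeholders for your own instance:

# List failed jobs for a project and show their failure_reason
curl --silent --header "PRIVATE-TOKEN: <your_access_token>" \
  "https://gitlab.example.com/api/v4/projects/<project-id>/jobs?scope[]=failed" \
  | jq '.[] | {id, name, failure_reason}'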
However, as noted in the comments by Ben Farmer, that does not address how to tell GitLab why a job failed.
gitlab-org/gitlab issue 262674 shows that this is still an unimplemented feature:
[gitlab-ci] new "when" values for "retry" attribute acting on job's script output log with regex and/or exit code
As a dev/devops, I want my pipeline jobs to retry automatically on functional/technical script errors, so that I don't have to do it myself :)
currently the "retry" attribute in the gitlab ci allows us to use several "when" corresponding to gitlab or gitlab-runner errors, we would also like to be able to decide on a retry based on the exit code of our script or on a regex search of the job output log.
You cannot report a custom failure reason like stuck_or_timeout_failure from your own script: GitLab assigns failure reasons itself, and a failing script is always classified as script_failure. As a workaround, when a job should retry on a specific failure condition, such as a service not being ready, you can script that behavior in your CI job.
The retry keyword in GitLab CI allows you to specify conditions under which the job should be retried.
- Write a script that checks whether the service is ready. If the service is not ready within your specified time frame (e.g., 10 minutes), the script should exit with a non-zero status code.
- Use retry in .gitlab-ci.yml: configure the retry keyword in your .gitlab-ci.yml to respond to the failure.
- Simulate custom failure reasons like stuck_or_timeout_failure by having the script exit with a unique exit code for that specific failure.
Script (check_service.sh):
#!/bin/bash

# Function to check service readiness.
# Return 0 if ready, non-zero if not ready.
check_service() {
  # Implement your own check here; as an example, probe a health endpoint
  # (SERVICE_URL is a placeholder for your service):
  curl --silent --fail "${SERVICE_URL:-http://localhost:8080/health}" > /dev/null
}

# Try for 10 minutes (60 attempts, 10 seconds apart)
for i in {1..60}; do
  if check_service; then
    echo "Service is ready."
    exit 0
  fi
  sleep 10
done

# If the service is not ready after 10 minutes, exit with a unique code (e.g., 123)
echo "Service not ready after 10 minutes."
exit 123
.gitlab-ci.yml:
job_name:
  script:
    - bash check_service.sh
  retry:
    max: 2   # GitLab caps retry:max at 2
    when:
      - runner_system_failure
      - stuck_or_timeout_failure
      - script_failure   # needed so the job retries on the script's non-zero exit code
Note that stuck_or_timeout_failure is a reason GitLab assigns itself (a job that hangs or hits its timeout); your script cannot trigger it, because a script exiting non-zero is always classified as script_failure. That is why script_failure is the entry that makes the job retry on the script's exit; the unique exit code (123) only simulates the custom reason in the job log.
This workaround gives you a retry mechanism based on the readiness of a service: if the service is not ready within 10 minutes, the job fails and is retried, at most twice, since GitLab caps retry:max at 2.
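To sanity-check the script outside CI (assuming the check_service.sh sketch above), you can run it locally and inspect its exit code:

bash check_service.sh
echo "Exit code: $?"   # 0 when the service is ready, 123 after the 10-minute timeout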
"script_failure" is way too broad in a lot of cases, one would want to be able to set a more specific failure reason from within the script if some expected sort of failure occurs. I don't want to retry my scripts for just any reason, only specific expected reasons.
Retrying jobs based on specific exit codes or patterns in the output log would address this limitation, but that capability is not yet available in GitLab CI/CD.
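Until such a feature exists, one in-job approximation is to wrap the command in a small retry loop that only retries on failures you consider retryable. The sketch below is illustrative only: my_task.sh, exit code 42, and the "temporary error" message are hypothetical placeholders for your own command and expected failures:

#!/bin/bash
# Retry a command only on expected, retryable failures, identified either by
# a specific exit code or by a known message in its output.
max_attempts=3
for attempt in $(seq 1 "$max_attempts"); do
  output=$(./my_task.sh 2>&1)
  code=$?
  echo "$output"
  if [ "$code" -eq 0 ]; then
    echo "Succeeded on attempt $attempt."
    exit 0
  elif [ "$code" -eq 42 ] || echo "$output" | grep -q "temporary error"; then
    echo "Attempt $attempt failed with an expected, retryable error; retrying..."
    sleep 5
  else
    echo "Unexpected failure (exit $code); not retrying."
    exit "$code"
  fi
done
echo "Still failing after $max_attempts attempts."
exit 1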
A more convoluted workaround is to memorize the failure exit code and reuse it in subsequent executions. However, implementing that directly within GitLab CI/CD is challenging because of the stateless nature of CI/CD jobs: each job execution is isolated and does not retain state or data from previous runs.
To approximate this behavior, you could use an external system as a workaround:
- Store the exit code externally: after a job fails with a specific exit code, store that code in an external system, such as a database, a file on persistent storage, or an artifact passed between jobs.
- Read the exit code in subsequent runs: at the beginning of each job, check the stored exit code. If it matches the specific failure code you are interested in, proceed with the retry logic; if it is a different code, either abort the job or exit with a success status.
A conceptual implementation using GitLab CI/CD artifacts to pass an exit code between jobs would be:
First Job (Script Execution):
- Execute your script.
- Store the exit code in a file.
- Set this file as an artifact to be passed to the next job.
Second Job (Conditional Execution):
- Retrieve the exit code from the artifact.
- Execute only if the exit code matches a specific value.
That would allow subsequent jobs to check the previous job's exit code and conditionally execute based on that code.
As an example, your .gitlab-ci.yml would be:
stages:
  - test
  - conditional_execution

check_service:
  stage: test
  script:
    # Always record the script's exit code (0 on success) so the next job can read it
    - ./check_service.sh && echo 0 > exit_code.txt || echo $? > exit_code.txt
  artifacts:
    paths:
      - exit_code.txt
    expire_in: 1 hour

conditional_job:
  stage: conditional_execution
  script:
    - exit_code=$(cat exit_code.txt)
    - echo "Previous job exit code: $exit_code"
    - |
      if [ "$exit_code" -eq "YOUR_DESIRED_EXIT_CODE" ]; then
        echo "Executing conditional job based on exit code."
        # Place your job execution logic here
      else
        echo "Skipping execution as exit code does not match."
      fi
  needs:
    - job: check_service
      artifacts: true
check_service: that job runs your script, captures its exit code in exit_code.txt, and makes the file available as an artifact.
conditional_job: that job retrieves the exit code from the artifact. If the exit code matches the desired value (YOUR_DESIRED_EXIT_CODE), it proceeds with its execution logic; otherwise, it skips execution.
That would "approximate" the behavior of remembering and acting upon a specific failure code.
But yes, GitLab should provide a native feature rendering the above implementation obsolete.