CI jobs fail intermittently on GitLab-runner with error "fatal: shallow file has changed since we read it"

Jobs on my self-hosted GitLab deployment recently started failing intermittently with this git error:

fatal: shallow file has changed since we read it

An example of a full job log is:

Running with gitlab-runner 13.12.0 (v13.12.0)
  on ....50ab V...
Preparing the "shell" executor
00:00
Using Shell executor...
Preparing environment
00:00
Running on saxtons...
Getting source from Git repository
00:03
$ /nix/store/s0frm5z2k43qm66q39ifl2vz96hmyxg4-pre-clone
Fetching changes with git depth set to 50...
Reinitialized existing Git repository in /var/lib/private/gitlab-runner/builds/V.../2/privatestorage/PrivateStorageio/.git/
fatal: shallow file has changed since we read it
Cleaning up file based variables
00:00
ERROR: Job failed: exit status 1

The pre-clone script contains the following; it fixes permissions on unwritable directories that would otherwise cause the runner's attempts to clean up the git checkout to fail:

chmod --recursive u+rwX .
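
For reference, a minimal sketch of what that pre-clone script might look like as a standalone shell script; the only line actually quoted from it is the chmod, so the wrapper around it is an assumption:

#!/bin/sh
# Hypothetical wrapper: make everything under the build directory user-writable
# (and directories traversable) so the runner can delete the old checkout.
set -eu
chmod --recursive u+rwX .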

The GitLab-runner's config.toml contains this:

check_interval = 0
concurrent = 8

[[runners]]
executor = "shell"
name = "...50ab"
pre_clone_script = "/nix/store/s0frm5z2k43qm66q39ifl2vz96hmyxg4-pre-clone"
token = "V..."
url = "https://.../"

[runners.cache]
[runners.cache.azure]

[runners.cache.gcs]

[runners.cache.s3]

[runners.custom_build_dir]

[[runners]]
executor = "docker"
name = "...5afc"
token = "..."
url = "https://.../"

[runners.cache]
[runners.cache.azure]

[runners.cache.gcs]

[runners.cache.s3]

[runners.custom_build_dir]

[runners.docker]
disable_cache = false
disable_entrypoint_overwrite = false
image = "nixos/nix"
oom_kill_disable = false
privileged = false
shm_size = 0
tls_verify = false
volumes = ["/cache"]

[[runners]]
executor = "docker"
name = "...c334"
token = "..."
url = "https://.../"

[runners.cache]
[runners.cache.azure]

[runners.cache.gcs]

[runners.cache.s3]

[runners.custom_build_dir]

[runners.docker]
disable_cache = false
disable_entrypoint_overwrite = false
image = ".../ubuntu-python3-awscli"
oom_kill_disable = false
privileged = false
shm_size = 0
tls_verify = false
volumes = ["/cache"]

[session_server]
session_timeout = 1800

GitLab-runner is deployed on NixOS 21.05 (using the NixOS package/service configuration).

I've never seen this git error before.

  • What does it indicate is happening?
  • How do I configure GitLab to stop doing whatever causes this?
Weinert answered 5/10, 2021 at 20:42 Comment(8)
It's easy to test. Comment out the pre_clone_script line, wait 5 seconds (gitlab-runner auto-detects config changes and reloads itself), retry the job, and see whether anything changes. But definitely check whether you have two runners with the same name =, possibly on different machines, and also confirm that with [runners.custom_build_dir] gitlab-runner creates the jobs in different directories (I think they should be in /home/gitlab-runner/builds/projname/here, but I'm not sure)Eileen
It's intermittent. It happens on some runs, not others. Ugh, I see I left the word "intermittent" out of the question. :/ Sorry about that. Also, without the pre_clone_script, it fails with permission errors before it gets this far.Weinert
"It happens on some runs, not others" and "shallow file has changed" together suggest that two jobs for the same project are running at the same time and using the same path. Both are doing a checkout at once, so one of them loses the race and sees updated files. Try running something like 10 same-named jobs for the same project on different branches, for example.Eileen
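
A hedged sketch of one way to attempt that reproduction: push several throwaway branches in quick succession so the runner picks up concurrent same-named jobs for the same project. The branch names are placeholders, and this only triggers pipelines if the project's CI runs on all branches:

# Push ten throwaway branches from the current commit; each push can start
# a pipeline, so the runner ends up with concurrent same-named jobs.
for i in $(seq 1 10); do
  git push origin "HEAD:refs/heads/shallow-race-test-$i"
done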
Suppose this is happening. Then what?Weinert
Weeell, then you have to find the reason of such behavior and change it.Eileen
Great. I hope that someone who reads my question can help me do that.Weinert
@Jean-PaulCalderone did you ever find a solution?Mckenzie
I don't have a fix. I have a work-around where I chmod/chown all files beneath the runner path to the correct permissions/owner at the start of every single job. :/ (Hm. On reflection, I'm not sure how that would fix the problem in this question, so maybe I have another work-around somewhere that I've forgotten ...)Weinert

TL;DR:

Your build directories should be unique to the generated builds.

# .gitlab-ci.yml: add as a global config option
variables:
  GIT_CLONE_PATH: '$CI_BUILDS_DIR/$CI_PROJECT_NAME/$CI_PIPELINE_ID'


# Add to the gitlab-runner config.toml
[[runners]]
  pre_clone_script = "rm -f /builds/*/*/.git/shallow.lock"
  [runners.custom_build_dir]
    enabled = true

REASONING: I have a setup with multiple Docker gitlab-runners on the same host.

Concurrent pipelines running with Docker executors were accessing the same build directory:

/build/PROJECT_NAME/REPO/.git/

They would overwrite each other's directory contents. Also, cancelling a job during the clone stage would leave a stale shallow.lock file behind.
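
As a complement to the pre_clone_script above, a hedged sketch for cleaning such leftover lock files directly on the runner host; the builds root below is an assumption and should be adjusted to the runner's actual builds_dir (for example /var/lib/gitlab-runner/builds for a shell executor, or /builds inside a Docker executor):

# Find shallow.lock files older than an hour under the builds root and
# delete them, printing each path as it goes.
find /builds -type f -path '*/.git/shallow.lock' -mmin +60 -print -delete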

Bela answered 14/10, 2022 at 14:56 Comment(1)
How does this work if you're using shared runners where you don't have access to the host to change config.toml?Unpen

This question and answer finally fixed my problem, though in my case I had to make the build directory unique per job, not just per pipeline, so I set:

variables:
  GIT_CLONE_PATH: '$CI_BUILDS_DIR/$CI_PROJECT_NAME/job_$CI_JOB_ID'

And I see no reason to delete shallow.lock.

Arianaariane answered 10/4 at 13:0 Comment(0)
