CI jobs fail intermittently on GitLab-runner with error "fatal: shallow file has changed since we read it"

Jobs on my self-hosted GitLab deployment recently started failing intermittently with this git error:

fatal: shallow file has changed since we read it

An example of a full job log is:

Running with gitlab-runner 13.12.0 (v13.12.0)
  on ....50ab V...
Preparing the "shell" executor
00:00
Using Shell executor...
Preparing environment
00:00
Running on saxtons...
Getting source from Git repository
00:03
$ /nix/store/s0frm5z2k43qm66q39ifl2vz96hmyxg4-pre-clone
Fetching changes with git depth set to 50...
Reinitialized existing Git repository in /var/lib/private/gitlab-runner/builds/V.../2/privatestorage/PrivateStorageio/.git/
fatal: shallow file has changed since we read it
Cleaning up file based variables
00:00
ERROR: Job failed: exit status 1

The pre-clone script contains the following; it fixes permissions on unwritable directories that would otherwise cause the runner's attempts to clean up the git checkout to fail:

chmod --recursive u+rwX .
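
For reference, a minimal sketch of what that pre-clone script might look like as a standalone shell script; the only line actually quoted from it is the chmod, so the wrapper around it is an assumption:

#!/bin/sh
# Hypothetical wrapper: make everything under the build directory user-writable
# (and directories traversable) so the runner can delete the old checkout.
set -eu
chmod --recursive u+rwX .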

The GitLab-runner's config.toml contains this:

check_interval = 0
concurrent = 8

[[runners]]
executor = "shell"
name = "...50ab"
pre_clone_script = "/nix/store/s0frm5z2k43qm66q39ifl2vz96hmyxg4-pre-clone"
token = "V..."
url = "https://.../"

[runners.cache]
[runners.cache.azure]

[runners.cache.gcs]

[runners.cache.s3]

[runners.custom_build_dir]

[[runners]]
executor = "docker"
name = "...5afc"
token = "..."
url = "https://.../"

[runners.cache]
[runners.cache.azure]

[runners.cache.gcs]

[runners.cache.s3]

[runners.custom_build_dir]

[runners.docker]
disable_cache = false
disable_entrypoint_overwrite = false
image = "nixos/nix"
oom_kill_disable = false
privileged = false
shm_size = 0
tls_verify = false
volumes = ["/cache"]

[[runners]]
executor = "docker"
name = "...c334"
token = "..."
url = "https://.../"

[runners.cache]
[runners.cache.azure]

[runners.cache.gcs]

[runners.cache.s3]

[runners.custom_build_dir]

[runners.docker]
disable_cache = false
disable_entrypoint_overwrite = false
image = ".../ubuntu-python3-awscli"
oom_kill_disable = false
privileged = false
shm_size = 0
tls_verify = false
volumes = ["/cache"]

[session_server]
session_timeout = 1800

GitLab-runner is deployed on NixOS 21.05 (using the NixOS package/service configuration).

I've never seen this git error before.

  • What does it indicate is happening?
  • How do I configure GitLab to stop doing whatever causes this?
Weinert answered 5/10, 2021 at 20:42 Comment(8)
It's easy to test. Comment out the pre_clone_script line, wait 5 seconds (gitlab-runner auto-detects config changes and reloads itself), retry the job, and see whether anything changes. But definitely check whether you have two runners with the same name =, possibly on different machines, and also confirm that with [runners.custom_build_dir] gitlab-runner creates the jobs in different directories (I think they should be in /home/gitlab-runner/builds/projname/here, but I'm not sure)Eileen
It's intermittent. It happens on some runs, not others. Ugh, I see I left the word "intermittent" out of the question. :/ Sorry about that. Also, without the pre_clone_script, it fails with permission errors before it gets this far.Weinert
"It happens on some runs, not others" and "shallow file has changed" together suggest that two jobs for the same project are running at the same time and using the same path. Both are doing a checkout at once, so one of them loses the race and sees updated files. Try running something like 10 same-named jobs for the same project on different branches, for example.Eileen
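
A hedged sketch of one way to attempt that reproduction: push several throwaway branches in quick succession so the runner picks up concurrent same-named jobs for the same project. The branch names are placeholders, and this only triggers pipelines if the project's CI runs on all branches:

# Push ten throwaway branches from the current commit; each push can start
# a pipeline, so the runner ends up with concurrent same-named jobs.
for i in $(seq 1 10); do
  git push origin "HEAD:refs/heads/shallow-race-test-$i"
done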
Suppose this is happening. Then what?Weinert
Weeell, then you have to find the reason of such behavior and change it.Eileen
Great. I hope that someone who reads my question can help me do that.Weinert
@Jean-PaulCalderone did you ever find a solution?Mckenzie
I don't have a fix. I have a work-around where I chmod/chown all files beneath the runner path to the correct permissions/owner at the start of every single job. :/ (Hm. On reflection, I'm not sure how that would fix the problem in this question, so maybe I have another work-around somewhere that I've forgotten ...)Weinert

TL;DR:

Your build directories should be unique to the generated builds.

# .gitlab-ci.yml: add as a global config option
variables:
  GIT_CLONE_PATH: '$CI_BUILDS_DIR/$CI_PROJECT_NAME/$CI_PIPELINE_ID'


# Add to the gitlab-runner config.toml
[[runners]]
  pre_clone_script = "rm -f /builds/*/*/.git/shallow.lock"
  [runners.custom_build_dir]
    enabled = true

REASONING: I have a setup with multiple Docker gitlab-runners on the same host.

Concurrent pipelines running with Docker executors were accessing the same build directory:

/build/PROJECT_NAME/REPO/.git/

They would overwrite each other's directory contents. Also, cancelling a job during the clone stage would leave a stale shallow.lock file behind.
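
As a complement to the pre_clone_script above, a hedged sketch for cleaning such leftover lock files directly on the runner host; the builds root below is an assumption and should be adjusted to the runner's actual builds_dir (for example /var/lib/gitlab-runner/builds for a shell executor, or /builds inside a Docker executor):

# Find shallow.lock files older than an hour under the builds root and
# delete them, printing each path as it goes.
find /builds -type f -path '*/.git/shallow.lock' -mmin +60 -print -delete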

Bela answered 14/10, 2022 at 14:56 Comment(1)
How does this work if you're using shared runners where you don't have access to the host to change config.toml?Unpen

This question and answer finally fixed my problem, though in my case I had to make the build directory unique per job, not just per pipeline, so I set:

variables:
  GIT_CLONE_PATH: '$CI_BUILDS_DIR/$CI_PROJECT_NAME/job_$CI_JOB_ID'

And I see no reason to delete shallow.lock.

Arianaariane answered 10/4 at 13:0 Comment(0)
