Caching virtual environment for gitlab-ci
I cached Pip packages using a Gitlab CI script, so that's not an issue.

Now I also want to cache a Conda virtual environment, because it reduces the time needed to set up the environment.

I cached the whole virtual environment. Unfortunately, it takes a long time at the end of the job to cache all the venv files.

I tried caching only the $CI_PROJECT_DIR/myenv/lib/python3.6/site-packages folder, and it seems to reduce the pipeline's run time.

My question is: am I doing it correctly?

The script is given below:

gitlab-ci.yml

image: continuumio/miniconda3:latest

cache:
  paths:
    - .pip
    - $CI_PROJECT_DIR/myenv/lib/python3.6/site-packages

before_script:
  - chmod +x gitlab-ci.sh
  - ./gitlab-ci.sh

stages:
  - test

test:
  stage: test
  script:
    - python eval.py

gitlab-ci.sh

#!/usr/bin/env bash
ENV_NAME=myenv
ENV_REQUIREMENTS=requirements.txt

if [ ! -d "$CI_PROJECT_DIR/$ENV_NAME" ]; then
    echo "Environment $ENV_NAME does not exist. Creating it now!"
    conda create -y --prefix "$CI_PROJECT_DIR/$ENV_NAME" python=3.6
fi

echo "Activating environment: $CI_PROJECT_DIR/$ENV_NAME"
source activate "$CI_PROJECT_DIR/$ENV_NAME"

echo "Installing PIP"
conda install -y pip

echo "PIP: installing required packages"
which pip
pip --cache-dir=.pip install -r "$ENV_REQUIREMENTS"
Pearly answered 31/1, 2018 at 10:47 Comment(0)
Reusing the pip cache between builds is a very good idea, but doing the same for virtualenvs is a really bad one.

This is because a virtualenv can easily become broken in ways that you cannot really detect at runtime. This not only happens; it happens more often than you would imagine, so please avoid it.

PS. Advice from someone who learned that the hard way.

Sturgill answered 27/2, 2019 at 16:19 Comment(3)
Could you give more details on how it could become messed up? – Divisor
@Sturgill – it's possible you could be correct, however GitLab's OFFICIAL docs contradict your answer. Your answer needs explanation or downvotes. docs.gitlab.com/ee/ci/caching/#caching-python-dependencies – Foghorn
Actually, GitLab docs have been updated to recommend exactly this approach. – Gabo

I don't have enough rep to comment on @sorin's answer, but we're running into the same issue right now with a current GitLab (14.6).

We have four jobs, all using the same base Docker image. In one, we set up a virtualenv and then cache it; the other three jobs pull the cache, activate the venv, and then try to use it. Those three jobs often fail because they cannot find the right python or load particular modules from the activated venv.

The problem with virtualenv is that (at least as of the venv module in Python 3.3) virtualenvs are not relocatable. The activate script contains the absolute path to the virtualenv in a VIRTUAL_ENV variable. GitLab runners by default include the unique runner token as part of the build directory, which then becomes part of that VIRTUAL_ENV variable. So if you cache the virtualenv on one runner, then try to use it on another runner, it will fail because the paths don't match. activate won't even warn you that the VIRTUAL_ENV path doesn't exist.
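A quick local sketch illustrates this (the runner-like path below is made up for illustration): create a venv under one absolute path and look at what the activate script records.

```shell
# Create a venv under a path that mimics one runner's build directory
# (hypothetical path, for illustration only).
python3 -m venv /tmp/runner-AAAA/builds/venv

# The activate script records the absolute creation path in VIRTUAL_ENV.
grep 'VIRTUAL_ENV' /tmp/runner-AAAA/builds/venv/bin/activate

# Restore this cache under a different build directory and that recorded
# path no longer exists, but `source .../activate` will not warn you.
```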

If you have one GitLab runner, you're probably OK. If not, you can write scripts to update a virtualenv yourself, which may or may not work well (see Can I move a virtualenv?). Or do the safe thing and recreate the venv in every job; you can get away with caching the pip cache, at least.

Disencumber answered 27/1, 2022 at 0:48 Comment(0)

We are successfully using the method outlined in the docs https://docs.gitlab.com/ee/ci/caching/#caching-python-dependencies

# Change pip's cache directory to be inside the project directory since we can
# only cache local items.
variables:
  PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"

# Pip's cache doesn't store the python packages
# https://pip.pypa.io/en/stable/reference/pip_install/#caching
#
# If you want to also cache the installed packages, you have to install
# them in a virtualenv and cache it as well.
cache:
  paths:
    - .cache/pip
    - venv/

There may be other things missing, but your first pass probably misses:

  • when reducing the size of the cached .../venv/ directory tree, you probably still need .../venv/bin, as this is required to find the correct python version; check this locally after activating your venv with the command which -a python3
  • if pip will be used again (such as later in your build via some make recipe) you need to move the pip cache as shown above.
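As a minimal sketch of the first point (the venv path here is a throwaway, made up for the example): after activation the interpreter is resolved through venv/bin on PATH, which is exactly what gets lost if you cache only site-packages.

```shell
# Create and activate a throwaway venv (illustrative path).
python3 -m venv /tmp/demo-venv
. /tmp/demo-venv/bin/activate

# `which -a python3` lists every python3 on PATH; the first entry is the
# venv's copy in /tmp/demo-venv/bin, because activate prepended that dir.
which -a python3
```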
Foghorn answered 26/10, 2020 at 16:1 Comment(0)

In our project using Kubernetes runners we use the following approach:

variables:
  PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"

cache:
  paths:
    - .cache/pip
    - venv

.python-base:
  image: "some-image-with-python3"
  before_script:
    - test -f venv/bin/python || python3 -m venv venv
    - source venv/bin/activate
    - pip install --upgrade pip
    - pip install -r requirements.txt

some-python-job:
  extends: .python-base

The working directory is always the same on k8s runners, so there is no problem with references to non-existent directories. Because the before_script verifies that venv/bin/python still resolves, a new venv is created only when the Python version has changed or no venv was present.
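The `test -f venv/bin/python || python3 -m venv venv` check can be exercised locally like this (the path is a stand-in for the project directory):

```shell
VENV=/tmp/demo-ci-venv   # stand-in for $CI_PROJECT_DIR/venv

# `test -f` follows symlinks, so this rebuilds the venv both when it is
# missing and when its bin/python symlink dangles (e.g. after the base
# image's Python was upgraded and the old interpreter path vanished).
test -f "$VENV/bin/python" || python3 -m venv "$VENV"

. "$VENV/bin/activate"
python -c 'import sys; print(sys.prefix)'   # the venv's own prefix
```

Running it a second time is a no-op: the existing venv passes the `test -f` check and is reused as-is.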

Not sure what other things can be messed up (see sorin's answer), but this has worked very well for us so far. We use it mainly for Python scripts that run as part of the build itself, where it really saves time. It may be wise to be more cautious when using this approach to produce Python packages used elsewhere.

Macaroon answered 17/7 at 8:28 Comment(0)
