The default cache directory is running out of disk capacity; I need to change the configuration of the default cache directory.
Update 2024:
FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
Python example:
import os
os.environ['HF_HOME'] = '/blabla/cache/'
bash example:
export HF_HOME=/blabla/cache/
Google Colab example (setting it via os.environ works fine, but the bash export does not; an alternative is the magic command):
%env HF_HOME=/blabla/cache/
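Because the cache location is read when the library is first imported, the order of operations matters. A minimal sketch of the pattern (the path is just the placeholder used above):

```python
import os

# Set HF_HOME *before* importing transformers; otherwise the old
# default (~/.cache/huggingface) is already in effect.
os.environ["HF_HOME"] = "/blabla/cache/"

# Models downloaded after this point land under <HF_HOME>/hub
expected_hub = os.path.join(os.environ["HF_HOME"], "hub")
print(expected_hub)
```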
Old Answer:
You can specify the cache directory every time you load a model with .from_pretrained by setting the parameter cache_dir. You can define a default location by exporting the environment variable TRANSFORMERS_CACHE before you use the library (i.e. before importing it!).
Python example:
import os
os.environ['TRANSFORMERS_CACHE'] = '/blabla/cache/'
bash example:
export TRANSFORMERS_CACHE=/blabla/cache/
Google Colab example (setting it via os.environ works fine, but the bash export does not; an alternative is the magic command):
%env TRANSFORMERS_CACHE=/blabla/cache/
As @cronoik mentioned, as an alternative to modifying the cache path in the terminal, you can set the cache directory directly in your code. Here is the actual code, in case you have any difficulty looking it up on Hugging Face:
tokenizer = AutoTokenizer.from_pretrained("roberta-base", cache_dir="new_cache_dir/")
model = AutoModelForMaskedLM.from_pretrained("roberta-base", cache_dir="new_cache_dir/")
You'll probably want to set the HF_HOME environment variable:
export HF_HOME=/path/to/cache/directory
This is because, besides the model cache of HF Transformers itself, other Hugging Face libraries keep cache directories that also eat space in the home directory. The previous answers and comments did not make this clear.
In addition, it may make sense to set a symlink to catch cases where the environment variable is not set (you may have to move the directory ~/.cache/huggingface out of the way first, if it exists):
ln -s /path/to/cache/directory ~/.cache/huggingface
In particular, the HF_HOME environment variable is also respected by the Hugging Face datasets library, although the documentation does not explicitly state this.
The Transformers documentation describes how the default cache directory is determined:
Cache setup
Pretrained models are downloaded and locally cached at ~/.cache/huggingface/hub. This is the default directory given by the shell environment variable TRANSFORMERS_CACHE. On Windows, the default directory is C:\Users\username\.cache\huggingface\hub. You can change the shell environment variables shown below - in order of priority - to specify a different cache directory:
- Shell environment variable (default): HUGGINGFACE_HUB_CACHE or TRANSFORMERS_CACHE.
- Shell environment variable: HF_HOME.
- Shell environment variable: XDG_CACHE_HOME + /huggingface.
What this piece of documentation doesn't explicitly mention is that HF_HOME defaults to $XDG_CACHE_HOME/huggingface and is used for other Hugging Face caches, e.g. the datasets cache, which is separate from the transformers cache. The value of XDG_CACHE_HOME is machine dependent, but usually it is ~/.cache (and HF defaults to this value if XDG_CACHE_HOME is not set) - hence the usual default of ~/.cache/huggingface.
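The lookup order quoted above can be expressed as a small function. This is a sketch of the documented priority, not the library's actual code:

```python
import os

def resolve_transformers_cache(env):
    """Sketch of the documented lookup order for the model cache."""
    # 1. Explicit cache variables win.
    for var in ("TRANSFORMERS_CACHE", "HUGGINGFACE_HUB_CACHE"):
        if var in env:
            return env[var]
    # 2. HF_HOME: the model cache goes under its "hub" subdirectory.
    if "HF_HOME" in env:
        return os.path.join(env["HF_HOME"], "hub")
    # 3. XDG_CACHE_HOME + /huggingface, falling back to ~/.cache.
    xdg = env.get("XDG_CACHE_HOME", os.path.expanduser("~/.cache"))
    return os.path.join(xdg, "huggingface", "hub")
```

Passing the environment as a dict makes the priority easy to check without touching the real process environment.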
One minor downside of using HF_HOME instead of TRANSFORMERS_CACHE etc. is that the token for the Hugging Face Hub is stored under <HF_HOME>/token by default, so it will be deleted when you delete everything under HF_HOME - but this is also the case in the default setting. If you want to be able to clear the whole HF_HOME without deleting the access token, you can set the variable HF_TOKEN_PATH to a different value, e.g.:
export HF_TOKEN_PATH=$HOME/.huggingface_token
But if you are careful when deleting the caches (e.g. by deleting only one subdirectory of HF_HOME at a time), that won't be necessary.
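The fallback behaviour described above can be sketched as a pure function (this assumes the default <HF_HOME>/token location mentioned earlier; it is an illustration, not the library's code):

```python
import os

def resolve_token_path(env):
    """Sketch: where the Hub access token is looked up."""
    # HF_TOKEN_PATH overrides everything.
    if "HF_TOKEN_PATH" in env:
        return env["HF_TOKEN_PATH"]
    # Otherwise the token lives under HF_HOME (default ~/.cache/huggingface).
    hf_home = env.get("HF_HOME",
                      os.path.expanduser("~/.cache/huggingface"))
    return os.path.join(hf_home, "token")
```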
Typically, you want to keep the datasets and model caches around for longer, but not other things. Also, these caches are large, and you may not want them in your home folder.
So, let's say you create a directory /my_drive/hf where you want Hugging Face to cache everything. You can set the following environment variables:
export HF_HOME=/my_drive/hf/misc
export HF_DATASETS_CACHE=/my_drive/hf/datasets
export TRANSFORMERS_CACHE=/my_drive/hf/models
Now you can clean out non-essential things more easily.
Note that HF_HOME is basically the cache location for everything on the Hub, but above you separate out the datasets and models caches. XDG_CACHE_HOME is not used if HF_HOME is set. If HF_HOME weren't set as above, it would default to $XDG_CACHE_HOME/huggingface.
More info: https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables
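The layout above can be captured in a small helper so that all three variables stay consistent. A sketch, using the same subdirectory names as the export lines; the base path is just an example:

```python
import os

def hf_env_layout(base):
    """Sketch: derive the three cache locations from one base directory."""
    return {
        "HF_HOME": os.path.join(base, "misc"),
        "HF_DATASETS_CACHE": os.path.join(base, "datasets"),
        "TRANSFORMERS_CACHE": os.path.join(base, "models"),
    }

# Apply them to the current process (child processes inherit them too).
os.environ.update(hf_env_layout("/my_drive/hf"))
```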
I usually customize and modify my cache_dir in three ways:
1. Setting a global environment variable.
export TRANSFORMERS_CACHE=/path/to/my/cache/directory
2. Setting the environment variable in the code
import os
from transformers import AutoModel
# Set the new cache directory
os.environ["TRANSFORMERS_CACHE"] = "/path/to/my/cache/directory"
# This directory will now be used as the cache
model = AutoModel.from_pretrained("bert-base-uncased")
This method only affects the currently running Python script.
3. Passing in the cache_dir as a parameter
from transformers import AutoModel, AutoConfig
# Set the cache directory
cache_dir = "/path/to/my/cache/directory"
# Load the model using the specified cache directory
config = AutoConfig.from_pretrained("bert-base-uncased", cache_dir=cache_dir)
model = AutoModel.from_pretrained("bert-base-uncased", config=config, cache_dir=cache_dir)
This method allows you to set the cache directory for specific models or configurations without affecting the global environment variables.
What @Cronoik said is right, but it only works for the current shell.
If you want the cache directory to be permanently changed you need to add it to ~/.bashrc
(on Linux) or similar scripts on other OS.
Open ~/.bashrc in an editor; nano or Vim works fine:
nano ~/.bashrc
or
vim ~/.bashrc
Add this
export HF_HOME="/path/to/dir"
Save and close the file, then reopen the terminal. Now you won't have to set it every time; it'll be permanently set to /path/to/dir.
Create a directory on your host machine that is writable, and mount it to /app/cache, a directory inside the container:
mkdir ~/my_cache
docker run -v ~/my_cache:/app/cache -e TRANSFORMERS_CACHE="/app/cache" <image_name>
Maybe the below will help others. I couldn't get the accepted solution by @cronoik to work, either with the Python code below or by setting the environment variables in Linux. No luck.
# OS variable (doesn't work for me)
import os
os.environ['HF_HOME'] = '/my/folder/of/choice'
So I'm using symlinks instead. I had issues symlinking the huggingface folder directly, so instead I symlinked the contained folders 'hub' and 'modules', and it works fine now.
NOTE: I first deleted ALL content in the ~/.cache/huggingface folder - that means you will need to re-download all models. I wanted to start "clean", but of course you can alternatively copy the content into the target folder (or temporarily somewhere else) in order not to lose any previously downloaded models.
# from inside the ~/.cache/huggingface folder open terminal and do this
ln -s /your/new/path/.../huggingface/hub hub
ln -s /your/new/path/.../huggingface/modules modules
This creates a symlink for each of those two folders. It now works perfectly.
ls -l # check if symlinks are registering
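You can also sanity-check that a symlink resolves the way the cache links above should. A sketch using a throwaway temporary directory, not your real cache:

```python
import os
import tempfile

# Create a target directory and a symlink to it, then verify the
# link resolves to the real location (mirrors the ln -s setup above).
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "new_cache", "hub")
    os.makedirs(target)
    link = os.path.join(tmp, "hub")
    os.symlink(target, link)  # like: ln -s /your/new/path/hub hub
    links_resolve = os.path.realpath(link) == os.path.realpath(target)
```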