How to change huggingface transformers default cache directory

The default cache directory is running out of disk space; I need to change the configuration so that the cache is stored somewhere else.

Trishtrisha answered 8/8, 2020 at 7:28 Comment(3)
Symlinking the huggingface directory also works.Positively
not if there isn't enough "disk capacity".Harlequinade
@Harlequinade Maybe you were thinking of something different, but creating an empty directory on a different filesystem with more capacity and then making a symlink from ~/.cache/huggingface to that directory does work - at least until you need to clear the cache for some reason and forgot it was a symlink ;-) Setting HF_HOME is a bit cleaner, though, and works equally well on all platforms.Rounds
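The symlink approach described in the comments above can be sketched as follows. Temporary directories stand in for the larger filesystem and for $HOME, so the sketch is safe to run as-is; in practice you would use your real paths and move any existing ~/.cache/huggingface contents to the target first.

```shell
# Stand-ins: a filesystem with more space, and the user's home directory.
BIG_DISK=$(mktemp -d)
FAKE_HOME=$(mktemp -d)
mkdir -p "$FAKE_HOME/.cache"

# 1. Create the real cache directory on the larger disk.
mkdir -p "$BIG_DISK/huggingface"

# 2. Point .cache/huggingface at it.
ln -s "$BIG_DISK/huggingface" "$FAKE_HOME/.cache/huggingface"

# Anything written "into" the cache now lands on the larger disk.
touch "$FAKE_HOME/.cache/huggingface/marker"
ls "$BIG_DISK/huggingface"
```

As the comment notes, setting HF_HOME achieves the same thing without the risk of forgetting the link exists when you later clear the cache.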

Update 2024

FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.

Python example:

import os
os.environ['HF_HOME'] = '/blabla/cache/'

bash example:

export HF_HOME=/blabla/cache/

Google Colab example (setting the variable via os.environ works fine, but the bash export does not; an alternative is the %env magic command):

%env HF_HOME=/blabla/cache/

Old Answer:

You can specify the cache directory every time you load a model with .from_pretrained by setting the cache_dir parameter. You can define a default location by exporting the environment variable TRANSFORMERS_CACHE before you use the library (i.e. before importing it!).

Python example:

import os
os.environ['TRANSFORMERS_CACHE'] = '/blabla/cache/'

bash example:

export TRANSFORMERS_CACHE=/blabla/cache/

Google Colab example (setting the variable via os.environ works fine, but the bash export does not; an alternative is the %env magic command):

%env TRANSFORMERS_CACHE=/blabla/cache/
Sumpter answered 8/8, 2020 at 10:39 Comment(6)
The "before importing the module" saved me for a related problem using flair, prompting me to import flair after changing the huggingface cache env variable.Greece
In addition, the environment variable for the datasets cache is HF_HOME. github.com/huggingface/transformers/issues/8703Cupola
First run the bash command above in a Linux terminal, then run the Python code.Uneventful
@Uneventful you don't need both, one of them is enough.Sumpter
@Sumpter if you do python and jupyter notebook in vscode then you will need both. One for each.Middlebuster
This solution also works if you meet OSError: [Errno 122] Disk quota exceeded for any reason.Overabound

As @cronoik mentioned, as an alternative to setting the cache path in the terminal, you can set the cache directory directly in your code. I will just provide the actual code in case you have any difficulty looking it up on HuggingFace:

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base", cache_dir="new_cache_dir/")

model = AutoModelForMaskedLM.from_pretrained("roberta-base", cache_dir="new_cache_dir/")
Cathay answered 21/10, 2020 at 7:6 Comment(0)

You'll probably want to set the HF_HOME environment variable:

export HF_HOME=/path/to/cache/directory

This is because besides the model cache of HF Transformers itself, there are cache directories of other Hugging Face libraries that also eat space in the home directory. The previous answers and comments did not make this clear.

In addition, it may make sense to set a symlink to catch cases where the environment variable is not set (you may have to move away the directory ~/.cache/huggingface before, if it exists):

ln -s /path/to/cache/directory ~/.cache/huggingface

In particular, the HF_HOME environment variable is also respected by the Hugging Face datasets library, although the documentation does not explicitly state this.

The Transformers documentation describes how the default cache directory is determined:

Cache setup

Pretrained models are downloaded and locally cached at: ~/.cache/huggingface/hub. This is the default directory given by the shell environment variable TRANSFORMERS_CACHE.

On Windows, the default directory is given by C:\Users\username\.cache\huggingface\hub. You can change the shell environment variables shown below - in order of priority - to specify a different cache directory:

  1. Shell environment variable (default): HUGGINGFACE_HUB_CACHE or TRANSFORMERS_CACHE.
  2. Shell environment variable: HF_HOME.
  3. Shell environment variable: XDG_CACHE_HOME + /huggingface.

What this piece of documentation doesn't explicitly mention is that HF_HOME defaults to $XDG_CACHE_HOME/huggingface and is used for other Hugging Face caches, e.g. the datasets cache, which is separate from the transformers cache. The value of XDG_CACHE_HOME is machine dependent, but usually it is ~/.cache (and HF defaults to this value if XDG_CACHE_HOME is not set) - hence the usual default of ~/.cache/huggingface.
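The lookup described above can be sketched in plain Python. This is a simplified reimplementation for illustration only, not the actual huggingface_hub code:

```python
import os

def resolve_hf_home(env):
    """Sketch of how the Hugging Face cache root is resolved:
    HF_HOME first, then XDG_CACHE_HOME/huggingface, then
    ~/.cache/huggingface as the final fallback."""
    if "HF_HOME" in env:
        return env["HF_HOME"]
    xdg = env.get("XDG_CACHE_HOME") or os.path.join(os.path.expanduser("~"), ".cache")
    return os.path.join(xdg, "huggingface")

print(resolve_hf_home({"HF_HOME": "/big_disk/hf"}))       # /big_disk/hf
print(resolve_hf_home({"XDG_CACHE_HOME": "/var/cache"}))  # /var/cache/huggingface
print(resolve_hf_home({}))                                # usually ~/.cache/huggingface
```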

One minor downside of using HF_HOME instead of TRANSFORMERS_CACHE etc. is that the token for the Hugging Face Hub is stored under <HF_HOME>/token by default, so it will be deleted when you delete everything under HF_HOME - but this is also the case in the default setting. If you want to be able to clear the whole HF_HOME without deleting the access token, you can set the variable HF_TOKEN_PATH to a different value, e.g.:

export HF_TOKEN_PATH=$HOME/.huggingface_token

But if you are careful when deleting the caches (e.g. by deleting only one subdirectory of HF_HOME at a time), that won't be necessary.

Rounds answered 21/6, 2022 at 15:6 Comment(3)
this worked for me! after setting only TRANSFORMERS_CACHE still, the cache was saved on home dir but after the three settings, it worked!Hidebound
TRANSFORMERS_CACHE only controls the Hugging Face Transformers cache, i.e. for model checkpoints. Other HF libraries like datasets, evaluate, hub and autotrain have cache directories that are influenced by HF_HOME, but not by TRANSFORMERS_CACHE.Rounds
I linked to the main branch in my previous comment, so the content changes. For future reference: datasets, evaluate, hub and autotrainRounds

Typically, you want to keep dataset and model caches around for longer, but not other things. Also, these caches are large, and you may not want them in your home folder.

So, let's say you create a directory /my_drive/hf where you want Hugging Face to cache everything. You can set the following environment variables:

export HF_HOME=/my_drive/hf/misc
export HF_DATASETS_CACHE=/my_drive/hf/datasets
export TRANSFORMERS_CACHE=/my_drive/hf/models

Now you can clean out non-essential things more easily.

Note that HF_HOME is basically the cache location for everything Hub-related, but the settings above separate out the datasets and models caches. XDG_CACHE_HOME is not used if HF_HOME is set; if HF_HOME weren't set as above, it would default to $XDG_CACHE_HOME/huggingface.

More info: https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables

Magnate answered 25/6, 2023 at 8:14 Comment(0)

I usually customize and modify my cache_dir in three ways:

1. Setting a global environment variable.

export TRANSFORMERS_CACHE=/path/to/my/cache/directory

2. Setting the environment variable in the code

import os

# Set the new cache directory before importing transformers,
# since the library reads the variable at import time
os.environ["TRANSFORMERS_CACHE"] = "/path/to/my/cache/directory"

from transformers import AutoModel

# This directory will now be used as the cache
model = AutoModel.from_pretrained("bert-base-uncased")

This method only affects the currently running Python script.

3. Passing in the cache_dir as a parameter

from transformers import AutoModel, AutoConfig

# Set the cache directory
cache_dir = "/path/to/my/cache/directory"

# Load the model using the specified cache directory
config = AutoConfig.from_pretrained("bert-base-uncased", cache_dir=cache_dir)
model = AutoModel.from_pretrained("bert-base-uncased", config=config, cache_dir=cache_dir)

This method allows you to set the cache directory for specific models or configurations without affecting the global environment variables.

Simmonds answered 27/11, 2023 at 3:58 Comment(0)

What @Cronoik said is right, but it only works for the current shell.

If you want the cache directory to be permanently changed you need to add it to ~/.bashrc (on Linux) or similar scripts on other OS.

  1. Open ~/.bashrc in an editor; nano or Vim works fine.

     nano ~/.bashrc
    

    or

     vim ~/.bashrc
    
  2. Add this

     export HF_HOME="/path/to/dir"
    
  3. Save and close the file and reopen the terminal.

Now you won't have to set it every time. It'll be permanently set to /path/to/dir.
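The steps above can be sketched as follows; a temporary file stands in for ~/.bashrc so the sketch is safe to run without touching your real shell config:

```shell
# Stand-in for ~/.bashrc.
BASHRC=$(mktemp)

# Step 2: append the export line.
echo 'export HF_HOME="/path/to/dir"' >> "$BASHRC"

# Step 3: reopening the terminal re-reads ~/.bashrc,
# simulated here by sourcing the file directly.
. "$BASHRC"
echo "$HF_HOME"
```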

Juxtaposition answered 14/2 at 16:44 Comment(0)

Create a writable directory on your host machine and mount it to /app/cache, a directory inside the container:

mkdir ~/my_cache

docker run -v ~/my_cache:/app/cache -e TRANSFORMERS_CACHE="/app/cache" <image_name>
Granulation answered 3/10, 2023 at 23:1 Comment(0)

Maybe the below will help others. I couldn't get the accepted solution by @cronoik to work, either with the Python code below or by setting the environment variables in Linux. No luck.

# OS variable (doesn't work for me)
import os
os.environ['HF_HOME'] = '/my/folder/of/choice'

So I used symlinks instead. I had issues symlinking the huggingface folder directly, so instead I symlinked the contained folders 'hub' and 'modules', and it works fine now.

NOTE: I first deleted ALL content in the ~/.cache/huggingface folder - that means you will need to re-download all models. I wanted to start "clean", but of course you can alternatively copy the content into the target folder, or temporarily somewhere else, in order not to lose any previously downloaded models.

# from inside the ~/.cache/huggingface folder, open a terminal and run:
ln -s /your/new/path/.../huggingface/hub hub
ln -s /your/new/path/.../huggingface/modules modules

This creates a symlink for each of those two folders. It now works perfectly.

ls -l # check if symlinks are registering
Claudetteclaudia answered 8/7 at 10:16 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.