If you have explicitly selected the fast (Rust-based) tokenisers, you may have done so for a reason. When dealing with large datasets, Rust-based tokenisers process data much faster, and they can be explicitly invoked by setting the "use_fast" option during tokeniser creation. Almost all HF models nowadays come with this option. Though not obvious from the warning message, TOKENIZERS_PARALLELISM is an environment variable, not a tokeniser hyper-parameter. Setting it to false does make the problem go away, but as some of the comments above show, there is confusion about the impact of this change. For example, does it affect parallelism at the model level? Let us look at the Rust code to see what the default behaviour is and what may happen if we turn it off to solve the problem.
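For reference, here is a minimal sketch of how a fast tokeniser is typically requested (the checkpoint name is just an illustrative example):

```python
# Explicitly request the fast (Rust-backed) tokeniser.
# "bert-base-uncased" is only an illustrative checkpoint name.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
print(tokenizer.is_fast)  # True when the Rust implementation was actually loaded
```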
https://docs.rs/tokenizers/latest/src/tokenizers/utils/parallelism.rs.html
In most cases, we (the end users) would not have explicitly set TOKENIZERS_PARALLELISM to true or false. For all such cases, the tokeniser code assumes it to be true. We can explicitly disable it by setting it to false, but you can see that in that case the code makes the iterator serial. Even if you don't set this environment variable to false, the executing code does so itself if it encounters a Python fork later (and this is what causes the warning to be displayed in the first place). Can we avoid this?
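If you do want to pin the behaviour yourself, the variable is just a string in the process environment, so something like the following (set before the tokeniser is first used) works:

```python
import os

# Must be set before the fast tokeniser is first used in this process.
os.environ["TOKENIZERS_PARALLELISM"] = "false"   # or "true"
```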
Let us take a step back and look at the warning itself.
"The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...To disable this warning, you can either: - Avoid using tokenizers
before the fork if possible - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)"
This happens only with HF's fast tokenisers, as these do parallel processing in Rust. In this situation, when we fork processes via multiprocessing in Python, there is a conflict. The fork happens because we start looping over the data loader (with num_workers > 0) in the train() method. This combination is deemed unsafe, and when it is encountered the tokeniser switches off its own parallelism to avoid deadlocks. The parallelism we talk of here refers strictly to the tokeniser code and NOT anything else. In other words, only the parts of the code where we convert the input text into tokens (say, using tokenizer.encode_plus or any of the other encoding functions) are impacted. So this should not affect the parallel worker processes spawned via num_workers, which run on CPU cores and keep feeding the GPU, i.e. the data loader itself. How can we tell? Well, we can just add a 5-second delay in the dataset's __getitem__ along with a print statement, and then see for ourselves by looping over the data loader for different values of num_workers. When num_workers = 0, the main process does the heavy lifting and there is a gap of 5 seconds between fetches. When num_workers = 1, a fork happens, we get the above warning on parallelism, and since the main process does not participate in the data loading, we still get a 5-second gap between fetches. From num_workers = 2 onwards, several fetches arrive within each 5-second interval, depending on the number of workers.
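Here is a rough sketch of that timing experiment, using a toy dataset with no real tokenisation (the names are made up for illustration):

```python
import time
import torch
from torch.utils.data import Dataset, DataLoader

class SlowDataset(Dataset):
    """Toy dataset that takes 5 seconds per item, to make fetches visible."""
    def __len__(self):
        return 6

    def __getitem__(self, idx):
        time.sleep(5)  # simulate an expensive fetch
        print(f"fetched item {idx} at {time.strftime('%H:%M:%S')}")
        return torch.tensor(idx)

if __name__ == "__main__":
    for workers in (0, 1, 2, 4):
        print(f"--- num_workers={workers} ---")
        loader = DataLoader(SlowDataset(), batch_size=1, num_workers=workers)
        for _ in loader:
            pass
```

With num_workers = 0 the prints arrive 5 seconds apart; from 2 workers onwards several arrive within each 5-second window.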
In fact, this leads to the conclusion that a simple way to fix the above warning might be to set num_workers = 0 in the data loader definition. If num_workers is 0, there is no Python fork and the main process itself does all the data loading. This works, and we can then leverage the power of fast tokenisers to the hilt, but at the cost of eliminating parallel data loading on the Python side. Considering that data loaders work best in parallel mode, prefetching batches on CPU workers while the GPU executes, this is usually NOT a good option.
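For completeness, that single-process variant is just the following (my_dataset stands in for your own dataset object):

```python
# No worker processes, so no fork and no warning -- but no parallel prefetching either.
loader = DataLoader(my_dataset, batch_size=32, num_workers=0)
```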
What happens if we set TOKENIZERS_PARALLELISM=true? On recent versions of PyTorch, transformers, tokenizers etc., if you do this and then try training with num_workers > 0 in the data loader, your training will freeze without any error or even a warning message. In fact, this issue motivated me to post this answer, as I couldn't find a solution to that training-freeze problem anywhere. The root cause is the data loader failing in this situation due to the above conflict: the forked worker processes run into the very deadlock that the warning is trying to avoid.
So, going back to our core issue, it seems that the Rust-based parallelism is in conflict with the forks that we do in Python. However, this may be easily solved by simply removing all use of the tokenisers prior to the training call (i.e. prior to the data loader being used). Often we use the tokenisers before training just to inspect the tokenised output, for example by doing my_dataset_name[0]. Just remove all such tokeniser calls and let the train() loop be the first place the tokenisers get accessed. This simple fix makes the Rust parallelisation start only after the Python fork, and it should work.
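A minimal sketch of this pattern, with made-up names, might look like this; nothing calls the tokeniser until the data loader workers start iterating, so the Rust parallelism is first exercised after the fork:

```python
from torch.utils.data import Dataset
from transformers import AutoTokenizer

class LazyTextDataset(Dataset):
    """Tokenises lazily, inside __getitem__, i.e. inside the forked workers."""
    def __init__(self, texts, tokenizer):
        self.texts = texts
        self.tokenizer = tokenizer  # stored here, but not called yet

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # First tokeniser use happens here, after the data loader has forked.
        return self.tokenizer(self.texts[idx], truncation=True)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
dataset = LazyTextDataset(["hello world", "another sentence"], tokenizer)
# Do NOT "peek" with dataset[0] or tokenizer("...") here -- that would use the
# Rust parallelism before the fork and bring the warning back.
```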
Alternatively, convert your data to tokens beforehand and store them in a dict. Then your dataset should not use the tokeniser at all; at runtime it simply looks items up by index. This way you avoid the conflict. The warning still appears, but you simply don't use the tokeniser during training any more. (Note: in such scenarios, to save space, avoid padding during tokenisation and add it later in the collate_fn, as in the sketch below.)
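Here is a sketch of that variant (the data and names are illustrative): everything is tokenised once up front, without padding, and the collate_fn pads each batch to its own longest item:

```python
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

texts = ["a short sentence", "a somewhat longer example sentence"]
# Tokenise everything once, with no padding; the result behaves like a dict of lists.
encodings = tokenizer(texts, truncation=True)  # {'input_ids': [...], 'attention_mask': [...]}

class PreTokenisedDataset(Dataset):
    """No tokeniser calls at runtime -- just dict lookups by index."""
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        return {key: values[idx] for key, values in self.encodings.items()}

def collate(batch):
    # tokenizer.pad pads a list of feature dicts to the longest item in the batch
    return tokenizer.pad(batch, padding=True, return_tensors="pt")

loader = DataLoader(PreTokenisedDataset(encodings), batch_size=2, collate_fn=collate)
```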
Having said all that, the Rust tokeniser implementation is so insanely fast that it usually does not matter even if the serialised path is taken inside the tokeniser, i.e. if parallelism is automatically disabled in the tokeniser. It still beats the conventional Python-based tokeniser.
So in most cases one can just ignore the warning and let the tokeniser parallelisation be disabled during execution, or explicitly set TOKENIZERS_PARALLELISM to false right from the beginning. In rare cases where speed is of utmost importance, one of the options suggested above can be explored.