How to efficiently run multiple PyTorch processes / models at once? Traceback: The paging file is too small for this operation to complete

Asked 14/11, 2020 at 18:42 Answered 26/10, 2022 at 23:10

Solved python pytorch python-multiprocessing

Background

I have a very small network which I want to test with different random seeds. The network barely uses 1% of my GPU's compute power, so I could in theory run 50 processes at once to try many different seeds.

Problem

Unfortunately I can't even import PyTorch in multiple processes. When the number of processes exceeds 4, I get a traceback about the paging file being too small.

Minimal reproducible code§ - dispatcher.py

from subprocess import Popen
import sys

procs = []
for seed in range(50):
    procs.append(Popen([sys.executable, "ml_model.py", str(seed)]))

for proc in procs:
    proc.wait()

§I increased the number of seeds so people with better machines can also reproduce this.

Minimal reproducible code - ml_model.py

import torch
import time
time.sleep(10)

 
Traceback (most recent call last):
  File "ml_model.py", line 1, in <module>
    import torch
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\__init__.py", line 117, in <module>
    raise err
OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll" or one of its dependencies.

Further Investigation

I noticed that each process loads a lot of DLLs into RAM. When I close all other programs that use a lot of RAM, I can get up to 10 processes instead of 4. So it seems like a resource constraint.

Questions

Is there a workaround?

What's the recommended way to train many small networks with PyTorch on a single GPU?

Should I write my own CUDA kernel instead, or use a different framework to achieve this?

My goal would be to run around 50 processes at once (on a machine with 16GB RAM and 8GB GPU RAM).

Jibheaded answered 14/11, 2020 at 18:42 Comment(4)

Hi. Could you please list the files in the folder "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\lib\"? If cudnn_cnn_infer64_8.dll isn't there, you might have an issue with your PyTorch install with GPU support. – Corell 28/11, 2020 at 1:19

Here is an open ticket on a similar issue, with no solution: github.com/Spandan-Madan/Pytorch_fine_tuning_Tutorial/issues/10 – Vonnievonny 28/11, 2020 at 1:50

@IlyesKAANICH the file is there, along with many other DLLs. – Jibheaded 29/11, 2020 at 3:5

the number of processes must be less than the number of logical processors in your CPU – Corell 30/11, 2020 at 10:45

I've looked a bit into this tonight. I don't have a solution (edit: I have a mitigation, see the edit at end), but I have a bit more information.

It seems the issue is caused by NVIDIA fatbins (.nv_fatb) being loaded into memory. Several DLLs, such as cusolver64_xx.dll, torch_cuda_cu.dll, and a few others, have .nv_fatb sections in them. These contain tons of different variations of CUDA code for different GPUs, so it ends up being several hundred megabytes to a couple of gigabytes.

When Python imports 'torch' it loads these DLLs and maps the .nv_fatb section into memory. For some reason, instead of just being a memory-mapped file, it actually takes up memory. The section is set as 'copy on write', so it's possible something writes into it? I don't know. But anyway, if you look at Python using VMMap ( https://learn.microsoft.com/en-us/sysinternals/downloads/vmmap ) you can see that these DLLs are committing huge amounts of memory for this .nv_fatb section. The frustrating part is that it doesn't seem to be using the memory. For example, right now my Python.exe has 2.7GB committed, but the working set is only 148MB.

Every Python process that loads these DLLs will commit several GB of memory loading these DLLs. So if 1 Python process is wasting 2GB of memory, and you try running 8 workers, you need 16GB of memory to spare just to load the DLLs. It really doesn't seem like this memory is used, just committed.
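The arithmetic above can be written out as a quick budget check. This is only a sketch: `required_commit_gb` and `max_processes` are hypothetical helper names, the 2GB per-process figure is the rough number from this answer rather than a measured constant, and the second example's machine figures are assumed for illustration. On Windows the commit limit is roughly physical RAM plus the current paging file size.

```python
def required_commit_gb(num_processes: int, per_process_commit_gb: float) -> float:
    """Commit charge needed just to load the DLLs in every process."""
    return num_processes * per_process_commit_gb


def max_processes(commit_limit_gb: float, per_process_commit_gb: float,
                  already_committed_gb: float = 0.0) -> int:
    """How many processes fit before the commit limit is reached."""
    if per_process_commit_gb <= 0:
        raise ValueError("per_process_commit_gb must be positive")
    free_gb = commit_limit_gb - already_committed_gb
    if free_gb <= 0:
        return 0
    return int(free_gb // per_process_commit_gb)


# Figures from the answer: 8 workers at roughly 2GB of commit each.
print(required_commit_gb(8, 2.0))  # 16.0

# Assumed example: 16GB RAM + 8GB paging file, 6GB already committed.
print(max_processes(24.0, 2.0, already_committed_gb=6.0))  # 9
```

Plugging in a measured per-process commit figure (for example, the one VMMap reports) gives a rough upper bound on how many processes can import torch before the paging file error appears.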

I don't know enough about these fatbinaries to try to fix it, but from looking at this for the past 2 hours it really seems like they are the issue. Perhaps it's an NVIDIA problem that these are committing memory?

edit: I made this python script: https://gist.github.com/cobryan05/7d1fe28dd370e110a372c4d268dcb2e5

Get it and install its pefile dependency ( python -m pip install pefile ).

Run it on your torch and CUDA DLLs. In OP's case, the command line might look like:

python fixNvPe.py --input=C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\lib\*.dll

(You also want to run this wherever your cusolver64_*.dll and friends are. This may be in your torch\lib folder, or it may be, e.g., C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\vXX.X\bin. If it is under Program Files, you will need to run the script with administrative privileges.)

What this script is going to do is scan through all DLLs specified by the input glob, and if it finds an .nv_fatb section it will back up the DLL, disable ASLR, and mark the .nv_fatb section read-only.

ASLR is 'address space layout randomization.' It is a security feature that randomizes where a DLL is loaded in memory. We disable it for this DLL so that all Python processes will load the DLL into the same base virtual address. If all Python processes using the DLL load it at the same base address, they can all share the DLL. Otherwise each process needs its own copy.

Marking the section 'read-only' lets Windows know that the contents will not change in memory. If you map a file into memory read/write, Windows has to commit enough memory, backed by the pagefile, just in case you make a modification to it. If the section is read-only, there is no need to back it in the pagefile. We know there are no modifications to it, so it can always be found in the DLL.
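Both changes come down to clearing a single bit in the DLL's headers. The following is a minimal sketch of just that bit arithmetic, not the code from the gist: the two constants are the standard PE format values, while `clear_flag`, `is_set`, and the sample header values are made up for illustration.

```python
# Constants defined by the PE/COFF format.
IMAGE_SCN_MEM_WRITE = 0x80000000                # section flag: writable
IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE = 0x0040  # DLL flag: may be relocated (ASLR)


def clear_flag(value: int, flag: int) -> int:
    """Return value with the given flag bit(s) switched off."""
    return value & ~flag


def is_set(value: int, flag: int) -> bool:
    """True if every bit of flag is set in value."""
    return (value & flag) == flag


# Hypothetical header values, chosen only for illustration.
section_characteristics = 0xC0000040  # initialized data | readable | writable
dll_characteristics = 0x0160          # high-entropy VA | dynamic base | NX compatible

# Mark the section read-only and opt the DLL out of ASLR.
section_characteristics = clear_flag(section_characteristics, IMAGE_SCN_MEM_WRITE)
dll_characteristics = clear_flag(dll_characteristics,
                                 IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE)

print(hex(section_characteristics))  # 0x40000040
print(hex(dll_characteristics))      # 0x120
```

A real tool additionally has to locate the .nv_fatb section header inside the file and write the modified values back, which is what the pefile dependency is used for.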

The theory behind the script is that by changing these 2 flags, less memory will be committed for the .nv_fatb section, and more memory will be shared between the Python processes. In practice, it works. Not quite as well as I'd hope (it still commits a lot more than it uses), so my understanding may be flawed, but it significantly decreases memory commit.

In my limited testing I haven't run into any issues, but I can't guarantee there are no code paths that attempt to write to the section we marked 'read only.' If you start running into issues, though, you can just restore the backups.

edit 2022-01-20: Per NVIDIA: "We have gone ahead and marked the nv_fatb section as read-only, this change will be targeting next major CUDA release 11.7. We are not changing the ASLR, as that is considered a safety feature."

This should certainly help. If the read-only change is not enough without ASLR being disabled as well, then the script should still work.

Eartha answered 8/10, 2021 at 0:30 Comment(10)

This is amazing - thank you. It reduced memory usage significantly. – Selina 3/11, 2021 at 20:22

Confirmed - while trying to run a logistic regression, I thought I had data corruption - turned out this was the exact issue. After the pip install, script ran, and Jupyter was happy. – Dissimilarity 16/11, 2021 at 23:8

Just to clarify, by memory, you mean CPU memory right? – Assumptive 22/11, 2021 at 11:30

Yes, this is all CPU memory. – Hamite 22/11, 2021 at 18:54

Thx for this answer I set it to accepted now since it's the best of all the answers and improves memory usage. And thx for keeping it up to date with the latest information from NVIDIA. – Jibheaded 21/1, 2022 at 15:56

As an additional bonus, after applying this fix my Pytorch training workload seemed to run about 20% faster :) – Chrysoprase 28/1, 2022 at 6:25

Has anybody tried that trick on Linux? (It should work with .so files too?) – Overdress 4/6, 2022 at 22:3

Thank you so much! I am able to double my batch size AND triple my num_workers. This means I was only able to utilize about 1/6 of my memory before! – Schwaben 25/7, 2022 at 7:3

Could you clarify that last comment from NVIDIA? I'm a little confused because I think CUDA is not the same as CUDNN. I downloaded CUDA 11.8, and could not find any of those cudnn_* dll's in the install folder. I did find cudart64_110.dll which is also used by the pytorch libraries, but replacing this alone probably won't fix the errors caused by the cudnn_ dll's, right? So we need to download CUDNN separately but the question is when nvidia says they "fixed" this issue in "CUDA 11.7", what is the corresponding CUDNN version that it's fixed in? – Melinamelinda 23/10, 2022 at 21:0

You shouldn't have to download any new DLLs for this fix. If you are hitting the 'paging file too small' error, then run the script on whatever DLLs you are using. The script will only modify files that have an 'nvfatbin' section, and all that it does is flip a couple flags in the DLL header. I couldn't tell you if NVidia has fixed this, or what version, as I haven't tried anything new. But, if you are hitting the error, then just run the script on any DLL you think that pytorch is pulling in. It really can't hurt anything. – Hamite 24/10, 2022 at 23:57

I have changed 'num_workers = 10' to 'num_workers = 1'. It helped me to solve the problem.

Talbott answered 9/7, 2022 at 9:45 Comment(1)

Slows down terribly, but at least it runs. – Pencil 5/3, 2023 at 11:54

In my case the paging file was already set to system managed size, yet I got the same error, because I pass a large variable to multiple processes within a function. I would probably need to set a very large paging file, since Windows cannot grow it on the fly, but I opted to reduce the number of processes instead, as the function is not one that is used all the time.

If you are on Windows, it may be better to use one (or more) fewer cores than the total number of physical cores. The multiprocessing module in Python on Windows tends to grab as much as it can if you use all of them, and it actually tries to use all logical cores.

import multiprocessing
multiprocessing.cpu_count()
12
# I actually have 6 physical cores; if you use this as the base it will likely hog the system


import psutil
psutil.cpu_count(logical=False)
6  # actual number of physical cores

psutil.cpu_count(logical=True)
12  # logical cores (e.g. hyperthreading)

Please refer to here for more detail: Multiprocessing: use only the physical cores?
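As a rough sketch of the "one core fewer" rule, the helper below picks a worker count from a physical core count. `safe_worker_count` is a hypothetical name, and the fallback of halving the logical count assumes two hardware threads per core; `psutil.cpu_count(logical=False)` gives the real figure where psutil is installed.

```python
import os


def safe_worker_count(physical_cores=None, reserve=1):
    """Leave `reserve` cores free, but always return at least one worker."""
    if physical_cores is None:
        logical = os.cpu_count() or 1
        # Assumption: two hardware threads per physical core.
        physical_cores = max(1, logical // 2)
    return max(1, physical_cores - reserve)


print(safe_worker_count(6))  # 5
```

The result can be passed as the pool size, for example as the `processes` argument of `multiprocessing.Pool`.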

Actual answered 26/3, 2021 at 10:3 Comment(0)

Well, I managed to resolve this. Open "Advanced system settings". Go to the Advanced tab, then click the Settings button under Performance. Again click the Advanced tab --> Change --> unselect 'automatically......'. For all the drives, set 'System managed size'. Restart your PC.

Stereopticon answered 20/2, 2021 at 20:4 Comment(0)

Following up on @chris-obryan's answer (I would comment but have no reputation), I've found that memory utilisation drops pretty sharply some time into training with their fix applied (on the order of roughly the mentioned 2GB per process).

To eke out some more performance it may be worth monitoring memory utilisation and spawning a new instance of the model when these drops in memory occur, leaving enough space (~3 or 4 GB to be safe) for a bit of overhead.
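A minimal sketch of that idea, under stated assumptions: `can_spawn` and `spawn_when_memory_allows` are hypothetical names, the 2GB per-process and 4GB headroom defaults are the rough figures from this thread, and the memory reading and launcher are passed in as callables so the logic does not depend on any particular library.

```python
import time


def can_spawn(available_gb, per_process_gb=2.0, headroom_gb=4.0):
    """True if one more process fits while keeping the headroom free."""
    return available_gb - per_process_gb >= headroom_gb


def spawn_when_memory_allows(jobs, launch, get_available_gb,
                             per_process_gb=2.0, headroom_gb=4.0,
                             poll_seconds=5.0, sleep=time.sleep):
    """Start each job only once enough memory is free; return the handles."""
    handles = []
    for job in jobs:
        # Wait until running instances have released enough memory.
        while not can_spawn(get_available_gb(), per_process_gb, headroom_gb):
            sleep(poll_seconds)
        handles.append(launch(job))
    return handles
```

In practice `launch` could wrap `subprocess.Popen`, and `get_available_gb` could return `psutil.virtual_memory().available / 2**30`. Note that on Windows the paging file error is about commit charge rather than free physical memory, so available RAM is only an approximation of the real limit.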

I was seeing ~28GB of RAM utilised during the setup phase, which dropped to about 14GB after iterating for a while.

(Note that my use case is a little different here as I'm bottlenecked by host<->device transfers due to optimising with a GA, as a reasonable amount of CPU-bound processing needs to occur after each generation, so this could play into it. I am also using concurrent.futures.ProcessPoolExecutor() rather than manually using subprocesses.)

Busk answered 12/12, 2021 at 18:46 Comment(0)

To fix this problem, I updated CUDA to version 11.8.0 and updated PyTorch (1.9.1) to the cudatoolkit 11.6 build. Using conda:

conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge

Thanks to @chris-obryan I understood the problem and thought an update might already be available. I measured the memory consumption before and after the updates, and it dropped sharply.

Auricula answered 26/10, 2022 at 23:10 Comment(1)

Thanks! Confirming that this works on linux, RAM usage per process goes from ~2.5 GB to ~1.1GB. Cuda 11.7, Pytorch 11.13. Note that I had to add "-c nvidia" for the command to install the GPU version of pytorch: "conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c nvidia -c conda-forge" – Watkins 4/11, 2022 at 13:4

Since it seems that each import torch loads a bunch of fat DLLs (thanks @chris-obryan), I tried changing this:

import torch

if __name__ == "__main__":
  # multiprocessing stuff, paging file errors

to this...

if __name__ == "__main__":
  import torch
  # multiprocessing stuff

And it worked well (because when the subprocesses are created __name__ is not "__main__").

Not an elegant solution, but perhaps useful to someone.

Ehrenberg answered 16/9, 2022 at 7:9 Comment(0)
