I've looked a bit into this tonight. I don't have a solution (edit: I have a mitigation, see the edit at end), but I have a bit more information.
It seems the issue is caused by NVIDIA fatbins (.nv_fatb) being loaded into memory. Several DLLs, such as cusolver64_xx.dll, torch_cuda_cu.dll, and a few others, have .nv_fatb sections in them. These contain tons of different variations of CUDA code for different GPUs, so each section ends up being several hundred megabytes to a couple of gigabytes.
When Python imports 'torch' it loads these DLLs and maps the .nv_fatb section into memory. For some reason, instead of just being a memory-mapped file, it is actually taking up memory. The section is marked 'copy on write', so it's possible something writes into it? I don't know. But anyway, if you look at Python using VMMap ( https://learn.microsoft.com/en-us/sysinternals/downloads/vmmap ) you can see that these DLLs commit huge amounts of memory for this .nv_fatb section. The frustrating part is it doesn't seem to be using the memory. For example, right now my Python.exe has 2.7GB committed, but the working set is only 148MB.
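If you want to see for yourself which DLLs carry an .nv_fatb section and how big it is, here is a rough pefile sketch (pefile is the same dependency the fix script below uses; the glob path is the one from OP's setup, so adjust it to your own install):

    import glob
    import pefile

    IMAGE_SCN_MEM_WRITE = 0x80000000  # PE section 'writable' flag; this is what makes the mapping copy-on-write

    for path in glob.glob(r"C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\lib\*.dll"):
        pe = pefile.PE(path, fast_load=True)
        for section in pe.sections:
            if section.Name.rstrip(b"\x00") == b".nv_fatb":
                size_mb = section.Misc_VirtualSize / (1024 * 1024)
                writable = bool(section.Characteristics & IMAGE_SCN_MEM_WRITE)
                print(f"{path}: .nv_fatb {size_mb:.1f} MB, writable={writable}")
        pe.close()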
Every Python process that loads these DLLs commits several GB of memory just doing so. So if one Python process is wasting 2GB of memory, and you try running 8 workers, you need 16GB of memory to spare just to load the DLLs. It really doesn't seem like this memory is used, just committed.
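If you want to watch this from inside Python rather than VMMap, a quick sketch using psutil (psutil is an extra dependency I'm assuming here, not something torch installs) shows the same commit-vs-working-set gap:

    import psutil
    import torch  # importing torch loads the CUDA DLLs and maps their .nv_fatb sections

    mem = psutil.Process().memory_info()
    print(f"working set: {mem.rss / 2**20:.0f} MB")  # memory actually resident
    print(f"committed:   {mem.vms / 2**20:.0f} MB")  # on Windows psutil reports the pagefile-backed commit charge here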
I don't know enough about these fatbinaries to try to fix it, but from looking at this for the past 2 hours it really seems like they are the issue. Perhaps it's an NVIDIA problem that these sections commit memory?
edit: I made this python script: https://gist.github.com/cobryan05/7d1fe28dd370e110a372c4d268dcb2e5
Get it and install its pefile dependency ( python -m pip install pefile ).
Run it on your torch and CUDA DLLs. In OP's case, the command line might look like:
python fixNvPe.py --input=C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\lib\*.dll
(You also want to run this wherever your cusolver64_*.dll and friends are. This may be in your torch\lib folder, or it may be, e.g., C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\vXX.X\bin. If it is under Program Files, you will need to run the script with administrative privileges.)
What this script is going to do is scan through all DLLs specified by the input glob, and if it finds an .nv_fatb section it will back up the DLL, disable ASLR, and mark the .nv_fatb section read-only.
ASLR is 'address space layout randomization.' It is a security feature that randomizes where a DLL is loaded in memory. We disable it for this DLL so that all Python processes will load the DLL into the same base virtual address. If all Python processes using the DLL load it at the same base address, they can all share the DLL. Otherwise each process needs its own copy.
Marking the section 'read-only' lets Windows know that the contents will not change in memory. If you map a file into memory read/write, Windows has to commit enough memory, backed by the pagefile, just in case you make a modification to it. If the section is read-only, there is no need to back it with the pagefile: since it will never be modified, Windows can always page it back in from the DLL itself.
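The core of the script is just flipping those two flags with pefile. A stripped-down sketch of the idea (the actual gist also backs up the DLL first, so use that for real runs):

    import pefile

    IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE = 0x0040  # ASLR opt-in flag in the PE optional header
    IMAGE_SCN_MEM_WRITE = 0x80000000                # 'writable' flag on a PE section

    def patch_dll(path):
        # Read the whole file into memory so the patched copy can be written back to the same path.
        with open(path, "rb") as f:
            pe = pefile.PE(data=f.read())
        for section in pe.sections:
            if section.Name.rstrip(b"\x00") == b".nv_fatb":
                # Disable ASLR so every Python process maps this DLL at the same base address...
                pe.OPTIONAL_HEADER.DllCharacteristics &= ~IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE
                # ...and mark .nv_fatb read-only so Windows doesn't commit pagefile space for it.
                section.Characteristics &= ~IMAGE_SCN_MEM_WRITE
                pe.write(path)
                break
        pe.close()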
The theory behind the script is that by changing these two flags, less memory will be committed for the .nv_fatb section and more memory will be shared between the Python processes. In practice it works. Not quite as well as I'd hoped (it still commits a lot more than it uses), so my understanding may be flawed, but it significantly decreases memory commit.
In my limited testing I haven't run into any issues, but I can't guarantee there are no code paths that attempt to write to the section we marked 'read-only.' If you start running into issues, though, you can just restore the backups.
edit 2022-01-20:
Per NVIDIA: "We have gone ahead and marked the nv_fatb section as read-only, this change will be targeting next major CUDA release 11.7. We are not changing the ASLR, as that is considered a safety feature."
This should certainly help. If it's not enough without disabling ASLR as well, then the script should still work.