Disk usage of conda environments appears greater than it really is when checking directory properties? [duplicate]

First of all, I'm new to virtual environments and I don't come from a software background (also, English is not my first language, so please bear with me). As far as I understand, conda environments are optimized to avoid duplicating packages on disk by using links instead. But when I check the used disk space (right click -> Properties, on Linux Mint), it appears really high: over 2 GB (the env has python, numpy and pandas).

Can anybody explain how this works, or point me in the right direction?

Janycejanyte answered 16/6, 2020 at 9:33 Comment(2)
BTW, your English seems perfect as far as I can tell. - Groundsheet
Possible duplicate: Why are packages installed rather than just linked to a specific environment? - Bali

2 GB seems too high for that list of packages. I just did a test. On Linux, such an environment occupies 1.2 GB. On Mac, it requires only 271 MB. (I'm not completely sure what accounts for the difference between the two, but it might be related to the different file systems.)

Are you checking the size of a single environment, or are you checking the size of the entire anaconda directory tree?

Regarding disk-saving tricks in conda: You're right, conda uses hard links (when possible) to avoid duplicating files on disk. This saves disk space, since otherwise the same file would be duplicated across multiple environments and in the conda package cache (pkgs). Unfortunately, conda can't hard-link every file: for example, files that embed the environment's install path have to be rewritten for each environment, and hard links can't span file systems. Those files must be copied instead.
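
To make the link-versus-copy distinction concrete, here is a minimal sketch that doesn't need conda at all. It works in a throwaway directory, and the file names (original, linked, copied) are just placeholders I made up. It assumes your shell's test command supports -ef ("same file"), which bash and dash do.

```shell
# Work in a temporary directory so nothing real is touched.
demo_dir=$(mktemp -d)
echo "payload" > "$demo_dir/original"

# A hard link is a second name for the same data on disk.
ln "$demo_dir/original" "$demo_dir/linked"
# A copy is an independent file that occupies its own space.
cp "$demo_dir/original" "$demo_dir/copied"

# -ef is true when both paths refer to the same device and inode.
if [ "$demo_dir/original" -ef "$demo_dir/linked" ]; then
    linked_is_same=yes
else
    linked_is_same=no
fi
if [ "$demo_dir/original" -ef "$demo_dir/copied" ]; then
    copied_is_same=yes
else
    copied_is_same=no
fi

echo "linked shares storage with original: $linked_is_same"
echo "copied shares storage with original: $copied_is_same"

rm -rf "$demo_dir"
```

The same -ef check could be pointed at a file inside an environment and its counterpart in the pkgs cache to see whether conda linked or copied it.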

The du tool can tell you how much disk space is occupied by a particular directory (or list of directories). It is aware of hard links, so it avoids double-counting a file's size if the same file appears twice due to a hard link. (I don't know if the "properties" menu item in Linux Mint behaves the same way.)

For example, I'll create two identical conda environments and check their disk usage independently:

$ conda create -n test-1 -y python numpy pandas
$ conda create -n test-2 -y python numpy pandas

$ du -h -s $(conda info --base)/envs/test-1
1.2G    /opt/miniconda/envs/test-1

$ du -h -s $(conda info --base)/envs/test-2
1.2G    /opt/miniconda/envs/test-2

But if I ask du to consider them at the same time, it will notice that some of those files in test-2 were already seen in test-1, so it won't count their sizes again:

$ du -h -s $(conda info --base)/envs/test-?
1.2G    /opt/miniconda/envs/test-1
268M    /opt/miniconda/envs/test-2
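
If you'd like to reproduce that effect without creating any conda environments, here is a small sketch under a few assumptions: the directory names env-a and env-b and the file big.bin are invented stand-ins, and the file is filled with random bytes only so that a compressing file system can't shrink it.

```shell
# Everything lives in a throwaway directory.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/env-a" "$demo_dir/env-b"

# Create a 5 MiB file in env-a, then hard-link (not copy) it into env-b.
dd if=/dev/urandom of="$demo_dir/env-a/big.bin" bs=1024 count=5120 2>/dev/null
ln "$demo_dir/env-a/big.bin" "$demo_dir/env-b/big.bin"

# Measured in separate du runs, each directory appears to hold the full file.
size_a=$(du -sk "$demo_dir/env-a" | cut -f1)
size_b=$(du -sk "$demo_dir/env-b" | cut -f1)

# Measured in a single du run, the shared file is counted only once.
size_both=$(du -sk "$demo_dir" | cut -f1)

echo "env-a alone:   ${size_a}K"
echo "env-b alone:   ${size_b}K"
echo "both together: ${size_both}K"

rm -rf "$demo_dir"
```

The combined figure should come out close to the size of one directory rather than the sum of the two, which is the same thing du reported for test-1 and test-2.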

If you're curious to see which files are hard-linked, look at the output of ls -l:

$ ls -l $(conda info --base)/envs/test-1/lib/libz.so.1.2.11
-rwxrwxr-x 15 bergs flyem 109272 Sep  9  2019 /opt/miniconda/envs/test-1/lib/libz.so.1.2.11
           ^
            `-- This file has 15 different names,
                i.e. it can be found in 15 different places on disk,
                due to hard-links.

$ ls -l $(conda info --base)/envs/test-1/lib/libpython3.8.so.1.0
-rwxrwxr-x 1 bergs flyemdev 14786920 Jun 16 12:44 /opt/miniconda/envs/test-1/lib/libpython3.8.so.1.0
           ^
            `-- This file has only 1 name on disk,
                i.e. there are no other hard-links to this file.
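
Checking files one at a time with ls -l gets tedious, so here is a rough sketch that summarizes a whole directory tree using find's -links test. The function name count_links is my own invention, and the demo runs on a temporary directory with made-up file names rather than a real environment. File names containing newlines would throw off the count.

```shell
# count_links DIR: report how many regular files under DIR have more than
# one name on disk (link count > 1) and how many have exactly one.
count_links() {
    shared=$(find "$1" -type f -links +1 | wc -l | tr -d ' ')
    single=$(find "$1" -type f -links 1 | wc -l | tr -d ' ')
    echo "shared=$shared single=$single"
}

# Demo: two names for one hard-linked file, plus one standalone file.
demo_dir=$(mktemp -d)
echo "stand-in for a linked library" > "$demo_dir/shared.txt"
ln "$demo_dir/shared.txt" "$demo_dir/shared-link.txt"
echo "stand-in for a copied library" > "$demo_dir/single.txt"

result=$(count_links "$demo_dir")
echo "$result"

rm -rf "$demo_dir"
```

On a real environment you would call it as count_links "$(conda info --base)/envs/test-1". The "single" files are the ones that take up extra space in every environment.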
Groundsheet answered 16/6, 2020 at 17:16 Comment(0)

If you are concerned about using up disk space, you can run this command to clean up the temporary packages, zip files, etc. that conda used to set up your environment.

conda clean --all

These files remain and can clutter up your disk over time.

I use this regularly and gain back more than a few gigabytes each time.
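
If you want to see what would be removed before committing to it, something like the following might help. It is only a sketch: it assumes the package cache is the pkgs directory under the base install (some setups keep it elsewhere, and conda info lists the actual locations), and the variable names are my own.

```shell
if command -v conda >/dev/null 2>&1; then
    # Assumption: the cache lives in <base>/pkgs; check "conda info" if unsure.
    pkgs_dir="$(conda info --base)/pkgs"

    # Show the current size of the cache (skipped if the directory is missing).
    [ -d "$pkgs_dir" ] && du -sh "$pkgs_dir"

    # List what would be deleted, without deleting anything.
    conda clean --all --dry-run || echo "dry run failed"
    conda_found=yes
else
    echo "conda is not on PATH; nothing to inspect"
    conda_found=no
fi
```

Once the dry run looks reasonable, running conda clean --all without --dry-run performs the actual cleanup.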

Taste answered 17/6, 2020 at 4:34 Comment(0)
