Seeming discrepancy in shutil.disk_usage()

I am using the shutil.disk_usage() function to find the current disk usage of a particular path (amount available, used, etc.). As far as I can tell, it is a wrapper around os.statvfs(). I'm finding that it does not give the answers I'd expect, compared to the output of "du" on Linux.

I have obscured some of the paths below for company privacy reasons, but the output and code are otherwise undoctored. I am using Python 3.3.2 64-bit version.

#!/apps/python/3.3.2_64bit/bin/python3

# test of the shutil.disk_usage function
import shutil

BytesPerGB = 1024 * 1024 * 1024

(total, used, free) = shutil.disk_usage("/data/foo/")
print ("Total: %.2fGB" % (float(total)/BytesPerGB))
print ("Used:  %.2fGB" % (float(used)/BytesPerGB))

(total1, used1, free1) = shutil.disk_usage("/data/foo/utils/")
print ("Total: %.2fGB" % (float(total1)/BytesPerGB))
print ("Used:  %.2fGB" % (float(used1)/BytesPerGB))

Which outputs:

/data/foo/drivecode/me % disk_usage_test.py
Total: 609.60GB
Used:  291.58GB
Total: 609.60GB
Used:  291.58GB

As you can see, the main problem is that I would expect the second "Used" amount to be much smaller, since it is a subset of the first directory.

/data/foo/drivecode/me % du -sh /data/foo/utils
2.0G    /data/foo/utils

As much as I trust "du," I find it hard to believe the Python module would be incorrect either. So perhaps it is just my understanding of Linux filesystems that could be the issue. :)

I wrote a module (based heavily on someone's code here at SO) which recursively computes the disk usage, and which I was using until now. It appears to match the "du" output but is much, much slower than shutil.disk_usage(), so I'm hoping I can make the latter work.

Thanks much in advance.

Lout answered 7/10, 2013 at 23:43 Comment(1)
The function shutil.disk_usage is giving you "disk" usage, not "directory" usage. What you get from it ought to be compared with df -h rather than du -sh. – Mccaslin

The problem is that shutil uses the statvfs system call underneath to determine the space used. As far as I'm aware, this system call has no file-path granularity, only file-system granularity. This means that the path you provide only identifies the file system you want to query; it does not narrow the query to that path's subtree.

In other words, you gave it the path /data/foo/utils, it determined which file system backs that path, and then it queried that file system as a whole. This becomes apparent when you consider how the used value is defined in shutil:

used = (st.f_blocks - st.f_bfree) * st.f_frsize

Where:

fsblkcnt_t     f_blocks;   /* size of fs in f_frsize units */
fsblkcnt_t     f_bfree;    /* # free blocks */
unsigned long  f_frsize;   /* fragment size */

This is why it's giving you the total space used on the entire file system.
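
You can confirm this with a minimal sketch that calls os.statvfs directly (Unix-only; the paths are the ones from the question). Any two paths on the same file system produce identical numbers:

import os

def fs_usage(path):
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize                 # whole file system
    used = (st.f_blocks - st.f_bfree) * st.f_frsize   # the formula above
    free = st.f_bavail * st.f_frsize                  # what shutil reports as "free"
    return total, used, free

print(fs_usage("/data/foo/"))        # both calls print the same triple,
print(fs_usage("/data/foo/utils/"))  # because both paths share one file system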

Indeed, it appears that the du command itself also traverses the directory structure and adds up the file sizes; see the GNU coreutils source code for du.

Proteinase answered 8/10, 2013 at 0:0 Comment(0)

shutil.disk_usage returns the usage of the whole disk (i.e. of the mount point that backs the path), not the actual file usage under that path. It is the equivalent of running df /path/to/mount, not du /path/to/files. Notice that you got exactly the same usage for both directories.

From the docs: "Return disk usage statistics about the given path as a named tuple with the attributes total, used and free, which are the amount of total, used and free space, in bytes."
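
If you want to see exactly which mount point (and hence which df line) a given path maps to, a small sketch using os.path.ismount can walk up the tree; the path below is the one from the question:

import os

def mount_point(path):
    # Climb toward the root until we hit a mount point; "/" always is one.
    path = os.path.abspath(path)
    while not os.path.ismount(path):
        path = os.path.dirname(path)
    return path

print(mount_point("/data/foo/utils/"))  # the mount disk_usage actually reports on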

Herschelherself answered 8/10, 2013 at 0:8 Comment(2)
Thanks. I guess the question is then, is there a Python function which is more akin to file usage under a path? If there is nothing built-in, I could use os.walk(). This all explains why "df" and the Python equivalent are so fast: maybe the directory structure contains this information already (I'm conjecturing here) and it just reads it out, whereas to get the file usage, I need to crawl the whole path and subdirectories and add up the file sizes. – Lout
Indeed, there's no way to sum the amount of space used by files other than calculating it manually for each file. Take a look at these links; they may help you: #12480867 Also, note that you'll have to round up to the block size to get the same number that du gives you: #4080754 – Herschelherself
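
For completeness, here is a minimal sketch of the os.walk approach these comments describe, rounding each file up to whole blocks as the second comment suggests (an approximation of du; the block size comes from os.statvfs and the function name is illustrative):

import os

def du_bytes(path):
    block = os.statvfs(path).f_frsize  # allocation unit of this file system
    total = 0
    for root, dirs, files in os.walk(path):
        for name in files:
            try:
                size = os.lstat(os.path.join(root, name)).st_size  # don't follow symlinks
            except OSError:
                continue  # unreadable or vanished file: skip it
            total += -(-size // block) * block  # round up to whole blocks
    return total

print(du_bytes("/data/foo/utils"))  # compare with the output of du -s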

Update for anyone stumbling upon this after 2013:

Depending on your Python version and OS, shutil.disk_usage may accept files and directories for the path argument. Here's the breakdown:

Windows:

  • 3.3 - 3.5: only supports mountpoint/filesystem
  • 3.6 - 3.7: directory support
  • 3.8+: file & directory support

Unix:

  • 3.3 - 3.5: only supports mountpoint/filesystem
  • 3.6+: file & directory support
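
A minimal, version-gated sketch based on this answer's table (the file path below is hypothetical; on older interpreters, fall back to passing the containing directory):

import os
import shutil
import sys

path = "/data/foo/utils/report.txt"  # hypothetical file, not a directory

# Per the table above: file paths work on 3.6+ (Unix) / 3.8+ (Windows).
threshold = (3, 8) if sys.platform == "win32" else (3, 6)
if sys.version_info >= threshold:
    print(shutil.disk_usage(path))
else:
    print(shutil.disk_usage(os.path.dirname(path)))  # pass a directory instead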
Digamma answered 21/8, 2020 at 13:31 Comment(1)
This still only uses the file or path passed in to decide the mount point of the file system, and then reports the whole file system's free/total usage data. So no go. – Hillis

Since it was already mentioned above that this function reports disk usage (as its name suggests) and not a specific folder's usage, I put together a very simple and straightforward few lines of code that do calculate the folder size. As it is recursive, with billions of files in a folder it might take a while and use a fair amount of memory.

from pathlib import Path
from itertools import tee


def scandir(p: Path) -> int:
    # Two independent iterators over the same directory listing:
    # one for the files at this level, one for the subdirectories.
    files, dirs = tee(Path(p).iterdir())
    total = sum(x.stat().st_size for x in files if x.is_file())
    total += sum(scandir(x) for x in dirs if x.is_dir())  # recurse
    return total


print(scandir('.'))  # size in bytes of the tree under the given path
Hillis answered 30/5, 2023 at 19:4 Comment(0)

I solved this by calling the Windows dir command to recursively list the folder, then extracting the reported number of bytes used. The result matches what Windows Explorer reports.

import subprocess

# `dir /S` lists the tree recursively; `/-C` drops the thousands separators.
# The byte total is the 7th token from the end of dir's summary output.
get_size = lambda a_dir: int(subprocess.check_output(['dir', a_dir, '/S', '/-C'], shell=True).split()[-7])

It's obviously a Windows-only solution.

Countdown answered 19/7, 2023 at 18:38 Comment(0)
