pythonlinuxshutil

Seeming discrepancy in shutil.disk_usage()


I am using the shutil.disk_usage() function to find the current disk usage of a particular path (amount available, used, etc.). As far as I can find, this is a wrapper around os.statvfs() calls. I'm finding that it is not giving the answers I'd expect, as comparing to the output of "du" in Linux.

I have obscured some of the paths below for company privacy reasons, but the output and code are otherwise undoctored. I am using Python 3.3.2 64-bit version.

#!/apps/python/3.3.2_64bit/bin/python3

# test of shutils.diskusage module
import shutil

BytesPerGB = 1024 * 1024 * 1024

(total, used, free) = shutil.disk_usage("/data/foo/")
print ("Total: %.2fGB" % (float(total)/BytesPerGB))
print ("Used:  %.2fGB" % (float(used)/BytesPerGB))

(total1, used1, free1) = shutil.disk_usage("/data/foo/utils/")
print ("Total: %.2fGB" % (float(total1)/BytesPerGB))
print ("Used:  %.2fGB" % (float(used1)/BytesPerGB))

Which outputs:

/data/foo/drivecode/me % disk_usage_test.py
Total: 609.60GB
Used:  291.58GB
Total: 609.60GB
Used:  291.58GB

As you can see, the main problem is I would expect the second amount for "Used" to be much smaller, as it is a subset of the first directory.

/data/foo/drivecode/me % du -sh /data/foo/utils
2.0G    /data/foo/utils

As much as I trust "du," I find it hard to believe the Python module would be incorrect either. So perhaps it is just my understanding of Linux filesystems that could be the issue. :)

I wrote a module (based heavily on someone's code here at SO) which recursively gets the disk_usage, which I was using until now. It appears to match the "du" output but is MUCH, much slower than the shutil.disk_usage() function, so I'm hoping I can make that one work.

Thanks much in advance.


Solution

  • The problem is that shutil uses the statvfs system call underneath to determine the space used. This system call has no file-path granularity as far as I'm aware, only file-system granularity. What this means is that the path you provide it with only helps to identify the file system you want to query, not the path's.

    In other words, you gave it the path /data/foo/utils and then it determined which file system backs this file path. Then it queried the file system. This becomes apparent when you consider how the used parameter is defined in shutil:

    used = (st.f_blocks - st.f_bfree) * st.f_frsize
    

    Where:

    fsblkcnt_t     f_blocks;   /* size of fs in f_frsize units */
    fsblkcnt_t     f_bfree;    /* # free blocks */
    unsigned long  f_frsize;   /* fragment size */
    

    This is why it's giving you the total space used on the entire file system.

    Indeed, it seems to me like the du command itself also traverses the file structure and adds up the file sizes. Here is GNU coreutils du command's source code.