python · unix · disk-space · overhead

Does invoking a system call like statfs with Python subprocess use less overhead than invoking a C utility like df?


Unix-based system. I'm trying to keep overhead as low as possible in the code I'm working on (it runs in a resource-constrained environment). In this particular code, we gather some basic disk-usage stats. One suggestion was to replace a call to df with statfs, since df is a C utility that requires its own subprocess to run, whereas statfs is a system call, which presumably has less overhead (and is what df calls anyway).

We're calling df with Python's subprocess.check_output() function:

import subprocess

DF_CMD = ["df", "-P", "-k"]


def get_disk_usage() -> str:
    try:
        output = subprocess.check_output(DF_CMD, text=True)
    except subprocess.CalledProcessError as e:
        raise RuntimeError(f"Failed to execute {DF_CMD}: {e}") from e

    return output

I want to hard-code our mount points (which we've decided we're okay with) and replace the call to df with a direct statfs call on each mount point in the code above. However, I'm unsure whether making that call from the same Python process will actually reduce overhead. I plan to check with a profiler, but I'm curious whether anyone knows enough about the inner workings of Python/Unix to say what's going on under the hood.

And to be clear: by "overhead" I mean CPU and memory usage on the OS/machine.


Solution

  • However, I'm unsure if calling with the same Python function will actually reduce overhead

    Spawning a new process (a fork plus an execve) is a comparatively expensive operation. It's one of the reasons shell scripts are slow: almost every piece of shell functionality is a separate process, and the shell also spawns subshells in many contexts. That said, modern computers are orders of magnitude faster than they used to be, so the cost of spawning a single process is usually negligible; systems routinely run thousands of processes.
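    A quick way to see the difference for yourself is to time both approaches. This is only a rough sketch, not a rigorous benchmark, and the absolute numbers will vary by machine; it assumes df is on the PATH and / is a valid mount point:

    ```python
    import os
    import subprocess
    import time

    N = 50  # repetitions to average out noise

    # Approach 1: spawn a df subprocess each time.
    start = time.perf_counter()
    for _ in range(N):
        subprocess.check_output(["df", "-P", "-k"], text=True)
    df_time = time.perf_counter() - start

    # Approach 2: one statvfs syscall, no subprocess.
    start = time.perf_counter()
    for _ in range(N):
        os.statvfs("/")
    statvfs_time = time.perf_counter() - start

    print(f"df subprocess: {df_time / N * 1e3:.3f} ms/call")
    print(f"os.statvfs:    {statvfs_time / N * 1e3:.3f} ms/call")
    ```

    The syscall path typically comes out several orders of magnitude cheaper per call, since it avoids the fork/execve entirely.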

    Yes, replacing the subprocess call with os.statvfs will reduce overhead. Unless you are working on a really resource-constrained device (say, 64 MB of memory), the savings are usually not worth much effort, but the change is still worthwhile: it makes the code self-contained and cleaner, and it reduces the number of possible failure modes. Python itself is fairly memory-hungry, so the fact that you are running Python at all suggests you have more than enough resources to spawn a single subprocess.
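    A minimal sketch of the replacement, for illustration. Note that Python exposes the POSIX statvfs interface rather than statfs directly; the mount points listed here are hypothetical placeholders for your hard-coded ones:

    ```python
    import os

    # Hypothetical hard-coded mount points; substitute your own.
    MOUNT_POINTS = ["/", "/tmp"]


    def get_disk_usage() -> dict[str, dict[str, int]]:
        """Return per-mount usage in KiB via the statvfs syscall,
        with no subprocess spawned."""
        usage = {}
        for mount in MOUNT_POINTS:
            st = os.statvfs(mount)
            total = st.f_blocks * st.f_frsize
            free = st.f_bfree * st.f_frsize
            avail = st.f_bavail * st.f_frsize  # space usable by unprivileged users
            usage[mount] = {
                "total_kb": total // 1024,
                "used_kb": (total - free) // 1024,
                "avail_kb": avail // 1024,
            }
        return usage
    ```

    This mirrors the columns df -P -k reports (df also uses f_bavail, which is why "used" plus "available" is usually less than "total" on filesystems with reserved blocks), while keeping everything inside the Python process.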