Tags: linux, du

Optimizing `du` Command Performance with Large Number of Files


I'm currently facing an issue where the `du` command takes a long time when run on a directory containing a large number of files. I'm using it to calculate the total size of that directory.

Total number of files: 956878

Here is the command I've been using:

```
root@DEB-NUC11PAH-G6PA2500045B:~# time du --total /var/lib/docker/volumes/data.vol/_data/out/priority.images | tail -n 1 | cut -f1
1710732

real    0m4.255s
user    0m0.671s
sys     0m2.453s
```

In this case, the command took approximately 4.3 seconds of wall-clock time to execute.

I would greatly appreciate any advice or suggestions from the community on optimizing the `du` command, or on alternative approaches for efficiently calculating the total size of a directory with a large number of files.


Solution

  • Most implementations of the Linux `du` command (and its alternatives) are already well optimized. If you only need the total, add the `-s` flag so that `du` prints much less (printing a million lines can be slow, especially to the terminal); see the example at the end of this answer.

    Most of the time is spent in the kernel waiting for filesystem I/O. To speed that up, your options essentially come down to faster storage (an SSD rather than an HDD) and a warm filesystem cache (i.e. the relevant inodes already being in memory from a previous run).

    The reason `du` is faster on an SSD than on an HDD is that random access (i.e. seeks) is much faster on an SSD. On an ext4 filesystem, `du` needs approximately one seek per inode, i.e. at least one seek per file.

    Thus if you have 956878 files, that is at least 956878 seeks. An HDD can do ~200 seeks per second, so `du` would take at least ~80 minutes. (Actual ext4 performance is better, because ext4 places inodes next to each other, so reading one inode brings several of them into the cache.) An SSD can do ~100 000 IOPS (which roughly translates to seeks per second), so `du` finishes in less than 10 seconds. (In practice it is even faster because of the cache.)
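As a rough sanity check, the back-of-the-envelope arithmetic behind those two estimates (using the ~200 seeks/s HDD and ~100 000 IOPS SSD figures assumed above) can be reproduced in the shell:

```
# Illustrative only: integer estimates derived from the figures quoted above.
echo $((956878 / 200))      # HDD: ~4784 seconds, i.e. roughly 80 minutes
echo $((956878 / 100000))   # SSD: ~9 seconds, i.e. under 10 seconds
```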
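For completeness, here is a sketch of the simplified invocation on the directory from the question (the path is copied from the original command; actual timings depend on your hardware and on how warm the cache is):

```
# -s (--summarize) prints a single line per argument instead of one line per
# subdirectory, so there is nothing to pipe through tail; cut -f1 keeps the size.
du -s /var/lib/docker/volumes/data.vol/_data/out/priority.images | cut -f1

# Human-readable variant of the same total.
du -sh /var/lib/docker/volumes/data.vol/_data/out/priority.images
```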