I'm currently facing an issue with the du command taking a long time on a directory containing a large number of files. I'm using the command to calculate the total size of the directory.
Total number of files: 956878
Here is the command I've been using:
root@DEB-NUC11PAH-G6PA2500045B:~# time du --total /var/lib/docker/volumes/data.vol/_data/out/priority.images | tail -n 1 | cut -f1
1710732
real 0m4.255s
user 0m0.671s
sys 0m2.453s
In this case, the command took approximately 4.3 seconds (real time) to execute.
I would greatly appreciate any advice or suggestions from the community on optimizing the du command, or alternative approaches to efficiently calculating the total size of a directory with a large number of files.
Most implementations of the Linux du command (and its alternatives) are already well optimized. If you need only the total, add the -s flag to make du print much less. (Printing a million lines can be slow, especially to the terminal.)
Most of the time is spent in the kernel waiting for filesystem I/O. To speed that up, your options are:
Store the data in memory, using the tmpfs filesystem. du on tmpfs is extremely fast. My measurements with du on a directory containing 956878 files on tmpfs: real 0m0.020s; user 0m0.007s; sys 0m0.014s.
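A minimal sketch of such a setup, assuming a hypothetical mount point /mnt/ramdisk and a 4g size limit (adjust both to your data; note that tmpfs contents are lost on reboot):

# Create a RAM-backed filesystem and copy the data into it:
mount -t tmpfs -o size=4g tmpfs /mnt/ramdisk
cp -a /var/lib/docker/volumes/data.vol/_data/out/priority.images /mnt/ramdisk/
# du on tmpfs never touches the disk, so it is fast even on the first run:
time du -s /mnt/ramdisk/priority.images | cut -f1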
Store the data on an SSD (rather than an HDD). (Most probably you are already doing this, judging by your time measurements.)
Use a different filesystem supported by Linux. I think the default ext4 filesystem is reasonably fast even on HDD (but I don't have any benchmark results to back up this claim). Are you already using ext4? (Check the contents of /proc/mounts, but it can get complicated; ask a separate question on https://unix.stackexchange.com .)
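For example, findmnt (part of util-linux) can resolve which mount a path belongs to and print its filesystem type, without manually parsing /proc/mounts:

# Print the mount point, filesystem type and backing device for the path:
findmnt -T /var/lib/docker/volumes/data.vol/_data/out/priority.images -o TARGET,FSTYPE,SOURCE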
Run du multiple times. Subsequent runs will be faster, because some filesystem metadata is already cached in memory. If possible, configure the Linux VFS cache so that it keeps filesystem metadata cached indefinitely. (I don't know whether that is possible; ask a separate question on https://unix.stackexchange.com .)
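A quick way to observe the cache effect, and one knob worth trying: the vm.vfs_cache_pressure sysctl biases the kernel toward keeping inode and dentry caches (values lower than the default 100 keep more metadata cached; 0 can cause out-of-memory conditions), though it is a hint, not a guarantee that metadata stays cached forever:

# Cold cache (slow), then warm cache (much faster):
time du -s /var/lib/docker/volumes/data.vol/_data/out/priority.images > /dev/null
time du -s /var/lib/docker/volumes/data.vol/_data/out/priority.images > /dev/null
# Bias the kernel toward keeping filesystem metadata cached (default is 100):
sysctl -w vm.vfs_cache_pressure=1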
The reason why du on an SSD is faster than on an HDD is that random access (i.e. seeks) is much faster on an SSD. Here is (approximately) how many seeks du needs on an ext4 filesystem:
For each file and directory, 1 seek to the inode. (The inode contains the file size and the pointers to the file data.)
For each directory, at least 1 seek to the dirent data. For directories containing many files, 2 or more seeks.
Thus if you have 956878 files, du needs at least 956878 seeks. An HDD can do ~200 seeks per second, so du would take at least ~80 minutes. (Actual ext4 performance is better, because ext4 places inodes next to each other, so reading one inode also reads and caches several neighboring ones.) An SSD can do ~100 000 IOPS (which roughly translates to seeks per second), so du will finish in less than 10 seconds. (In practice it's even faster because of the cache.)
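The back-of-the-envelope arithmetic above, as shell arithmetic (the ~200 and ~100 000 figures are the rough estimates from the previous paragraph):

# HDD: 956878 seeks at ~200 seeks/s, converted to minutes:
echo $((956878 / 200 / 60))   # prints 79
# SSD: 956878 seeks at ~100000 IOPS, in seconds:
echo $((956878 / 100000))     # prints 9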