I wanted to find the most efficient way to recursively count subdirectories and files, and came up with the tests below. Some seem to work, but the results are inconsistent.
I guess this post straddles the line between StackOverflow and SuperUser, but it does relate to a script, so I guess this is the right place.
#!/bin/bash
# Default to home directory if no argument is provided
dir="${1:-$HOME}"
echo "Analyzing directories and files in: $dir"
echo
# Function to time and run a command, and print the count
time_command() {
    local description="$1"
    local command="$2"
    echo "$description"
    echo "Running: $command"
    start_time=$(date +%s.%N)
    result=$(eval "$command")
    end_time=$(date +%s.%N)
    duration=$(echo "$end_time - $start_time" | bc)
    echo "Count: $result"
    echo "Time: $duration seconds"
}
# Methods to count directories
dir_methods=(
"Directory Method 1 (find): find '$dir' -type d | wc -l"
"Directory Method 2 (tree): tree -d '$dir' | tail -n 1 | awk '{print \$3}'"
"Directory Method 3 (du): echo 'deprecated: usually around double length of ''find'' command'"
"Directory Method 4 (ls): ls -lR '$dir' | grep '^d' | wc -l"
"Directory Method 5 (bash loop): count=0; for d in \$(find '$dir' -type d); do count=\$((count + 1)); done; echo \$count"
"Directory Method 6 (perl): perl -MFile::Find -le 'find(sub { \$File::Find::dir =~ /\\/ and \$n++ }, \"$dir\"); print \$n'"
"Directory Method 7 (python): python3 -c 'import os; print(sum([len(dirs) for _, dirs, _ in os.walk(\"$dir\")]))'"
)
# Methods to count files
file_methods=(
"File Method 1 (find): find '$dir' -type f | wc -l"
"File Method 2 (tree): tree -fi '$dir' | grep -E '^[├└─] ' | wc -l"
"File Method 3 (ls): ls -lR '$dir' | grep -v '^d' | wc -l"
"File Method 4 (bash loop): count=0; for f in \$(find '$dir' -type f); do count=\$((count + 1)); done; echo \$count"
"File Method 5 (perl): perl -MFile::Find -le 'find(sub { -f and \$n++ }, \"$dir\"); print \$n'"
"File Method 6 (python): python3 -c 'import os; print(sum([len(files) for _, _, files in os.walk(\"$dir\")]))'"
)
# Run and time each directory counting method
echo "Counting directories..."
echo
for method in "${dir_methods[@]}"; do
    description="${method%%:*}"
    command="${method#*: }"
    if [[ "$description" == *"(du)"* ]]; then
        echo "$description"
        echo "Running: $command"
        eval "$command"
    else
        time_command "$description" "$command"
    fi
    echo
done
# Run and time each file counting method
echo "Counting files..."
echo
for method in "${file_methods[@]}"; do
    description="${method%%:*}"
    command="${method#*: }"
    time_command "$description" "$command"
    echo
done
Below is a run of the above. As you can see, the number of directories and files found is different in every case(!), and some of the tests are clearly broken so it would be good to know how to fix those.
Analyzing directories and files in: /home/boss
Counting directories...
Directory Method 1 (find)
Running: find '/home/boss' -type d | wc -l
Count: 598844
Time: 11.949245266 seconds
Directory Method 2 (tree)
Running: tree -d '/home/boss' | tail -n 1 | awk '{print $3}'
Count:
Time: 2.776698115 seconds
Directory Method 3 (du)
Running: echo 'deprecated: usually around double length of ''find'' command'
deprecated: usually around double length of find command
Directory Method 4 (ls)
Running: ls -lR '/home/boss' | grep '^d' | wc -l
Count: 64799
Time: 6.522804741 seconds
Directory Method 5 (bash loop)
Running: count=0; for d in $(find '/home/boss' -type d); do count=$((count + 1)); done; echo $count
Count: 604654
Time: 14.693009738 seconds
Directory Method 6 (perl)
Running: perl -MFile::Find -le 'find(sub { $File::Find::dir =~ /\/ and $n++ }, "/home/boss"); print $n'
String found where operator expected (Missing semicolon on previous line?) at -e line 1, at end of line
Unknown regexp modifier "/h" at -e line 1, at end of line
Unknown regexp modifier "/e" at -e line 1, at end of line
Can't find string terminator '"' anywhere before EOF at -e line 1.
Count:
Time: .019156779 seconds
Directory Method 7 (python)
Running: python3 -c 'import os; print(sum([len(dirs) for _, dirs, _ in os.walk("/home/boss")]))'
Count: 599971
Time: 15.013263266 seconds
Counting files...
File Method 1 (find)
Running: find '/home/boss' -type f | wc -l
Count: 5184830
Time: 13.066028457 seconds
File Method 2 (tree)
Running: tree -fi '/home/boss' | grep -E '^[├└─] ' | wc -l
Count: 0
Time: 8.431054237 seconds
File Method 3 (ls)
Running: ls -lR '/home/boss' | grep -v '^d' | wc -l
Count: 767236
Time: 6.593778380 seconds
File Method 4 (bash loop)
Running: count=0; for f in $(find '/home/boss' -type f); do count=$((count + 1)); done; echo $count
Count: 5196437
Time: 40.861512698 seconds
File Method 5 (perl)
Running: perl -MFile::Find -le 'find(sub { -f and $n++ }, "/home/boss"); print $n'
Count: 5186461
Time: 54.353541730 seconds
File Method 6 (python)
Running: python3 -c 'import os; print(sum([len(files) for _, _, files in os.walk("/home/boss")]))'
Count: 5187084
Time: 14.910791357 seconds
I removed the ls methods as they were unreliable (ls -lR doesn't just output the files in each directory, it also outputs directory names and "total" lines, which should be counted as neither directories nor files).
I changed the Perl methods to take advantage of the postprocess function, which only runs when leaving a directory, so no testing for file type is needed.
I also fixed the tree methods: at least on my system, tree needs -a to include filenames starting with a dot. You can use the awk trick for both files and directories, so there is no need to count the lines.
# Methods to count directories
dir_methods=(
"Directory Method 1 (find): find '$dir' -type d | wc -l"
"Directory Method 2 (tree): tree -afi '$dir' | tail -n 1 | awk '{print \$1}'"
"Directory Method 5 (bash loop): count=0; for d in \$(find '$dir' -type d); do count=\$((count + 1)); done; echo \$count"
"Directory Method 6 (perl): perl -MFile::Find -le 'find({wanted => sub {}, postprocess => sub { ++\$n }}, \"$dir\"); print \$n'"
"Directory Method 7 (python): python3 -c 'import os; print(sum([len(dirs) for _, dirs, _ in os.walk(\"$dir\")]))'"
)
# Methods to count files
file_methods=(
"File Method 1 (find): find '$dir' -type f | wc -l"
"File Method 2 (tree): tree -a '$dir' | tail -n1 | awk '{print \$3}'"
"File Method 4 (bash loop): count=0; for f in \$(find '$dir' -type f); do count=\$((count + 1)); done; echo \$count"
"File Method 5 (perl): perl -MFile::Find -le 'find({wanted => sub { ++\$n },postprocess => sub {--\$n}}, \"$dir\"); print \$n'"
"File Method 6 (python): python3 -c 'import os; print(sum([len(files) for _, _, files in os.walk(\"$dir\")]))'"
)
The results are still not the same, though: when counting directories, python and tree don't count the top directory.
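One way to make the counts comparable is to tell find to skip the starting directory as well. This is only a sketch, assuming GNU find (for -mindepth and -printf); the two helper names are made up for illustration:

```shell
#!/bin/bash
# Count directories strictly below DIR (the top directory is excluded),
# which is the convention the tree and python methods follow.
count_dirs_below() {
    find "$1" -mindepth 1 -type d -printf '\n' | wc -l
}

# Count directories including DIR itself, which is what a plain
# "find DIR -type d" does.
count_dirs_with_top() {
    find "$1" -type d -printf '\n' | wc -l
}
```

On any tree the two numbers should differ by exactly one.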
If there's a file or directory with a space in its name, the "bash loop" methods count each word separately, so they are wrong.
If there's a file or directory with a newline in its name, even the find method is wrong. You can fix it by not printing the name at all:
"Directory Method 1 (find): find '$dir' -type d -printf '\\n' | wc -l"
and similarly for files. You can fix the "bash loop" methods in the same way.
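If you want to keep an actual loop over the names, another option is to read NUL-delimited output instead of word-splitting the result of find. A minimal sketch, assuming bash and a find that supports -print0; count_with_loop is a hypothetical helper, not part of the script above:

```shell
#!/bin/bash
# count_with_loop DIR TYPE
# Count entries under DIR whose find type is TYPE (d = directory, f = file).
count_with_loop() {
    local dir=$1 type=$2 count=0 entry
    # -print0 ends each name with a NUL byte, which cannot occur inside a
    # path, so names containing spaces or newlines stay in one piece.
    while IFS= read -r -d '' entry; do
        count=$((count + 1))
    done < <(find "$dir" -type "$type" -print0)
    echo "$count"
}
```

It would be called as count_with_loop "$HOME" f or count_with_loop "$HOME" d. The loop still runs in the shell, so it will probably remain slower than piping find into wc -l.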