python bash recursion

Time different methods to recurse through directories in Linux


I wanted to find the most efficient way to recursively count subdirectories and files, and came up with the tests below. Some seem to work, but the results are inconsistent.

I guess this post straddles the line between StackOverflow and SuperUser, but it does relate to a script, so I guess this is the right place.

#!/bin/bash

# Default to home directory if no argument is provided
dir="${1:-$HOME}"

echo "Analyzing directories and files in: $dir"
echo

# Function to time and run a command, and print the count
time_command() {
    local description="$1"
    local command="$2"
    echo "$description"
    echo "Running: $command"
    start_time=$(date +%s.%N)
    result=$(eval "$command")
    end_time=$(date +%s.%N)
    duration=$(echo "$end_time - $start_time" | bc)
    echo "Count: $result"
    echo "Time: $duration seconds"
}

# Methods to count directories
dir_methods=(
    "Directory Method 1 (find): find '$dir' -type d | wc -l"
    "Directory Method 2 (tree): tree -d '$dir' | tail -n 1 | awk '{print \$3}'"
    "Directory Method 3 (du): echo 'deprecated: usually around double length of ''find'' command'"
    "Directory Method 4 (ls): ls -lR '$dir' | grep '^d' | wc -l"
    "Directory Method 5 (bash loop): count=0; for d in \$(find '$dir' -type d); do count=\$((count + 1)); done; echo \$count"
    "Directory Method 6 (perl): perl -MFile::Find -le 'find(sub { \$File::Find::dir =~ /\\/ and \$n++ }, \"$dir\"); print \$n'"
    "Directory Method 7 (python): python3 -c 'import os; print(sum([len(dirs) for _, dirs, _ in os.walk(\"$dir\")]))'"
)

# Methods to count files
file_methods=(
    "File Method 1 (find): find '$dir' -type f | wc -l"
    "File Method 2 (tree): tree -fi '$dir' | grep -E '^[├└─] ' | wc -l"
    "File Method 3 (ls): ls -lR '$dir' | grep -v '^d' | wc -l"
    "File Method 4 (bash loop): count=0; for f in \$(find '$dir' -type f); do count=\$((count + 1)); done; echo \$count"
    "File Method 5 (perl): perl -MFile::Find -le 'find(sub { -f and \$n++ }, \"$dir\"); print \$n'"
    "File Method 6 (python): python3 -c 'import os; print(sum([len(files) for _, _, files in os.walk(\"$dir\")]))'"
)

# Run and time each directory counting method
echo "Counting directories..."
echo
for method in "${dir_methods[@]}"; do
    description="${method%%:*}"
    command="${method#*: }"
    if [[ "$description" == *"(du)"* ]]; then
        echo "$description"
        echo "Running: $command"
        eval "$command"
    else
        time_command "$description" "$command"
    fi
    echo
done

# Run and time each file counting method
echo "Counting files..."
echo
for method in "${file_methods[@]}"; do
    description="${method%%:*}"
    command="${method#*: }"
    time_command "$description" "$command"
    echo
done

Below is a run of the script above. As you can see, the number of directories and files found differs from method to method(!), and some of the tests are clearly broken, so it would be good to know how to fix those.

Analyzing directories and files in: /home/boss

Counting directories...

Directory Method 1 (find)
Running: find '/home/boss' -type d | wc -l
Count: 598844
Time: 11.949245266 seconds

Directory Method 2 (tree)
Running: tree -d '/home/boss' | tail -n 1 | awk '{print $3}'
Count:
Time: 2.776698115 seconds

Directory Method 3 (du)
Running: echo 'deprecated: usually around double length of ''find'' command'
deprecated: usually around double length of find command

Directory Method 4 (ls)
Running: ls -lR '/home/boss' | grep '^d' | wc -l
Count: 64799
Time: 6.522804741 seconds

Directory Method 5 (bash loop)
Running: count=0; for d in $(find '/home/boss' -type d); do count=$((count + 1)); done; echo $count
Count: 604654
Time: 14.693009738 seconds

Directory Method 6 (perl)
Running: perl -MFile::Find -le 'find(sub { $File::Find::dir =~ /\/ and $n++ }, "/home/boss"); print $n'
String found where operator expected (Missing semicolon on previous line?) at -e line 1, at end of line
Unknown regexp modifier "/h" at -e line 1, at end of line
Unknown regexp modifier "/e" at -e line 1, at end of line
Can't find string terminator '"' anywhere before EOF at -e line 1.
Count:
Time: .019156779 seconds

Directory Method 7 (python)
Running: python3 -c 'import os; print(sum([len(dirs) for _, dirs, _ in os.walk("/home/boss")]))'
Count: 599971
Time: 15.013263266 seconds

Counting files...

File Method 1 (find)
Running: find '/home/boss' -type f | wc -l
Count: 5184830
Time: 13.066028457 seconds

File Method 2 (tree)
Running: tree -fi '/home/boss' | grep -E '^[├└─] ' | wc -l
Count: 0
Time: 8.431054237 seconds

File Method 3 (ls)
Running: ls -lR '/home/boss' | grep -v '^d' | wc -l
Count: 767236
Time: 6.593778380 seconds

File Method 4 (bash loop)
Running: count=0; for f in $(find '/home/boss' -type f); do count=$((count + 1)); done; echo $count
Count: 5196437
Time: 40.861512698 seconds

File Method 5 (perl)
Running: perl -MFile::Find -le 'find(sub { -f and $n++ }, "/home/boss"); print $n'
Count: 5186461
Time: 54.353541730 seconds

File Method 6 (python)
Running: python3 -c 'import os; print(sum([len(files) for _, _, files in os.walk("/home/boss")]))'
Count: 5187084
Time: 14.910791357 seconds

Solution

  • I removed the ls methods as they were unreliable: ls -lR doesn't just list the entries in each directory, it also prints each directory's name and a total line, and those lines should be counted as neither directories nor files.

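    I changed the Perl methods to take advantage of File::Find's postprocess callback, which runs once per directory, just before find leaves it, so no testing for file type is needed. For directories, postprocess simply increments the counter; for files, wanted counts every entry and postprocess subtracts one for each directory.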

    I also fixed the tree methods: at least on my system, tree needs -a to include names starting with a dot. The awk trick works for both directories and files, because the last line of tree's output reports both counts, so there is no need to count lines.

    # Methods to count directories
    dir_methods=(
        "Directory Method 1 (find): find '$dir' -type d | wc -l"
        "Directory Method 2 (tree): tree -afi '$dir' | tail -n 1 | awk '{print \$1}'"
        "Directory Method 5 (bash loop): count=0; for d in \$(find '$dir' -type d); do count=\$((count + 1)); done; echo \$count"
        "Directory Method 6 (perl): perl -MFile::Find -le 'find({wanted => sub {}, postprocess => sub { ++\$n }}, \"$dir\"); print \$n'"
        "Directory Method 7 (python): python3 -c 'import os; print(sum([len(dirs) for _, dirs, _ in os.walk(\"$dir\")]))'"
    )
    
    # Methods to count files
    file_methods=(
        "File Method 1 (find): find '$dir' -type f | wc -l"
        "File Method 2 (tree): tree -a '$dir' | tail -n1 | awk '{print \$3}'"
        "File Method 4 (bash loop): count=0; for f in \$(find '$dir' -type f); do count=\$((count + 1)); done; echo \$count"
        "File Method 5 (perl): perl -MFile::Find -le 'find({wanted => sub { ++\$n },postprocess => sub {--\$n}}, \"$dir\"); print \$n'"
        "File Method 6 (python): python3 -c 'import os; print(sum([len(files) for _, _, files in os.walk(\"$dir\")]))'"
    )
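
    As a small illustration of the awk trick: tree ends its output with a summary line of the form "N directories, M files", so field 1 is the directory count and field 3 the file count. With -d, tree reports only the directory count, which would explain the empty result for $3 in the question. The helper name and the sample line below are made up for the demonstration.

    ```bash
    # Hypothetical helper: split a tree summary line into its two counts.
    tree_counts() {
        # Field 1 is the number of directories, field 3 the number of files.
        printf '%s\n' "$1" | awk '{print $1, $3}'
    }

    # Sample summary line (illustrative, not real output).
    tree_counts '3 directories, 7 files'
    ```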
    

    The results are still not the same, though: when counting directories, python and tree don't count the top directory.
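
    To compare like with like, find can be told to skip the starting directory. This is a minimal sketch, assuming a find that supports -mindepth (GNU and BSD find do); count_subdirs is a name chosen here. Small differences may still remain, because os.walk lists symlinks to directories among dirs and every non-directory among files, while find -type d and -type f match only real directories and regular files.

    ```bash
    # Count directories below "$1", excluding "$1" itself.
    # Assumes no directory name contains a newline.
    count_subdirs() {
        find "$1" -mindepth 1 -type d | wc -l
    }
    ```

    Called as count_subdirs "$dir", this gives a number that excludes the top directory, like the python and tree counts.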

    If there's a file or directory with a space in its name, the "bash loop" methods count each word separately, because the unquoted $(find ...) output is split on whitespace, so they're wrong.

    If there's a file or directory with a newline in its name, even the find method is wrong. You can fix it by not printing the name at all:

        "Directory Method 1 (find): find '$dir' -type d -printf '\\n'  | wc -l"
    

    and similarly for files. You can fix the "bash loop" methods in the same way.
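
    A sketch of the loop with names kept out of the word splitting, written for bash and using hypothetical function names. count_by_token has find print a fixed token instead of the name, which assumes GNU find's -printf. count_nul is an alternative that is not part of the fix above: it reads NUL-terminated names, which helps when the loop body needs the actual path.

    ```bash
    # Emit one fixed token per match, so odd characters in names
    # can never reach the loop. Requires GNU find for -printf.
    count_by_token() {
        local dir="$1" kind="$2" count=0 tok
        for tok in $(find "$dir" -type "$kind" -printf 'x\n'); do
            count=$((count + 1))
        done
        echo "$count"
    }

    # Alternative: read NUL-terminated names one at a time, so spaces
    # and newlines in names are preserved in "$path".
    count_nul() {
        local dir="$1" kind="$2" count=0 path
        while IFS= read -r -d '' path; do
            count=$((count + 1))
        done < <(find "$dir" -type "$kind" -print0)
        echo "$count"
    }
    ```

    Both take the directory and a find type letter, for example count_nul "$dir" d for directories or count_by_token "$dir" f for files.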