
How to optimize this Perl file find?


Requirement: To get the count of directories under the input directory that match the following criteria:

  1. the directories can have any name except "DIR1", "DIR2", "DIR3" etc.
  2. the directories inside "DIR1", "DIR2", "DIR3" etc. should not be counted
  3. only directories should be counted, not files

    use strict;
    use File::Find;

    my ($inputdir) = @ARGV;
    my (@branches, $branch, $directory, @directories);
    my $count = 0;

    find(\&wanted, $inputdir);
    while ( defined($directory = shift @directories) ) {
        if (-d $directory) {
            next if ($directory =~ "DIR1" || $directory =~ "DIR2" || $directory =~ "DIR3");
            push @branches, $directory;
            $count++;
        }
    }

    print "Total number of directories: $count \n";

    sub wanted {
        push @directories, $File::Find::name;
        return @directories;
    }

This code gives the required output, but it takes quite a lot of time.

Please suggest ways to reduce the time taken by improving this code.


Solution

  • The File::Find::Rule module can skip (prune) whole branches altogether, instead of descending into the excluded directories and filtering their contents out afterwards as the original code does

    use warnings;
    use strict;
    
    use File::Find::Rule;
    
    my $start_dir = shift || '.';
    
    my $re_skip = qr/DIR(?:1|2|3)/;
    
    my $ok   = File::Find::Rule->directory;  # add selection rules as needed
    my $skip = File::Find::Rule->directory
        ->name(qr/$re_skip/)
        ->prune
        ->discard; 
    
    my @dirs = File::Find::Rule -> any($skip, $ok) -> in($start_dir); 
    
    print "Total: ", scalar @dirs, "\n";
    

    This will still take some time on a large filesystem, but it should be much faster.

    As a one-liner, if all you need from this is just a quick count:

    perl -MFile::Find::Rule -wE'
        $ffr = File::Find::Rule; 
        $skip = $ffr->directory->name(qr/DIR(?:1|2|3)/)->prune->discard; 
        say scalar $ffr->any($skip, $ffr->directory)->in(".")'
    

    where I've consolidated some of the code from the script.

    The next step would be to run the work in parallel (I'd use fork here rather than threads). Group the subdirectories so that the groups are roughly balanced in their directory counts, and run something like the above in parallel over those groups; a sketch follows. The gain will depend on your hardware, but there should be a good speedup factor.
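
    Here is a minimal sketch of that idea, assuming the CPAN module Parallel::ForkManager is available. It uses the simplest possible grouping (one child process per immediate subdirectory of the start directory) rather than carefully balanced groups; the worker count is illustrative. Like the script above, it counts the start directory itself.

    use warnings;
    use strict;

    use File::Find::Rule;
    use Parallel::ForkManager;    # assumed to be installed from CPAN

    my $start_dir = shift || '.';
    my $re_skip   = qr/DIR(?:1|2|3)/;

    # One unit of work per immediate subdirectory that isn't itself excluded
    opendir my $dh, $start_dir or die "Can't open $start_dir: $!";
    my @top = map  { "$start_dir/$_" }
              grep { !/$re_skip/ and -d "$start_dir/$_" }
              grep { $_ ne '.' and $_ ne '..' }
              readdir $dh;
    closedir $dh;

    # Count the start directory and its surviving immediate subdirectories here;
    # each child counts only what lies below its own subdirectory
    my $total = 1 + @top;

    my $pm = Parallel::ForkManager->new(4);    # number of workers; tune for your hardware

    $pm->run_on_finish( sub {
        my ($pid, $exit, $ident, $signal, $core, $count_ref) = @_;
        $total += $$count_ref if ref $count_ref;
    });

    for my $dir (@top) {
        $pm->start and next;    # parent forks a child and moves on

        my $skip = File::Find::Rule->directory->name($re_skip)->prune->discard;
        my $ok   = File::Find::Rule->directory;
        my @dirs = File::Find::Rule->any($skip, $ok)->in($dir);

        my $count = @dirs - 1;      # in() also returns $dir itself, already counted above
        $pm->finish(0, \$count);    # ship the count back to the parent
    }
    $pm->wait_all_children;

    print "Total: $total\n";

    For better balance, as suggested above, you could instead partition the subdirectories into a fixed number of groups by their (estimated) sizes and fork one child per group, so that one huge subtree doesn't dominate the runtime.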