regex perl perlscript

Use Perl to count occurrences of all words in a file or in all files in a directory


I am trying to write a Perl script that takes three command-line arguments.

  1. The first argument is the input file or directory.
    • If it is a file, the script counts the occurrences of every word in it.
    • If it is a directory, the script recurses through its subdirectories and counts the occurrences of every word in all the files it finds.
  2. The second argument is a number N: how many of the most frequent words to display.
    • Only these N words and their counts are printed to the console.
  3. The third argument is the output file, to which the counts for all words are written.

The recursive directory search seems to work: the script finds all occurrences of the words in each file and prints them to the console.

How can I print these counts to an output file? And how can I use the second argument (say, 5) to print only that many of the most frequent words to the console while still writing all the words to the output file?

The following is what I have so far:

#!/usr/bin/perl -w

use strict;

search(shift);

my $input  = $ARGV[0];
my $output = $ARGV[1];
my %count;

my $file = shift or die "ERROR: $0 FILE\n";
open my $filename, '<', $file or die "ERROR: Could not open file!";
if ( -f $filename ) {
    print("This is a file!\n");
    while ( my $line = <$filename> ) {
        chomp $line;
        foreach my $str ( $line =~ /\w+/g ) {
            $count{$str}++;
        }
    }
    foreach my $str ( sort keys %count ) {
        printf "%-20s %s\n", $str, $count{$str};
    }
}
close($filename);
if ( -d $input ) {

    sub search {
        my $path = shift;
        my @dirs = glob("$path/*");
        foreach my $filename (@dirs) {
            if ( -f $filename ) {
                open( FILE, $filename ) or die "ERROR: Can't open file";
                while ( my $line = <FILE> ) {
                    chomp $line;
                    foreach my $str ( $line =~ /\w+/g ) {
                        $count{$str}++;
                    }
                }
                foreach my $str ( sort keys %count ) {
                    printf "%-20s %s\n", $str, $count{$str};
                }
            }
            # Recursive search
            elsif ( -d $filename ) {
                search($filename);
            }
        }
    }
}

Solution

  • I have figured it out. The following is my solution. I'm not sure if it's the best way to do it, but it works.

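        # Check that there are three arguments on the command line
        if (@ARGV < 3) {
           die "ERROR: There must be three arguments!\n";
        }
        # Input path, number of words to show on the console, output file
        my ($input, $num, $output) = @ARGV;
        my %count;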
        # Open the file
        my $file = shift or die "ERROR: $0 FILE\n";
        open my $fh,'<', $file or die "ERROR: Could not open file!";
        # Check if it is a file
        if (-f $fh) {
           print("This is a file!\n");
           # Go through each line
           while (my $line = <$fh>) {
              chomp $line;
              # Count the occurrences of each word
              foreach my $str ($line =~ /\b[[:alpha:]]+\b/g) {
                 $count{$str}++;
              }
           }
        }
    
        # Check if the INPUT is a directory
        if (-d $input) {
           # Call subroutine to search directory recursively
           search_dir($input);
        }
        # Close the file
        close($fh);
        my $high_count = 0;
        # Open the output file for writing
        open my $fileh,'>', $output or die "ERROR: Could not open file!\n";
        # Sort words by descending count; print the top $num to the console and all of them to the file
        foreach my $str (sort {$count{$b} <=> $count{$a}} keys %count) {
           $high_count++;
           if ($high_count <= $num) {
              printf "%-31s %s\n", $str, $count{$str};
           }
           printf $fileh "%-31s %s\n", $str, $count{$str};
        }
        # Close the output file
        close($fileh);
        exit;
    
        # Subroutine to search through each directory recursively
        sub search_dir {
           my $path = shift;
           my @dirs = glob("$path/*");
           # Loop through filenames
           foreach my $filename (@dirs) {
              # Check if it is a file
              if (-f $filename) {
                 # Open the file
                 open(my $in, '<', $filename) or die "ERROR: Can't open file";
                 # Go through each line
                 while (my $line = <$in>) {
                    chomp $line;
                    # Count every word on the line (/g returns all matches)
                    foreach my $str ($line =~ /\b[[:alpha:]]+\b/g) {
                       $count{$str}++;
                    }
                 }
                 # Close the file
                 close($in);
              }
              elsif (-d $filename) {
                 search_dir($filename);
              }
           }
        }
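
A possible variation on the solution above, offered as a sketch and not as the original author's code: the core File::Find module can take over the recursive directory walk, and the counting and ranking can be split into small subroutines. The names count_file, count_words, ranked_words, and run are hypothetical, and the argument order INPUT NUM OUTPUT is assumed from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Add the alphabetic words of one file to the hash referenced by $count.
sub count_file {
    my ($path, $count) = @_;
    open my $fh, '<', $path or do { warn "Skipping $path: $!\n"; return };
    while (my $line = <$fh>) {
        # /g in list context returns every match on the line
        $count->{$_}++ for $line =~ /\b[[:alpha:]]+\b/g;
    }
    close $fh;
    return;
}

# Count words in a single file, or in every regular file below a directory.
sub count_words {
    my ($input) = @_;
    my %count;
    if (-d $input) {
        # With no_chdir, $_ holds the full path of each entry visited
        find({ wanted => sub { count_file($_, \%count) if -f $_ },
               no_chdir => 1 }, $input);
    }
    else {
        count_file($input, \%count);
    }
    return \%count;
}

# Words ordered by descending count; ties are broken alphabetically.
sub ranked_words {
    my ($count) = @_;
    return sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count;
}

# Print the top $num words to the console and every word to $output.
sub run {
    my ($input, $num, $output) = @_;
    my $count = count_words($input);
    open my $out, '>', $output or die "ERROR: Could not open $output: $!\n";
    my $shown = 0;
    for my $word (ranked_words($count)) {
        my $line = sprintf "%-31s %s\n", $word, $count->{$word};
        print $line if $shown++ < $num;    # console: top $num only
        print {$out} $line;                # file: every word
    }
    close $out or die "ERROR: Could not close $output: $!\n";
    return;
}

# Script entry point; skipped when the file is loaded without arguments.
if (@ARGV) {
    @ARGV == 3 or die "Usage: $0 INPUT NUM OUTPUT\n";
    run(@ARGV);
}
```

Saved under a name of your choice, it would be invoked with the input path, the number of words to show, and the output file, in that order. Breaking ties alphabetically is a design choice added here so that repeated runs give the same order; the original sorts by count only.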