perl directory matching file-find

Reading in document with similar filename from directory


I have a batch of annual corporate filings, each named using the following format: company identifier, two-digit year, and a random set of digits (e.g., 00000217-12-00010.txt). I want to compare the contents of each annual filing to the filing submitted by the same company in the prior year (e.g., 000002178-13-00010.txt compared to 000002178-12-00005.txt). As I loop through each file, how can I identify the preceding year’s filing for each document so that I can read both documents in as separate strings?

use strict ;
use warnings ;
use autodie ;
use File::Find  ;

### BEGIN BY READING IN EACH FILE ONE BY ONE. ###
################## LOOP BEGIN ##################
# Process every file with a `txt` file type

my $parent = "D:/Cleaned 10Ks" ;
my ($par_dir, $sub_dir);
opendir($par_dir, $parent);

while (my $sub_folders = readdir($par_dir)) {
    next if ($sub_folders =~ /^\.\.?$/);  # skip . and .. (dots escaped so other short names are kept)
    my $path = $parent . '/' . $sub_folders;
    next unless (-d $path);   # skip anything that isn't a directory
    chdir($path) or die "Can't chdir to $path: $!";

    for my $filename ( grep -f, glob('*') ) {
        #### FIND THE PRIOR YEAR'S CORRESPONDING FILING AND READ BOTH IN AS STRINGS ###

Solution

  • Parse the filename into its components by splitting on -, then reduce the year by 1 and reassemble the name. The snag is the two-digit year: if the year is 00 you can't just subtract 1, and numeric subtraction drops the leading zero (09 - 1 gives 8, not 08), so the result has to be padded back to two digits. A proper way is to use a module for dates, but since 00 is the only wrap-around case you can do it manually.

    my ($comp_id, $year) = split '-', $filename;
    
    # sprintf restores the leading zero that subtraction drops (09 - 1 is 8)
    my $prev_year = ($year ne '00') ? sprintf('%02d', $year - 1) : '99';
    
    my $prev_year_base   = join '-', $comp_id, $prev_year;
    
    my ($prev_year_file) = glob "$prev_year_base*";
    

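    Only the first two fields returned by split are kept, since the rest differs between files. The prior year's filename is completed by globbing on these two components, which are assumed to identify it uniquely. If other entries may have names beginning the same way, the list returned by glob needs to be processed further. Since glob returns a list (here with one element), the parentheses around $prev_year_file are needed so that the assignment happens in list context and picks up that sole filename. If no prior-year filing exists, $prev_year_file is left undefined, so check it before opening the file.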
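
To cover the last part of the question (reading both filings in as separate strings), the steps above could be wrapped up roughly as follows. This is only a sketch: the helper names prior_year_prefix, slurp, and read_filing_pair are made up for illustration, it assumes the current directory is the one holding the filings (as after the chdir in the question), and it assumes that when several prior-year files match, the first one in sorted order is the one wanted.

```perl
use strict;
use warnings;

# Hypothetical helper: build the "company-yy" prefix of the prior year's
# filing from a name such as 000002178-13-00010.txt.
sub prior_year_prefix {
    my ($filename) = @_;
    my ($comp_id, $year) = split /-/, $filename;
    return unless defined $year && $year =~ /^\d{2}$/;
    # Perl's % takes the sign of the right operand, so (0 - 1) % 100 is 99;
    # sprintf pads the result back to two digits.
    my $prev_year = sprintf '%02d', ($year - 1) % 100;
    return join '-', $comp_id, $prev_year;
}

# Hypothetical helper: read a whole file into one string.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    local $/;    # undefine the record separator to read everything at once
    my $text = <$fh>;
    close $fh;
    return $text;
}

# Returns (current text, prior text), or an empty list when no
# prior-year filing is found in the current directory.
sub read_filing_pair {
    my ($filename) = @_;
    my $prefix = prior_year_prefix($filename) or return;
    my @candidates = grep { -f $_ } glob "${prefix}-*.txt";
    return unless @candidates;
    # Assumption: with several matches, take the first in sorted order.
    my ($prev_file) = sort @candidates;
    return (slurp($filename), slurp($prev_file));
}
```

Inside the inner loop of the question this might be called as my ($current, $previous) = read_filing_pair($filename) or next; so that filings without a predecessor are skipped.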