perl

How to search a file for the last block of consecutive lines that contain a keyword in Perl


Imagine a text file like the one below, where <some random text> could be anything or nothing, so the KEYWORD can appear anywhere in the line, alone or along with other text:

 1 <some random text>
 2 <some random text>KEYWORD<some random text>
 3 <some random text>KEYWORD<some random text>
 4 <some random text>
 5 <some random text>
 6 <some random text>KEYWORD<some random text>
 7 <some random text>
 8 <some random text>KEYWORD<some random text>
 9 <some random text>KEYWORD<some random text>
10 <some random text>KEYWORD<some random text>
11 <some random text>
12 <some random text>KEYWORD<some random text>
13 <some random text>KEYWORD<some random text>
14 <some random text>
15 <some random text>KEYWORD<some random text>
16 <some random text>

How can I get the last occurrence of 2 or more consecutive lines that contain the keyword (lines 12 and 13 in the example)? To be clear, I am not interested in lines (8, 9, 10) because although they contain the keyword and are consecutive, they are not the last, nor in line 15 because although it contains the keyword and is the last line with keyword, it is not part of 2 or more consecutive lines.


Solution

  • Record each run of consecutive lines with the pattern as it comes, always keeping the latest run that is long enough; once the whole file has been read, what you have kept is the very last such run. (Or read backwards if the file is large, per info added in a comment; see the second program below.)

    A straightforward way

    use warnings;
    use strict;
    use feature 'say';
    
    die "Usage: $0 file(s)\n"  if not @ARGV;
    
    my $threshold = 2;
    
    my (@buf, $cnt, @res);
    
    while (<>) {
        if (not /KEYWORD/) {
            $cnt = 0  if $cnt;
            @buf = () if @buf;
            next 
        }   
    
        ++$cnt;
        push @buf, $_; 
    
        if ($cnt >= $threshold) {
            @res = @buf;  # excessive copying; refine if a problem
        }
    }
    print for @res;
    

    Remove the @ARGV check to allow STDIN input, which <> reads with no files given.
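
    The comment in the loop above flags the repeated copying of @buf into @res. One possible refinement, sketched here under my own assumptions (the sub name last_run and its arguments are made up for illustration and are not part of the original program), is to copy nothing while a run is in progress and to keep a reference to the buffer only once a run has ended:

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper (an illustration, not the original program): return
    # the last run of at least $threshold consecutive lines from $fh matching $re.
    sub last_run {
        my ($fh, $re, $threshold) = @_;
        my ($buf, $res) = ([], []);

        while (my $line = <$fh>) {
            if ($line =~ $re) {
                push @$buf, $line;
                next;
            }
            # A run (possibly empty) just ended: keep it only if long enough.
            # Only the reference is reassigned, so no lines are copied.
            $res = $buf if @$buf >= $threshold;
            $buf = [];
        }
        # The final run may reach the end of the input
        $res = $buf if @$buf >= $threshold;

        return @$res;
    }

    # Demo on in-memory data; a real program would open a file instead
    my $data = join "\n", 'a', 'KEYWORD 1', 'KEYWORD 2', 'b', 'x KEYWORD', '';
    open my $in, '<', \$data or die "Can't open in-memory handle: $!";
    print for last_run($in, qr/KEYWORD/, 2);
    ```

    The result should match what the loop above produces; the difference is only that the kept run changes when a run ends rather than on every matching line.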

    Notes

    If you need to know where in the file these lines are, save the line number, $., along with each line.
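
    As a rough sketch of the line-number note, each buffered element can be a pair of $. and the line text. The sub name last_run_numbered and the sample data are my own assumptions, made up to mirror the layout of the example in the question:

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper (an illustration only): like the program above, but
    # each kept element is a pair [line number, line text].
    sub last_run_numbered {
        my ($fh, $re, $threshold) = @_;
        my ($buf, $res) = ([], []);

        while (my $line = <$fh>) {
            if ($line =~ $re) {
                push @$buf, [$., $line];   # $. is the current line number of $fh
                next;
            }
            $res = $buf if @$buf >= $threshold;
            $buf = [];
        }
        $res = $buf if @$buf >= $threshold;   # a run may end at end of input

        return @$res;
    }

    # Sample data shaped like the question: KEYWORD on lines 2,3,6,8,9,10,12,13,15
    my %has_kw = map { $_ => 1 } 2, 3, 6, 8, 9, 10, 12, 13, 15;
    my $data = join '', map { $has_kw{$_} ? "foo KEYWORD bar\n" : "foo\n" } 1 .. 16;

    open my $in, '<', \$data or die "Can't open in-memory handle: $!";
    print "$_->[0]: $_->[1]" for last_run_numbered($in, qr/KEYWORD/, 2);
    ```

    When <> reads several files, $. keeps counting across them unless ARGV is closed at the end of each file (close ARGV if eof), so per-file numbers need that extra step.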

    If a file can be large -- and this is the only thing to be done with it -- we can use the same code but going backwards, from the end of the file. A module for that is File::ReadBackwards.


    To illustrate the performance gain, here is a program that does the same by reading the file backwards:

    use warnings;
    use strict;
    use feature 'say';
    
    use File::ReadBackwards;
    
    my (@buf, $cnt, @res);
    my $threshold = 2;
    
    my $bw = File::ReadBackwards->new(shift) or die $!;     
    #print $bw->readline until $bw->eof; exit;  # test
    
    # Test with defined: a last line of "0" with no newline is false but valid
    while ( defined(my $line = $bw->readline) ) {
        if (not $line =~ /KEYWORD/) {    
            last if @res >= $threshold;
            $cnt = 0  if $cnt;
            @buf = () if @buf;
            next 
        }   
        ++$cnt;
    
        if ($cnt  < $threshold) { 
            push @buf, $line;
        }   
        elsif ($cnt == $threshold) { 
            @res = (@buf, $line);
        }   
        else { 
            push @res, $line;
        }
    }    
    print for reverse @res;
    

    This produces the same output as the program that reads from the beginning.

    To test, I append the sample file to itself 200k times, for a file of 111 MB in size. The first program (adjusted for performance as in notes) takes ~1.85 sec on it (average over a few runs), while the one above runs in 0.02 sec.

    So the saving is substantial for large enough files, and the small overhead of reading from the back doesn't show at all. However, no other processing can be done along the way, as the front of the file is never read. Also, the target must be seekable (a file), and very few operations are supported; for one, we don't get line numbers.


    These timings are for the whole program, startup and all, measured with time on the command line as the program is invoked, and averaged over a few runs.

    When I time just the code itself, using Time::HiRes, the runtimes to process the file are