Tags: bash, awk, tac

Faster way of Awking a file from End to Beginning?


I want to get results starting at the bottom of a file and working my way up to the beginning. I tried using tac and piping it into my awk command, but it's very slow (15 seconds for a 2 GB file, compared to 3 seconds searching the same file normally). I'm also piping the awk output into tail -n +1 | head -n 50 to stop after 50 results.

Is there a faster way to tac a file, or at least a way to start searching from the bottom up?

The big picture is to create a Python script that takes arguments (start date, end date, search terms) and uses them to search through a date-organized log file, returning 50 results at a time.

I need to read from end to beginning in case a user wants the results in descending order (newest date to oldest date).
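For context, here is a minimal sketch of the interface I have in mind for that script (the argument names are illustrative, nothing is settled):

    import argparse

    # Hypothetical CLI for the planned script; all names are illustrative.
    parser = argparse.ArgumentParser(description="Search date-organized logs")
    parser.add_argument("start")             # e.g. 2018-03-04T03:45:00
    parser.add_argument("stop")              # e.g. 2018-03-05T16:24:59
    parser.add_argument("terms", nargs="+")  # one or more search terms
    parser.add_argument("--descending", action="store_true",
                        help="return newest matches first")
    parser.add_argument("--limit", type=int, default=50,
                        help="number of results per page")
    args = parser.parse_args()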

An example command for ascending results (oldest to newest) is below. (I'm using find because the filename is a user-specified argument, so it can potentially reference all files (*.txt).)

    find '/home/logs/' -type f -name 'log_file.txt' -exec cat {} \+ 2>&1 |
        LC_ALL=C fgrep -i 'Potato' |
        LC_ALL=C awk -v start="2018-03-04T03:45:00" -v stop="2018-03-05T16:24:59" '
            BEGIN { IGNORECASE = 1 }   # gawk: make ~ case-insensitive
            {
                line = $0
                xz = " "
                for (i = 4; i <= NF; i++) { xz = xz " " $i }
            }
            ($1 >= start && $1 <= stop) && (tolower(xz) ~ /potato/) { print line }
        ' | tail -n +1 | head -n 50

The tail -n +1 | head -n 50 is there to return the first 50 matches (head -n 50 does the limiting; the tail -n +1 is effectively a pass-through).

This command takes about 3-4 seconds to find results; however, if I substitute in tac, it takes closer to 20 seconds.


Solution

  • Much faster to open the file and seek to a point shortly before the end of the file. Perl is handy here:

    perl -Mautodie -se '
        $size = -s $file;
        $blocksize = 64000;
        # do not seek before the start of a small file
        $blocksize = $size if $blocksize > $size;
        open $fh, "<", $file;
        # jump to $blocksize bytes before the end and read the final block
        seek $fh, $size - $blocksize, 0;
        read $fh, $data, $blocksize;
        @lines = split "\n", $data;
        # last 50 lines, printed last line first
        print join "\n", reverse @lines[-50..-1];
    ' -- -file="filename"
    

    We can throw a loop in there so that after it reads the last block, it seeks to the end minus two blocks and reads another block, and so on, carrying any partial line at the start of a block over to the next (earlier) read; a Python sketch of this follows below.

    But if you want to process the entire gigantic file from bottom to top, you'll have to expect it to take time.
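
    Since the big picture is a Python script, here is a minimal sketch of that loop in Python (the function name and block size are my own illustrative choices, not a fixed recipe). It seeks backward one block at a time and carries the partial line at the start of each block over to the next, earlier, read:

    import itertools
    import os

    def read_lines_backward(path, blocksize=64000):
        """Yield the lines of a file from last to first."""
        with open(path, "rb") as fh:
            fh.seek(0, os.SEEK_END)
            pos = fh.tell()
            # ignore a trailing newline so we do not yield a phantom empty line
            if pos > 0:
                fh.seek(pos - 1)
                if fh.read(1) == b"\n":
                    pos -= 1
            carry = b""
            while pos > 0:
                step = min(blocksize, pos)
                pos -= step
                fh.seek(pos)
                data = fh.read(step) + carry
                lines = data.split(b"\n")
                # the first chunk may be a partial line that is completed
                # by the next (earlier) block, so hold it back
                carry = lines.pop(0)
                for line in reversed(lines):
                    yield line.decode("utf-8", "replace")
            if carry:
                yield carry.decode("utf-8", "replace")

    # e.g. the first 50 case-insensitive matches, newest first
    matches = (l for l in read_lines_backward("log_file.txt")
               if "potato" in l.lower())
    for line in itertools.islice(matches, 50):
        print(line)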