bash, awk, geohashing

How to use a general list of conditions in awk filtering?


I use awk to filter geohash data in a CSV file MYFILE, where the first column is the geohash.

cat $MYFILE | awk -F ',' '{ if ($1 ~ /^f20/ || $1 ~ /^f28/) print $0 }' > extractedFile

The filtering is smooth and fast.

But now the conditions f20 and f28 can be anything, given as input in a text file.

I have written this bash loop, but it is very slow:

ZONE=myZones.txt
IFS=$'\n' read -d '' -r -a ZONES < "$ZONE"
NZONES=${#ZONES[@]}

cat "$MYFILE" | while read -r line
do
    # pull the geohash column for this record
    geohash=$(echo "$line" | cut -d ',' -f 16)

    for zonehash in "${ZONES[@]}"
    do
        if [[ $geohash == "$zonehash"* ]]
        then
            echo "$line" >> "$MYOUTPUTFILE"
        else
            continue
        fi
    done
done

How can I adapt the awk command so that the part marked XXXX is built from the input list of hashes?

cat $MYFILE | awk -F ',' '{if( XXXX ) print $0 }' > extractedFile

Edited after comments:

Given the following input files:

MYFILE.csv:

1,someData,f244qeb4qhz1 2,someData,f2hg1rqq6hh6 3,someData,f244qeb1z2nv

ZONES.txt:

f24

awk 'NR==FNR{ZONES[$0]=$0; next} $1 in ZONES {print $0}' ZONES.txt MYFILE.csv


Solution

  • A simplified version of your bash logic in awk -

    awk 'NR==FNR{ZONES[$0]=$0; next} $1 in ZONES' zones file
    

    zones is the file of lookup values to load into the table.
    NR==FNR checks to see if the Number of the Record (overall) is the same as the Number of the Record from this File - in other words, if this is the first file.
    If so, load it into the lookup table ZONES and move on.

    If NR is not the same as FNR then it's not the first file any more, so the lookup table is loaded and we're processing data.
    $1 in ZONES asks whether field 1 is a key in the lookup table, and if it is, the line is printed - with no action block given, awk's default action (print the record) applies.
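
    Note that in is an exact key lookup, so on its own it won't reproduce the ^f20 / ^f28 prefix test from your original command. One way to build that condition dynamically - just a sketch reusing your file and variable names, and assuming the zones file contains only plain geohash prefixes with no blank lines - is to join the zones into one regex while loading them:

    awk -F ',' 'NR==FNR { re = (re == "" ? "" : re "|") $0; next }   # builds f20|f28|... from the zones file
                $1 ~ ("^(" re ")")' myZones.txt "$MYFILE" > extractedFile

    Here re ends up as f20|f28|..., so the second block runs the same anchored prefix match you started with, assembled at run time, and it handles prefixes of different lengths.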


    #edit

    Looking at your added data above, I made a slightly larger file with lines that match f20, f24, and f28.

    zones:

    $: cat zones
    a
    f20
    b
    f24
    c
    f28
    

    MYFILE.csv:

    $: cat MYFILE.csv
    1,someData,f194qeb4qhz1 2,someData,f2hg1rqq6hh6 3,someData,f194qeb1z2nv
    1,someData,f194qeb4qhz1 2,someData,f2hg1rqq6hh6 3,someData,f204qeb1z2nv
    1,someData,f194qeb4qhz1 2,someData,f2hg1rqq6hh6 3,someData,f194qeb1z2nv
    1,someData,f194qeb4qhz1 2,someData,f2hg1rqq6hh6 3,someData,f244qeb1z2nv
    1,someData,f194qeb4qhz1 2,someData,f2hg1rqq6hh6 3,someData,f194qeb1z2nv
    1,someData,f194qeb4qhz1 2,someData,f2hg1rqq6hh6 3,someData,f284qeb1z2nv
    1,someData,f194qeb4qhz1 2,someData,f2hg1rqq6hh6 3,someData,f194qeb1z2nv
    
    
    $: awk -F, 'NR==FNR{ZONES[$0]=$0; next} substr($7,1,3) in ZONES' zones MYFILE.csv
    1,someData,f194qeb4qhz1 2,someData,f2hg1rqq6hh6 3,someData,f204qeb1z2nv
    1,someData,f194qeb4qhz1 2,someData,f2hg1rqq6hh6 3,someData,f244qeb1z2nv
    1,someData,f194qeb4qhz1 2,someData,f2hg1rqq6hh6 3,someData,f284qeb1z2nv
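
    That works as long as the zone prefixes you want to match are all the same length (three characters here), so the fixed-width substr() lines up with the keys. If the prefixes can have different lengths, a variation - again just a sketch, not something I've run against your real data - is to loop over the loaded zones and test each one as a prefix with index():

    awk -F, 'NR==FNR { ZONES[$0] = $0; next }
             { for (z in ZONES) if (index($7, z) == 1) { print; next } }' zones MYFILE.csv

    That scans every zone for every line instead of doing one hash lookup, but it all still happens inside a single awk process, which is where the big speed win over the per-line shell loop comes from.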
    

    Hope that helps.