I use awk to filter geohash data in a CSV file MYFILE, where the first column is the geohash.

cat "$MYFILE" | awk -F ',' '{ if ($1 ~ /^f20/ || $1 ~ /^f28/) print $0 }' > extractedFile

This filtering is fast. But now the prefixes f20 and f28 can be anything, supplied as input in a text file. I have written this bash loop, but it is very slow:
ZONE=myZones.txt
IFS=$'\n' read -d '' -r -a ZONES < "$ZONE"
NZONES=${#ZONES[@]}
while read -r line
do
    geohash=$(echo "$line" | cut -d ',' -f 16)
    for zonehash in "${ZONES[@]}"
    do
        if [[ $geohash == "$zonehash"* ]]
        then
            echo "$line" >> "$MYOUTPUTFILE"
            break
        fi
    done
done < "$MYFILE"
How can I adapt the awk command so that the part marked XXXX is built according to the input list of hashes?
cat "$MYFILE" | awk -F ',' '{ if ( XXXX ) print $0 }' > extractedFile
Edited after comments:
Given the following input files:

MYFILE.csv:
1,someData,f244qeb4qhz1 2,someData,f2hg1rqq6hh6 3,someData,f244qeb1z2nv

ZONES.txt:
f24
awk 'NR==FNR{ZONES[$0]=$0; next} $1 in ZONES {print $0}' ZONES.txt MYFILE.csv
A simplified version of your bash logic in awk:

awk 'NR==FNR{ZONES[$0]=$0; next} $1 in ZONES' zones file
zones is the file holding the lookup values to load.
NR==FNR checks whether the Number of the Record (overall) equals the Number of the Record from this File; in other words, whether this is still the first file. If so, the line is loaded into the lookup table ZONES and we move on to the next record. Once NR is no longer equal to FNR, we are past the first file, so the lookup table is fully loaded and we are processing the data.
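To make the NR==FNR idiom concrete, here is a minimal sketch with two hypothetical throwaway files (first.txt and second.txt are just illustration names) showing which rule fires for records from which file:

```shell
# Two tiny hypothetical files to demonstrate the idiom
printf 'x\ny\n' > first.txt
printf '1\n2\n' > second.txt

# While awk reads the first file, NR (records seen overall) equals FNR
# (records seen in the current file); "next" keeps those lines away
# from the second rule.
awk 'NR==FNR { print "first:", $0; next } { print "second:", $0 }' first.txt second.txt
```

This prints "first:" for every line of first.txt and "second:" for every line of second.txt, which is exactly the load-then-process split the lookup-table version relies on.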
$1 in ZONES asks whether field 1 is a key in the lookup table; if it is, awk runs whatever is in the curlies, or, when no action block is given, the default action of printing the record.
Looking at your added data above, I made a slightly larger file with lines that match f20, f24, and f28.
zones:
$: cat zones
a
f20
b
f24
c
f28
MYFILE.csv:
$: cat MYFILE.csv
1,someData,f194qeb4qhz1 2,someData,f2hg1rqq6hh6 3,someData,f194qeb1z2nv
1,someData,f194qeb4qhz1 2,someData,f2hg1rqq6hh6 3,someData,f204qeb1z2nv
1,someData,f194qeb4qhz1 2,someData,f2hg1rqq6hh6 3,someData,f194qeb1z2nv
1,someData,f194qeb4qhz1 2,someData,f2hg1rqq6hh6 3,someData,f244qeb1z2nv
1,someData,f194qeb4qhz1 2,someData,f2hg1rqq6hh6 3,someData,f194qeb1z2nv
1,someData,f194qeb4qhz1 2,someData,f2hg1rqq6hh6 3,someData,f284qeb1z2nv
1,someData,f194qeb4qhz1 2,someData,f2hg1rqq6hh6 3,someData,f194qeb1z2nv
$: awk -F, 'NR==FNR{ZONES[$0]=$0; next} substr($7,1,3) in ZONES' zones MYFILE.csv
1,someData,f194qeb4qhz1 2,someData,f2hg1rqq6hh6 3,someData,f204qeb1z2nv
1,someData,f194qeb4qhz1 2,someData,f2hg1rqq6hh6 3,someData,f244qeb1z2nv
1,someData,f194qeb4qhz1 2,someData,f2hg1rqq6hh6 3,someData,f284qeb1z2nv
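One caveat: the substr-based lookup above assumes every zone is exactly three characters long. If the prefixes can vary in length, a minimal sketch that tests each zone with index() instead (assuming, as in the original question, that the geohash is in the first comma-separated field; the sample file contents here are hypothetical):

```shell
# Hypothetical sample files for illustration
printf 'a\nf20\nb\nf24\nc\nf28\n' > zones
printf 'f244qeb4qhz1,someData\nf2hg1rqq6hh6,someData\nf284qeb1z2nv,someData\n' > MYFILE.csv

# Print a record if any zone is a prefix of field 1 (index() == 1 means
# the match starts at the first character); "next" avoids printing a
# record twice when more than one zone matches.
awk -F',' 'NR==FNR { zones[$0]; next }
           { for (z in zones) if (index($1, z) == 1) { print; next } }' zones MYFILE.csv
```

This pays a small per-record cost for the loop over zones, but it stays far faster than the per-line shell loop because everything runs inside a single awk process.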
Hope that helps.