
Using bash to query a large tab delimited file


I have a list of names and IDs (50 entries)

cat input.txt

name    ID
Mike    2000
Mike    20003
Mike    20002

And there is a huge gzipped file (13 GB):

zcat clients.gz

name    ID  comment
Mike    2000    foo
Mike    20002   bar
Josh    2000    cake
Josh    20002   _

My expected output is

NR  name    ID  comment
1    Mike   2000    foo
3    Mike   20002   bar

Each name/ID pair ($1"\t"$2) in clients.gz is a unique identifier. Some entries from input.txt may be missing from clients.gz, so I would like to add an NR column to my output to find out which ones are missing. I would like to use zgrep, because awk takes a very long time (I assume because I have to zcat the file to decompress it first?).

I know that zgrep 'Mike\t2000' does not work. I imagine I can fix the NR issue with awk's FNR.
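On the `\t` point: grep's basic and extended regular expressions do not treat `\t` as a tab, so the pattern has to contain a literal tab character. Below is a minimal sketch of that, using a small made-up sample file in place of the real clients.gz. The variable names `tmp`, `tab`, and `match` are only for illustration.

```shell
# Build a tiny stand-in for clients.gz (hypothetical sample data).
tmp=$(mktemp -d)
printf 'name\tID\tcomment\nMike\t2000\tfoo\nMike\t20002\tbar\nJosh\t2000\tcake\n' |
    gzip > "$tmp/clients.gz"

# Put a real tab into the pattern. In bash, $'Mike\t2000' does the same thing.
tab=$(printf '\t')

# -w stops "Mike<TAB>2000" from also matching "Mike<TAB>20002".
match=$(zgrep -w "Mike${tab}2000" "$tmp/clients.gz")
printf '%s\n' "$match"
```

This only fixes the pattern. Running one zgrep per entry would still decompress the whole file once for each of the 50 entries.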

So far I have:

awk -v q="'" '
NR > 1 {
    print "zcat clients.gz | zgrep -w $" q $0 q
}' input.txt |
bash > subset.txt

Solution

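  • $ cat tst.awk
    # Fields are tab-separated on input and output.
    BEGIN { FS=OFS="\t" }
    # The lookup key is the name/ID pair.
    { key = $1 FS $2 }
    # First file (input.txt): store each key's row number; the header row maps to the label "NR".
    NR == FNR { map[key] = (NR>1 ? NR-1 : "NR"); next }
    # Second input (clients.gz on stdin): print matching rows, prefixed with that number.
    key in map { print map[key], $0 }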
    
    $ zcat clients.gz | awk -f tst.awk input.txt -
    NR      name    ID      comment
    1       Mike    2000    foo
    3       Mike    20002   bar
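Since the goal is to find which entries of input.txt are missing, the same single-pass idea can be turned around to print only the entries that never show up in clients.gz. This is a sketch under assumptions, not part of the script above: the `seen` array and the `END` block are additions, and the sample files are rebuilt from the data in the question so the example runs on its own.

```shell
# Recreate the question's sample data in a scratch directory.
tmp=$(mktemp -d)
printf 'name\tID\nMike\t2000\nMike\t20003\nMike\t20002\n' > "$tmp/input.txt"
printf 'name\tID\tcomment\nMike\t2000\tfoo\nMike\t20002\tbar\nJosh\t2000\tcake\nJosh\t20002\t_\n' |
    gzip > "$tmp/clients.gz"

# gzip -dc is used here in place of zcat; both write the decompressed data to stdout.
missing=$(gzip -dc "$tmp/clients.gz" | awk '
    BEGIN { FS = OFS = "\t" }
    { key = $1 FS $2 }
    # First file: store the row number of every non-header entry.
    NR == FNR { if (NR > 1) map[key] = NR - 1; next }
    # Second input: note which wanted keys actually occur.
    key in map { seen[key] = 1 }
    # After the whole stream is read, report wanted keys that never occurred.
    END {
        for (key in map)
            if (!(key in seen)) print map[key], key
    }
' "$tmp/input.txt" -)
printf '%s\n' "$missing"
```

The order of a `for (key in map)` loop is unspecified in awk, so with more than one missing entry the result may need to be piped through `sort -n`. Like the script above, this reads the compressed file only once, however many entries input.txt holds.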