Assume a tab-separated file as input file 1 ($IN1
):
$ cat input1.tsv
HEAD1 HEAD2 HEAD3 HEAD4 HEAD5 HEAD6
Qux ZX_999876 Bar Foo MN111111 Quux
Foo AB_123456 Bar Baz CD789123 Qux
Bar AC_456321 Baz Qux GF333444 Foo
Foo CD789123 Qux Baz GH987124 Qux
Further assume a csv-separated file as input file 2 ($IN2
):
$ cat input2.csv
AB_123456,CD789123
ZX_999876,MN111111
Input file 2 specifies which lines of input file 1 need to be deleted. Specifically, if column 1 entry of input file 2 is equal to the column 2 entry of input file 1, then all lines of input file 1 in which column 2 entry of input file 2 is equal to the column 2 entry of input file 1 are to be removed.
Desired output:
$ cat input1_filtered.tsv
HEAD1 HEAD2 HEAD3 HEAD4 HEAD5 HEAD6
Qux ZX_999876 Bar Foo MN111111 Quux
Foo AB_123456 Bar Baz CD789123 Qux
Bar AC_456321 Baz Qux GF333444 Foo
I wish to achieve the desired output via Bash. My (so far insufficient) attempt:
while IFS=, read -r CL1 CL2
do
if awk -F"\t" '$2=="$CL1"' $IN1
then
grep -vPw "(?<=\t)$CL2" $IN1 > tmp.txt
mv tmp.txt $IN1
fi
done < $IN2
This uses GNU awk specifically:
gawk '
BEGINFILE { FS = ARGIND == 1 ? "," : "\t" }
ARGIND == 1 { is_present[$1] = $2; next }
ARGIND == 2 { if ($2 in is_present) del[is_present[$2]] = 1; next }
!($2 in del)
' input2.csv input1.tsv input1.tsv
outputs
HEAD1 HEAD2 HEAD3 HEAD4 HEAD5 HEAD6
Qux ZX_999876 Bar Foo MN111111 Quux
Foo AB_123456 Bar Baz CD789123 Qux
Bar AC_456321 Baz Qux GF333444 Foo