Tags: bash, awk, duplicates, unix-text-processing

Remove duplicates ignoring specific columns


I want to remove all duplicate lines from a file while ignoring the first two columns, i.e. those columns should not be compared.

This is my example input:

111  06:22  apples, bananas and pears
112  06:28  bananas
113  07:07  apples, bananas and pears
114  07:23  apples and bananas
115  08:01  bananas and pears
116  08:23  pears
117  09:22  apples, bananas and pears
118  12:23  apples and bananas

I want this output:

111  06:22  apples, bananas and pears
112  06:28  bananas
114  07:23  apples and bananas
115  08:01  bananas and pears
116  08:23  pears

I've tried the command below, but it only compares the third column and ignores the rest of the line:

awk '!seen[$3]++' sample.txt

Solution

  • Store $0 in a temporary variable, set $1 and $2 to empty strings, then use the newly recomposed $0 as the key:

    awk '{ t = $0; $1 = $2 = "" } !seen[$0]++ { print t }' sample.txt
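    For readability, here is an equivalent expanded form of the same one-liner (a sketch assuming the same whitespace-separated sample.txt). Assigning empty strings to $1 and $2 makes awk rebuild $0, which then serves as the deduplication key, while the saved copy t is what actually gets printed:

    awk '{
        t = $0              # keep the original line for printing
        $1 = $2 = ""        # blank the first two fields; this rebuilds $0
        if (!seen[$0]++)    # first time this "rest of line" key appears?
            print t         # print the untouched original line
    }' sample.txt

    Printing the saved copy t (rather than the modified $0) is what preserves the original columns in the output.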