awk

Exclude lines with duplicate values in awk


I have a tsv file like this:

chr1    28932   29543   chr1    29159   29422   RNAPOLII_T1_pos_1_q05_peak_1    114 .   5.55679 14.5827 11.4511 119
chr1    199425  200055  .   -1  -1  .   .   .   .   .   .   .
chr1    206917  207235  .   -1  -1  .   .   .   .   .   .   .
chr1    629342  630035  chr1    629392  629981  RNAPOLII_T1_pos_1_q05_peak_2    89  .   1.53473 11.9814 8.95881 434
chr1    630824  631475  chr1    630904  631286  RNAPOLII_T1_pos_1_q05_peak_3    110 .   1.66136 14.1185 11.0065 34
chr1    631947  632282  .   -1  -1  .   .   .   .   .   .   .
chr1    632546  632864  chr1    632596  632814  RNAPOLII_T1_pos_1_q05_peak_4    53  .   1.45791 8.17161 5.34813 45
chr1    633792  634430  chr1    634016  634206  RNAPOLII_T1_pos_1_q05_peak_5    42  .   1.40136 6.99814 4.24691 25
chr1    634453  634840  chr1    634503  634790  RNAPOLII_T1_pos_1_q05_peak_6    68  .   1.68267 9.80195 6.88384 32
chr1    778082  779111  chr1    778407  778997  RNAPOLII_T1_pos_1_q05_peak_7    290 .   8.3336  32.7328 29.0707 207
chr1    827049  827851  chr1    827150  827773  RNAPOLII_T1_pos_1_q05_peak_8    43  .   3.42454 7.13586 4.37707 251
chr1    941573  941926  chr1    941623  941876  RNAPOLII_T1_pos_1_q05_peak_9    48  .   3.83227 7.61827 4.82768 136
chr1    989375  989734  .   -1  -1  .   .   .   .   .   .   .
chr1    990673  991342  .   -1  -1  .   .   .   .   .   .   .
chr1    991736  992432  chr1    991990  992382  RNAPOLII_T1_pos_1_q05_peak_10   58  .   4.33261 8.71042 5.8516  205
chr1    992407  994252  chr1    992698  993311  RNAPOLII_T1_pos_1_q05_peak_11   62  .   3.89152 9.08737 6.20787 479
chr1    992407  994252  chr1    993534  994152  RNAPOLII_T1_pos_1_q05_peak_12   60  .   3.39559 8.88015 6.01409 170
chr1    994237  998788  chr1    994346  998738  RNAPOLII_T1_pos_1_q05_peak_13   633 .   13.9139 67.4929 63.32   2194
chr1    998775  1002233 chr1    998825  1002089 RNAPOLII_T1_pos_1_q05_peak_14   850 .   19.1217 89.4139 85.0549 1234
chr1    1004118 1004538 .   -1  -1  .   .   .   .   .   .   .
chr1    1005008 1006499 chr1    1005058 1005522 RNAPOLII_T1_pos_1_q05_peak_15   55  .   4.46653 8.38165 5.54531 345
chr1    1019994 1020390 .   -1  -1  .   .   .   .   .   .   .
chr1    1020344 1020662 .   -1  -1  .   .   .   .   .   .   .
chr1    1078905 1080785 chr1    1079111 1079300 RNAPOLII_T1_pos_1_q05_peak_16   48  .   3.07279 7.6091  4.82217 93
chr1    1078905 1080785 chr1    1079358 1079899 RNAPOLII_T1_pos_1_q05_peak_17   90  .   4.56426 12.0559 9.03203 158
chr1    1157419 1158008 chr1    1157469 1157958 RNAPOLII_T1_pos_1_q05_peak_18   113 .   5.84903 14.4751 11.3505 128
chr1    1216203 1216549 .   -1  -1  .   .   .   .   .   .   .
chr1    1216526 1216931 .   -1  -1  .   .   .   .   .   .   .
chr1    1231559 1232418 chr1    1231766 1232368 RNAPOLII_T1_pos_1_q05_peak_19   175 .   7.74351 20.8689 17.5159 180
chr1    1248702 1249624 .   -1  -1  .   .   .   .   .   .   .

I want to use awk to select the data in column number 5, but only for unique values in column 2. For example, at lines 16/17, the value 992407 is repeated. I only want to keep the first value in col 5 for these coordinates, 992698. Any duplicates should be immediately one after the other, so I wrote this awk line to filter the file:

awk 'BEGIN {$prev=-1} { if($2 != $prev){ print $5; $prev=$2 }}'

which to me should exclude any lines where a value in column 2 is identical to the value in column 2 found at the line just before. However, no lines are filtered when I apply this. What am I missing?


Solution

  • In GNU AWK, $i means the i-th field of the current record. A variable that was never set evaluates to 0 in a numeric context, so with prev unset, $prev is $0. Therefore

    awk 'BEGIN {$prev=-1} { if($2 != $prev){ print $5; $prev=$2 }}'
    

    is the same as

    awk 'BEGIN {$0=-1} { if($2 != $0){ print $5; $0=$2 }}'
    

    The condition inside the if is always true for your data (or any other multi-column data), because $0 denotes the whole line, which can never equal $2 alone. That is why no lines are filtered.

    You should assign to and compare against the variable prev (without the $). You can also use the condition as a pattern instead of an if, that is

    awk 'BEGIN {prev=-1}($2 != prev){ print $5; prev=$2 }'
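
To see the difference, here is a minimal sketch that runs both versions on three rows cut down from the question's data (only the first five columns are kept, and the shell variable names sample, buggy and fixed are made up for this illustration). The last two rows share the column-2 value 992407.

```shell
# Three tab-separated rows; the last two share the column-2 value 992407.
sample=$(printf 'chr1\t991736\t992432\tchr1\t991990\nchr1\t992407\t994252\tchr1\t992698\nchr1\t992407\t994252\tchr1\t993534\n')

# Original command: prev is never set, so $prev is $0 and every row passes.
buggy=$(printf '%s\n' "$sample" | awk 'BEGIN {$prev=-1} { if($2 != $prev){ print $5; $prev=$2 }}')

# Corrected command: prev is an ordinary variable holding the previous row's column 2.
fixed=$(printf '%s\n' "$sample" | awk 'BEGIN {prev=-1} ($2 != prev){ print $5; prev=$2 }')

printf 'buggy:\n%s\nfixed:\n%s\n' "$buggy" "$fixed"
```

The original command prints all three column-5 values (991990, 992698, 993534), while the corrected one prints only 991990 and 992698, dropping the second row that has 992407 in column 2.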