I have a tsv file like this:
chr1 28932 29543 chr1 29159 29422 RNAPOLII_T1_pos_1_q05_peak_1 114 . 5.55679 14.5827 11.4511 119
chr1 199425 200055 . -1 -1 . . . . . . .
chr1 206917 207235 . -1 -1 . . . . . . .
chr1 629342 630035 chr1 629392 629981 RNAPOLII_T1_pos_1_q05_peak_2 89 . 1.53473 11.9814 8.95881 434
chr1 630824 631475 chr1 630904 631286 RNAPOLII_T1_pos_1_q05_peak_3 110 . 1.66136 14.1185 11.0065 34
chr1 631947 632282 . -1 -1 . . . . . . .
chr1 632546 632864 chr1 632596 632814 RNAPOLII_T1_pos_1_q05_peak_4 53 . 1.45791 8.17161 5.34813 45
chr1 633792 634430 chr1 634016 634206 RNAPOLII_T1_pos_1_q05_peak_5 42 . 1.40136 6.99814 4.24691 25
chr1 634453 634840 chr1 634503 634790 RNAPOLII_T1_pos_1_q05_peak_6 68 . 1.68267 9.80195 6.88384 32
chr1 778082 779111 chr1 778407 778997 RNAPOLII_T1_pos_1_q05_peak_7 290 . 8.3336 32.7328 29.0707 207
chr1 827049 827851 chr1 827150 827773 RNAPOLII_T1_pos_1_q05_peak_8 43 . 3.42454 7.13586 4.37707 251
chr1 941573 941926 chr1 941623 941876 RNAPOLII_T1_pos_1_q05_peak_9 48 . 3.83227 7.61827 4.82768 136
chr1 989375 989734 . -1 -1 . . . . . . .
chr1 990673 991342 . -1 -1 . . . . . . .
chr1 991736 992432 chr1 991990 992382 RNAPOLII_T1_pos_1_q05_peak_10 58 . 4.33261 8.71042 5.8516 205
chr1 992407 994252 chr1 992698 993311 RNAPOLII_T1_pos_1_q05_peak_11 62 . 3.89152 9.08737 6.20787 479
chr1 992407 994252 chr1 993534 994152 RNAPOLII_T1_pos_1_q05_peak_12 60 . 3.39559 8.88015 6.01409 170
chr1 994237 998788 chr1 994346 998738 RNAPOLII_T1_pos_1_q05_peak_13 633 . 13.9139 67.4929 63.32 2194
chr1 998775 1002233 chr1 998825 1002089 RNAPOLII_T1_pos_1_q05_peak_14 850 . 19.1217 89.4139 85.0549 1234
chr1 1004118 1004538 . -1 -1 . . . . . . .
chr1 1005008 1006499 chr1 1005058 1005522 RNAPOLII_T1_pos_1_q05_peak_15 55 . 4.46653 8.38165 5.54531 345
chr1 1019994 1020390 . -1 -1 . . . . . . .
chr1 1020344 1020662 . -1 -1 . . . . . . .
chr1 1078905 1080785 chr1 1079111 1079300 RNAPOLII_T1_pos_1_q05_peak_16 48 . 3.07279 7.6091 4.82217 93
chr1 1078905 1080785 chr1 1079358 1079899 RNAPOLII_T1_pos_1_q05_peak_17 90 . 4.56426 12.0559 9.03203 158
chr1 1157419 1158008 chr1 1157469 1157958 RNAPOLII_T1_pos_1_q05_peak_18 113 . 5.84903 14.4751 11.3505 128
chr1 1216203 1216549 . -1 -1 . . . . . . .
chr1 1216526 1216931 . -1 -1 . . . . . . .
chr1 1231559 1232418 chr1 1231766 1232368 RNAPOLII_T1_pos_1_q05_peak_19 175 . 7.74351 20.8689 17.5159 180
chr1 1248702 1249624 . -1 -1 . . . . . . .
I want to use awk to select the data in column number 5, but only for unique values in column 2. For example, at lines 16/17, the value 992407 is repeated. I only want to keep the first value in col 5 for these coordinates, 992698. Any duplicates should be immediately one after the other, so I wrote this awk line to filter the file:
awk 'BEGIN {$prev=-1} { if($2 != $prev){ print $5; $prev=$2 }}'
which to me should exclude any lines where a value in column 2 is identical to the value in column 2 found at the line just before. However, no lines are filtered when I apply this. What am I missing?
In GNU AWK, $i means the i-th field of the current record. If i was never set, it is treated as 0, therefore
awk 'BEGIN {$prev=-1} { if($2 != $prev){ print $5; $prev=$2 }}'
is the same as doing
awk 'BEGIN {$0=-1} { if($2 != $0){ print $5; $0=$2 }}'
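The condition inside if will always hold for your data (or any other multi-column data), because $0 denotes the whole line in GNU AWK and so never equals $2. The assignment $0=$2 does not help either, since $0 is overwritten as soon as the next line is read. As a result, every line is printed.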
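A quick way to check this behaviour yourself, as a minimal sketch: the variable name unset_var and the three-word input are made up for illustration. Because unset_var was never assigned, $unset_var is evaluated as $0 and the whole line comes back.

```shell
# unset_var is a hypothetical, never-assigned variable, so awk treats it as 0
# and $unset_var becomes $0 (the whole input line).
out=$(echo 'a b c' | awk '{ print $unset_var }')
echo "$out"   # prints: a b c
```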
You should assign to and compare against a plain prev variable (without the $). You can also use the condition as a pattern, without if, that is
awk 'BEGIN {prev=-1}($2 != prev){ print $5; prev=$2 }'
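To compare the two versions side by side, here is a small sketch. It assumes a sample made from the first five columns of three rows of your file (the two rows starting at 992407 and the row after them); the shell variable names sample, buggy and fixed are only for illustration.

```shell
# Sample: first five columns of three rows from the question,
# two of which share the same column 2 value (992407).
sample=$(printf 'chr1\t992407\t994252\tchr1\t992698\nchr1\t992407\t994252\tchr1\t993534\nchr1\t994237\t998788\tchr1\t994346\n')

# Original command: $prev is really $0, so no line is filtered out.
buggy=$(printf '%s\n' "$sample" | awk 'BEGIN {$prev=-1} { if($2 != $prev){ print $5; $prev=$2 }}')

# Corrected command: prev is an ordinary variable holding the previous column 2 value.
fixed=$(printf '%s\n' "$sample" | awk 'BEGIN {prev=-1} ($2 != prev){ print $5; prev=$2 }')

printf 'buggy:\n%s\nfixed:\n%s\n' "$buggy" "$fixed"
```

With this sample, the original command prints all three column 5 values (992698, 993534, 994346), while the corrected one prints only 992698 and 994346. Note that this only removes duplicates that sit on consecutive lines, which matches what you described.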