I need some help with sed or awks.
How can i remove a line only if it is followed by a line that starts with the same character (in this case >)?
Example I have this:
>1_SRR1422294
ATCGTCAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAT
>2_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>5_SRR1422298
>5_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>6_SRR1422294
>6_SRR1422250
TGTTCATGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
>9_SRR1422294
GCGACTAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
I want to get this:
>1_SRR1422294
ATCGTCAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAT
>2_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>5_SRR1422294
CGTCAGACGTAGGGTTGCGCTCGTTGCGGGACTTAACCCAACATCTCACGACACGAGCTGACGACAGCCATGCAG
>6_SRR1422250
TGTTCATGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
>9_SRR1422294
GCGACTAGGTAGGGTTGCGCTCGTTGCGGGACTTAACCCACATCTCACGACACGAGCTGACGACAGCCATGCAGC
Note that not all the lines have the same numbers but they all have the same format, which is why I want to use regular expressions. If you could explain how to read the code you produce that would be really great.
Thank you so much!
If the whole file follows that pattern (some number of lines starting with >
, of which you want only the last, followed by a single line that should always be printed), you could use something like this:
awk '/^>/ { latest=$0 } !/^>/ { if (latest) { print latest; latest="" } print }'
If the line starts with >
, then it is remembered (stored in the variable latest
) but not printed. If the line doesn't start with >
, then it is printed, but only after first printing whatever was most recently stored in latest
.
The conditional means each printed >
line will appear only once, even if there are multiple non->
lines in a row. Since that doesn't happen in your sample data, you may not need the complication, and could use this simpler unconditional version:
awk '/^>/ { latest=$0 } !/^>/ { print latest; print }'