I have this input file:
>seq
GATGGATTCGGANNNNNNNNNNNNNNNGTTGTAGGGNNNNNNNNNNNNNNNNNNNNNNGATAGAGAGNN
>suq
AAHAHAH
And this command:
awk '{gsub(/[N]{5,}/,"\n")}1' f.fa
The current output:
>seq
GATGGATTCGGA
GTTGTAGGG
GATAGAGAGNN
>suq
AAHAHAH
Wherever 5 or more consecutive 'N's are found, the string is split onto a new line. The problem is that I want the output to look like this:
>seq
GATGGATTCGGA
>seq_1
GTTGTAGGG
>seq_2
GATAGAGAGNN
>suq
AAHAHAH
Before each line break, I want to add the '>' line that corresponds to the string, plus an increasing number (so that each '>' line is unique). I've been trying different approaches, but without success.
You have already done most of the work. Here are my additions:
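awk '$0~/^>/{prev=$0;}   # remember the most recent ">" header line
{gsub(/[N]{5,}/,"\n"prev"_INSERTNUMBER\n");   # split at each run of 5+ Ns, leaving a placeholder
# number the placeholders left to right, starting from 1 on each input line
for(counter=1;sub(/INSERTNUMBER/,counter++,$0)>0;){}}1' test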
which yields the desired output:
>seq
GATGGATTCGGA
>seq_1
GTTGTAGGG
>seq_2
GATAGAGAGNN
>suq
AAHAHAH
What have I added?
1. With $0~/^>/{prev=$0;} I store the content of the most recent line that started with >.
2. Then, I replace [N]{5,} with \nprev_INSERTNUMBER\n (i.e., \n>seq_INSERTNUMBER\n).
3. Finally, I replace each INSERTNUMBER in turn with the next number (1, 2, ...).
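One thing to be aware of: the counter in the command above starts again at 1 for every input line, so a sequence wrapped over several lines could end up with repeated suffixes. As a rough sketch of one way around that (not the only one), the line can be cut with split() and the counter kept per header instead of using a placeholder. The names split_on_n_runs, header, count and parts are made up for this illustration, and NNNNN+ is used as an equivalent of N{5,} on the assumption that some awk versions do not support interval expressions.

```shell
# Hypothetical helper name; reads FASTA from the given files, or from stdin if none are given.
split_on_n_runs() {
    awk '
    /^>/ { header = $0; count = 0; print; next }   # new record: remember header, reset counter
    {
        n = split($0, parts, /NNNNN+/)   # NNNNN+ matches the same runs as N{5,}
        print parts[1]                   # text before the first run keeps the original header
        for (i = 2; i <= n; i++) {
            print header "_" (++count)   # numbering continues across wrapped lines
            print parts[i]
        }
    }' "$@"
}

# Sample input from the question
split_on_n_runs <<'EOF'
>seq
GATGGATTCGGANNNNNNNNNNNNNNNGTTGTAGGGNNNNNNNNNNNNNNNNNNNNNNGATAGAGAGNN
>suq
AAHAHAH
EOF
```

For the sample input this should print the same eight lines as the desired output. A line that begins or ends with a run of Ns will produce an empty sequence line, much as the gsub version does.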