[SOLVED] gsub consecutive characters and keep previous line

I have this input file:

>seq
GATGGATTCGGANNNNNNNNNNNNNNNGTTGTAGGGNNNNNNNNNNNNNNNNNNNNNNGATAGAGAGNN
>suq
AAHAHAH

And this command:

awk '{gsub(/[N]{5,}/,"\n")}1' f.fa

The current output:

>seq
GATGGATTCGGA
GTTGTAGGG
GATAGAGAGNN
>suq
AAHAHAH

Every run of 5 or more consecutive 'N's is replaced by a line break, so the sequence is split across several lines. However, I want the output to look like this:

>seq
GATGGATTCGGA
>seq_1
GTTGTAGGG
>seq_2
GATAGAGAGNN
>suq
AAHAHAH

Each piece that follows a line break should be preceded by a copy of the '>' line the sequence belongs to, with an increasing number appended so that every '>' line stays unique. I've tried several approaches without success.

Solution

You have already done most of the work. Here are my additions:

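 awk '$0~/^>/{prev=$0;}    # remember the most recent ">" header line
      {gsub(/[N]{5,}/,"\n"prev"_INSERTNUMBER\n");    # split at each N run and add header + placeholder
       for(counter=1;sub(/INSERTNUMBER/,counter++,$0)>0;){}}    # number the placeholders 1, 2, ...
      1' f.fa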

which yields the desired output:

>seq
GATGGATTCGGA
>seq_1
GTTGTAGGG
>seq_2
GATAGAGAGNN
>suq
AAHAHAH

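What have I added?
1. With $0~/^>/{prev=$0;} I store the most recent line that starts with > in prev.
2. Then I replace every match of [N]{5,} with \nprev_INSERTNUMBER\n (here, \n>seq_INSERTNUMBER\n), so each split point gets a copy of the header followed by a placeholder.
3. Finally, the for loop replaces the INSERTNUMBER placeholders one at a time with 1, 2, ...: each sub() call replaces only the first remaining placeholder and returns 0 once none is left, which ends the loop. The counter restarts at 1 for every input line.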
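
As a possible alternative to the placeholder approach described in the steps above, here is a minimal sketch that cuts each sequence line with split() and prints the numbered header copies directly. The function name split_on_n_runs and the variable names header, parts and n are made up for illustration. The regex /NNNNN+/ stands in for [N]{5,} on the assumption that some awk versions (older mawk, or older gawk without --re-interval) do not support interval expressions.

```shell
# Hypothetical helper: reads FASTA from the files given as arguments, or from stdin.
split_on_n_runs() {
    awk '
    /^>/ { header = $0; print; next }      # remember and print the header line
    {
        n = split($0, parts, /NNNNN+/)     # cut the sequence at runs of 5 or more Ns
        print parts[1]                     # first piece keeps the original header
        for (i = 2; i <= n; i++) {
            print header "_" (i - 1)       # numbered copy of the header
            print parts[i]
        }
    }' "$@"
}

# Sample input from the question, piped in instead of read from f.fa.
printf '>seq\nGATGGATTCGGANNNNNNNNNNNNNNNGTTGTAGGGNNNNNNNNNNNNNNNNNNNNNNGATAGAGAGNN\n>suq\nAAHAHAH\n' | split_on_n_runs
```

A sequence line without a long N run yields a single piece, so it is printed unchanged. To process a file, run split_on_n_runs f.fa.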