awk

gsub consecutive characters and keep previous line


I have this input file:

>seq
GATGGATTCGGANNNNNNNNNNNNNNNGTTGTAGGGNNNNNNNNNNNNNNNNNNNNNNGATAGAGAGNN
>suq
AAHAHAH

And this command:

awk '{gsub(/[N]{5,}/,"\n")}1' f.fa

The current output:

>seq
GATGGATTCGGA
GTTGTAGGG
GATAGAGAGNN
>suq
AAHAHAH

If a run of five or more consecutive N characters is found, the sequence is split onto a new line at that point. The problem is that I want the output to look like this:

>seq
GATGGATTCGGA
>seq_1
GTTGTAGGG
>seq_2
GATAGAGAGNN
>suq
AAHAHAH

Before each line break, I want to add a '>' line made up of the header that the string belongs to plus an increasing number, so that each '>' line is unique. I've been trying different approaches, but without success.


Solution

  • You have already done most of the work. Here are my additions:

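     awk '$0~/^>/{prev=$0;}   # remember the most recent header line
          {gsub(/[N]{5,}/,"\n"prev"_INSERTNUMBER\n");   # mark each split point
           # number the placeholders left to right: 1, 2, ...
           for(counter=1;sub(/INSERTNUMBER/,counter++,$0)>0;){}}1' f.fa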
    

    which yields the desired output

    >seq
    GATGGATTCGGA
    >seq_1
    GTTGTAGGG
    >seq_2
    GATAGAGAGNN
    >suq
    AAHAHAH
    

    What have I added?
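    1. With $0~/^>/{prev=$0;} I store the most recent line that starts with > (the header) in prev.
    2. Then I replace every match of [N]{5,} with \nprev_INSERTNUMBER\n (here, \n>seq_INSERTNUMBER\n).
    3. Finally, the for loop calls sub() repeatedly, replacing one INSERTNUMBER at a time with the counter (1, 2, ...), until sub() returns 0 because no placeholder is left.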
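
As a variation on the steps above, here is a minimal sketch that skips the placeholder and splits each sequence line with split() instead. It is not the command from the answer. The names header, n, part and count are only illustrative, and the pattern /NNNNN+/ is used in place of [N]{5,} on the assumption that not every awk supports interval expressions. The first lines only recreate the sample input from the question as f.fa.

```shell
# Recreate the sample input from the question as f.fa
printf '%s\n' \
  '>seq' \
  'GATGGATTCGGANNNNNNNNNNNNNNNGTTGTAGGGNNNNNNNNNNNNNNNNNNNNNNGATAGAGAGNN' \
  '>suq' \
  'AAHAHAH' > f.fa

out=$(awk '
/^>/ { header = $0; n = 0; print; next }   # header line: remember it, reset the counter
{
    # split the sequence on every run of five or more N characters
    count = split($0, part, /NNNNN+/)
    for (i = 1; i <= count; i++) {
        if (i > 1) print header "_" (++n)  # numbered header before each later piece
        print part[i]
    }
}' f.fa)

printf '%s\n' "$out"
```

Because the counter is reset only when a new header line is read, the numbering would carry on across several sequence lines under the same header. The loop in the answer starts again at 1 on every line.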