rdataframesequence

Generate a regular numeric sequence by increments of 3 BUT considering gaps in the input


Say I have a sequence like myseq below. It is a DNA sequence, so each group of 3 letters make for one letter (aminoacid) in the accompanying myaa sequence. It has a length of 13aa.

I want to create a mydf dataframe with the start and end positions of each group of 3 letters in myseq. In the basic example below without gaps, I do this easily with seq().

myseq <- "CTACGTAGCTAGCTGGGGTACCGTTATTCAGCTAGCATG"
myaa <- "XYZWWXZZVVYZX"
st_pos <- seq(1, nchar(myseq)-2, 3)
en_pos <- seq(3, nchar(myseq), 3)
mydf <- data.frame(starts=st_pos, ends=en_pos, label=unlist(strsplit(myaa, "")))

This produces this dataframe, which is exactly what I want:

> mydf
   starts ends label
1       1    3     X
2       4    6     Y
3       7    9     Z
4      10   12     W
5      13   15     W
6      16   18     X
7      19   21     Z
8      22   24     Z
9      25   27     V
10     28   30     V
11     31   33     Y
12     34   36     Z
13     37   39     X

However, I am encountering examples with my real data where myseq contains gaps. In these cases I cannot rely on seq() because I need to account for the gaps for the start and end positions.

How should I go about it? I show you 2 cases below, and my expected mydf dataframe, for which I just hard coded the start and end positions.

#case 1 - gaps breaking groups of 3 letters in half
myseq <- "CTACGTAGCTAGCTGGGGTACCGTT---ATTC--AGCTAGCATG"
st_pos <- c(1,4,7,10,13,16,19,22,25,31,36,39,42)
en_pos <- c(3,6,9,12,15,18,21,24,30,35,38,41,44)
mydf <- data.frame(starts=st_pos, ends=en_pos, label=unlist(strsplit(myaa, "")))

#case 2 - gaps in between groups of 3 letters, and at the beginning of groups of 3 letters
myseq <- "CTACGTAGCTAGCTGGGGTACCGT---TAT--TCAGCTAGCATG"
st_pos <- c(1,4,7,10,13,16,19,22,28,33,36,39,42)
en_pos <- c(3,6,9,12,15,18,21,24,30,35,38,41,44)
mydf <- data.frame(starts=st_pos, ends=en_pos, label=unlist(strsplit(myaa, "")))

How should I produce the st_pos and en_pos vectors in these cases in an easy way?

EDIT

To get these numbers, I just split the sequences in groups of 3 letters and count the positions manually, but I don't know how to accomplish that in an automatic way.

For case 1 sequence start positions:

1    4    7    10   13   16   19   22   25      31     36   39   42
CTA  CGT  AGC  TAG  CTG  GGG  TAC  CGT  T---AT  TC--A  GCT  AGC  ATG

Likewise for case 1 end positions:

  3    6    9    12   15   18   21   24      30     35   38   41   44
CTA  CGT  AGC  TAG  CTG  GGG  TAC  CGT  T---AT  TC--A  GCT  AGC  ATG

Now for case 2 sequence start positions:

1    4    7    10   13   16   19   22        28       33   36   39   42
CTA  CGT  AGC  TAG  CTG  GGG  TAC  CGT  ---  TAT  --  TCA  GCT  AGC  ATG

And case 2 end positions:

  3    6    9    12   15   18   21   24        30       35   38   41   44
CTA  CGT  AGC  TAG  CTG  GGG  TAC  CGT  ---  TAT  --  TCA  GCT  AGC  ATG

Solution

  • Try the following, which identifies the index of any upper-case letter in the sequence using grep and then creates the start and end positions from this.

    Case #1
    
    myseq <- "CTACGTAGCTAGCTGGGGTACCGTT---ATTC--AGCTAGCATG"
    
    idx <- grep("[A-Z]", unlist(strsplit(myseq, "")))
    data.frame(st_pos=idx[seq(1, length(idx), 3)], 
               en_pos=idx[seq(3, length(idx), 3)])
    

    Gives:

       st_pos en_pos
    1       1      3
    2       4      6
    3       7      9
    4      10     12
    5      13     15
    6      16     18
    7      19     21
    8      22     24
    9      25     30
    10     31     35
    11     36     38
    12     39     41
    13     42     44
    

    which matches your expected output.


    Case #2

    myseq <- "CTACGTAGCTAGCTGGGGTACCGT---TAT--TCAGCTAGCATG"
    
    idx <- grep("[A-Z]", unlist(strsplit(myseq, "")))
    data.frame(st_pos=idx[seq(1, length(idx), 3)], 
               en_pos=idx[seq(3, length(idx), 3)])
    

    Gives

       st_pos en_pos
    1       1      3
    2       4      6
    3       7      9
    4      10     12
    5      13     15
    6      16     18
    7      19     21
    8      22     24
    9      28     30
    10     33     35
    11     36     38
    12     39     41
    13     42     44
    

    Which also matches your expected output.