Say I have a sequence like myseq
below. It is a DNA sequence, so each group of 3 letters make for one letter (aminoacid) in the accompanying myaa
sequence. It has a length of 13aa.
I want to create a mydf
dataframe with the start and end positions of each group of 3 letters in myseq
. In the basic example below without gaps, I do this easily with seq()
.
myseq <- "CTACGTAGCTAGCTGGGGTACCGTTATTCAGCTAGCATG"
myaa <- "XYZWWXZZVVYZX"
st_pos <- seq(1, nchar(myseq)-2, 3)
en_pos <- seq(3, nchar(myseq), 3)
mydf <- data.frame(starts=st_pos, ends=en_pos, label=unlist(strsplit(myaa, "")))
This produces this dataframe, which is exactly what I want:
> mydf
starts ends label
1 1 3 X
2 4 6 Y
3 7 9 Z
4 10 12 W
5 13 15 W
6 16 18 X
7 19 21 Z
8 22 24 Z
9 25 27 V
10 28 30 V
11 31 33 Y
12 34 36 Z
13 37 39 X
However, I am encountering examples with my real data where myseq
contains gaps. In these cases I cannot rely on seq()
because I need to account for the gaps for the start and end positions.
How should I go about it? I show you 2 cases below, and my expected mydf
dataframe, for which I just hard coded the start and end positions.
#case 1 - gaps breaking groups of 3 letters in half
myseq <- "CTACGTAGCTAGCTGGGGTACCGTT---ATTC--AGCTAGCATG"
st_pos <- c(1,4,7,10,13,16,19,22,25,31,36,39,42)
en_pos <- c(3,6,9,12,15,18,21,24,30,35,38,41,44)
mydf <- data.frame(starts=st_pos, ends=en_pos, label=unlist(strsplit(myaa, "")))
#case 2 - gaps in between groups of 3 letters, and at the beginning of groups of 3 letters
myseq <- "CTACGTAGCTAGCTGGGGTACCGT---TAT--TCAGCTAGCATG"
st_pos <- c(1,4,7,10,13,16,19,22,28,33,36,39,42)
en_pos <- c(3,6,9,12,15,18,21,24,30,35,38,41,44)
mydf <- data.frame(starts=st_pos, ends=en_pos, label=unlist(strsplit(myaa, "")))
How should I produce the st_pos
and en_pos
vectors in these cases in an easy way?
EDIT
To get these numbers, I just split the sequences in groups of 3 letters and count the positions manually, but I don't know how to accomplish that in an automatic way.
For case 1 sequence start positions:
1 4 7 10 13 16 19 22 25 31 36 39 42
CTA CGT AGC TAG CTG GGG TAC CGT T---AT TC--A GCT AGC ATG
Likewise for case 1 end positions:
3 6 9 12 15 18 21 24 30 35 38 41 44
CTA CGT AGC TAG CTG GGG TAC CGT T---AT TC--A GCT AGC ATG
Now for case 2 sequence start positions:
1 4 7 10 13 16 19 22 28 33 36 39 42
CTA CGT AGC TAG CTG GGG TAC CGT --- TAT -- TCA GCT AGC ATG
And case 2 end positions:
3 6 9 12 15 18 21 24 30 35 38 41 44
CTA CGT AGC TAG CTG GGG TAC CGT --- TAT -- TCA GCT AGC ATG
Try the following, which identifies the index of any upper-case letter in the sequence using grep
and then creates the start and end positions from this.
Case #1
myseq <- "CTACGTAGCTAGCTGGGGTACCGTT---ATTC--AGCTAGCATG"
idx <- grep("[A-Z]", unlist(strsplit(myseq, "")))
data.frame(st_pos=idx[seq(1, length(idx), 3)],
en_pos=idx[seq(3, length(idx), 3)])
Gives:
st_pos en_pos
1 1 3
2 4 6
3 7 9
4 10 12
5 13 15
6 16 18
7 19 21
8 22 24
9 25 30
10 31 35
11 36 38
12 39 41
13 42 44
which matches your expected output.
Case #2
myseq <- "CTACGTAGCTAGCTGGGGTACCGT---TAT--TCAGCTAGCATG"
idx <- grep("[A-Z]", unlist(strsplit(myseq, "")))
data.frame(st_pos=idx[seq(1, length(idx), 3)],
en_pos=idx[seq(3, length(idx), 3)])
Gives
st_pos en_pos
1 1 3
2 4 6
3 7 9
4 10 12
5 13 15
6 16 18
7 19 21
8 22 24
9 28 30
10 33 35
11 36 38
12 39 41
13 42 44
Which also matches your expected output.