I have several thousand rows like this stored in a object in R. I want to make a new df2 by pairing elements from two columns ($exonStarts and $exonEnds) from the df below
df <- structure(list(bin = c(0L, 0L, 0L, 0L, 0L, 0L), name = c("XM_011541469.2",
"XM_017001276.2", "XM_011541467.2", "NM_001276352.2", "NM_001276351.2",
"XM_011541465.3"), chr = c("chr1", "chr1", "chr1", "chr1", "chr1",
"chr1"), strand = c("-", "-", "-", "-", "-", "-"), txStart = c(67092164L,
67092164L, 67092164L, 67092164L, 67092164L, 67092164L), txEnd = c(67109072L,
67131227L, 67131227L, 67134970L, 67134970L, 67134970L), cdsStart = c(67093004L,
67093004L, 67093004L, 67093579L, 67093004L, 67093004L), cdsEnd = c(67103382L,
67127240L, 67127240L, 67127240L, 67127240L, 67127240L), exonCount = c(5L,
9L, 9L, 9L, 8L, 9L), exonStarts = c("67092164,67095234,67096251,67103237,67109028,",
"67092164,67095234,67096251,67103237,67111576,67115351,67125751,67127165,67131141,",
"67092164,67095234,67096251,67103237,67111576,67115351,67125751,67127165,67131141,",
"67092164,67096251,67103237,67111576,67115351,67125751,67127165,67131141,67134929,",
"67092164,67095234,67096251,67115351,67125751,67127165,67131141,67134929,",
"67092164,67095234,67096251,67103237,67111576,67115351,67125751,67127165,67134929,"
), exonEnds = c("67093604,67095421,67096321,67103382,67109072,",
"67093604,67095421,67096321,67103382,67111644,67115464,67125909,67127257,67131227,",
"67093604,67095421,67096321,67103343,67111644,67115464,67125909,67127257,67131227,",
"67093604,67096321,67103382,67111644,67115464,67125909,67127257,67131227,67134970,",
"67093604,67095421,67096321,67115464,67125909,67127257,67131227,67134970,",
"67093604,67095421,67096321,67103382,67111644,67115464,67125909,67127257,67134970,"
), score = c(0L, 0L, 0L, 0L, 0L, 0L), gene = c("C1orf141", "C1orf141",
"C1orf141", "C1orf141", "C1orf141", "C1orf141"), cdsStartStat = c("cmpl",
"cmpl", "cmpl", "cmpl", "cmpl", "cmpl"), cdsEndStat = c("cmpl",
"cmpl", "cmpl", "cmpl", "cmpl", "cmpl"), exonFrames = c("0,2,1,0,-1,",
"0,2,1,0,1,2,0,0,-1,", "0,2,1,0,1,2,0,0,-1,", "2,1,0,1,2,0,0,-1,-1,",
"0,2,1,2,0,0,-1,-1,", "0,2,1,0,1,2,0,0,-1,")), row.names = c(NA,
6L), class = "data.frame")
to produce an output like below. The first row of the input df should produce 5 rows of df2 in the format below by paring each element of $exonStart with each element of $exonEnd (the element counts will identical and is present in the column $exonCount).
$chr $exonStart $exonEnd
chr1 67092164 67093604
chr1 67095234 67095421
chr1 67096251 67096321
chr1 67103237 67103382
chr1 67109028 67109072
This needs to iterate over all rows of input df and all results combined into df2. So, in total there would be sum(object$exonCount)
rows in df2.
I know that there would be some sort of strsplit
function in here to split $exonStarts and $exonEnds and then combine them. Maybe an apply
function to do this over all rows? I tried something like this first
map2(unlist(strsplit(df$exonStarts[1], ",")), unlist(strsplit(dft$exonEnds[1], ",")), c)
to produced a list of paired elements
[[1]]
[1] "67092164" "67093604"
[[2]]
[1] "67095234" "67095421"
[[3]]
[1] "67096251" "67096321"
[[4]]
[1] "67103237" "67103382"
[[5]]
[1] "67109028" "67109072"
But I do not know where to go from here. Any help would be appreciated.
You could use separate_longer_delim
from tidyr
library(tidyr)
library(dplyr)
select(df, chr, exonStarts, exonEnds) %>%
separate_longer_delim(c(exonStarts, exonEnds), delim=",")
Which gives (showing first 10 rows only):
chr exonStarts exonEnds
1 chr1 67092164 67093604
2 chr1 67095234 67095421
3 chr1 67096251 67096321
4 chr1 67103237 67103382
5 chr1 67109028 67109072
6 chr1
7 chr1 67092164 67093604
8 chr1 67095234 67095421
9 chr1 67096251 67096321
10 chr1 67103237 67103382
The blank rows are due to the trailing commas in the original columns, which can easily be removed using filter
. If you do that, you will see that there are 49 rows remaining, the same as sum(df$exonCount)
.
As @TarJae mentions, the number of items in each column must be equal, or an error will result. Presumably you are confident that this situation isn't possible.
df$exonEnds[1] <- "67093604,67095421,67096321,67103382,"
select(df, chr, exonStarts, exonEnds) %>%
separate_longer_delim(c(exonStarts, exonEnds), delim=",")
#Error in `separate_longer_delim()`:
#! In row 1, can't recycle input of size 6 to size 5.
#Run `rlang::last_trace()` to see where the error occurred.
You'll also get an error using your strsplit
approach.