I have a data frame that looks like this:
LRID  Kmer      ProcID
1     GTACGTAT  10
1     TACGTATC  10
1     ACGTATCG  2
1     GTATCGTT  3
2     GTTACGTA  16
2     TTACGTAC  16
2     TACGTACT  16
2     ACGTACTT  11
The desired output is something like:
LR1 max length: 16 #(as 2 kmers are consecutively going to proc 10)
LR1 min length: 8
LR2 max length: 24 #(as 3 kmers are consecutively going to proc 16)
There are 800 LR IDs like these, each with k-mers going to different processes. My objective is to find, for each LR ID, the longest uninterrupted sequence of k-mers going to the same destination ProcID. I need to compare the last (k-1) characters of one row with the first (k-1) characters of the next row, and so on.
I know there is a function called
str_detect()
(from the stringr package) in R that checks whether a pattern is present in a string. Is there a better way to do this?
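We can use count() on a run identifier built with cumsum() and lag(), then take the longest and shortest run for each LRID: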
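library(dplyr)

df1 %>%
  # start a new run (grp) whenever ProcID differs from the previous row,
  # then count the rows of each run within each LRID
  count(LRID, grp = cumsum(ProcID != lag(ProcID, default = first(ProcID)))) %>%
  group_by(LRID) %>%
  # each k-mer is 8 characters long, so length = number of rows in the run * 8
  summarise(max = max(n) * 8,
            min = min(n) * 8, .groups = 'drop')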
# A tibble: 2 x 3
# LRID max min
# <int> <dbl> <dbl>
#1 1 16 8
#2 2 24 8
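For comparison, the same run-length idea can be sketched in base R with rle(), which returns the lengths of runs of identical consecutive values. This is only a minimal sketch, not part of the solution above: the variable names `runs`, `res` and `k` are made up for illustration, and k = 8 is assumed from the example k-mers.

```r
# Toy data from the question (Kmer column omitted, it is not needed here)
df1 <- data.frame(
  LRID   = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  ProcID = c(10L, 10L, 2L, 3L, 16L, 16L, 16L, 11L)
)

k <- 8  # assumed k-mer length, taken from the example data

# Split ProcID by LRID, then get the lengths of consecutive equal-ProcID runs
runs <- lapply(split(df1$ProcID, df1$LRID), function(p) rle(p)$lengths)

# Longest and shortest run per LRID, converted to characters (rows * k)
res <- data.frame(
  LRID = as.integer(names(runs)),
  max  = vapply(runs, max, integer(1)) * k,
  min  = vapply(runs, min, integer(1)) * k
)
res
```

Because the data are split by LRID first, a run can never span two LR IDs, even if the last ProcID of one LR ID happens to equal the first ProcID of the next.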
The example data used above:
df1 <- structure(list(
    LRID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
    Kmer = c("GTACGTAT", "TACGTATC", "ACGTATCG", "GTATCGTT",
             "GTTACGTA", "TTACGTAC", "TACGTACT", "ACGTACTT"),
    ProcID = c(10L, 10L, 2L, 3L, 16L, 16L, 16L, 11L)),
  class = "data.frame", row.names = c(NA, -8L))
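The question also mentions comparing the (k-1) characters of one row with the next. The dplyr code above only looks at ProcID, so here is a rough sketch of how that overlap check might be done with substr(), shown on the four k-mers of LRID 1. The names `suffix`, `prefix` and `overlaps` are illustrative, and it assumes all k-mers have the same length.

```r
# K-mers of LRID 1 from the example data
kmers <- c("GTACGTAT", "TACGTATC", "ACGTATCG", "GTATCGTT")
k <- nchar(kmers[1])  # assumes every k-mer has the same length

# Last k-1 characters of every k-mer except the final one
suffix <- substr(head(kmers, -1), 2, k)
# First k-1 characters of every k-mer except the first one
prefix <- substr(tail(kmers, -1), 1, k - 1)

# TRUE where a k-mer overlaps the following one by k-1 characters
overlaps <- suffix == prefix
overlaps
# [1]  TRUE  TRUE FALSE
```

If runs should also break where the overlap fails, one option would be to treat a FALSE here as an additional break point alongside a change in ProcID, but whether that is wanted depends on how the k-mers were generated.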