[SOLVED] Finding contiguity by comparing kmers in R

Hello, I have a data frame that looks like this:

LR ID    Kmer        ProcID
1        GTACGTAT    10
1        TACGTATC    10
1        ACGTATCG     2
1        GTATCGTT     3
2        GTTACGTA    16
2        TTACGTAC    16
2        TACGTACT    16
2        ACGTACTT    11

The desired output is something like:

LR1 max length: 16 #(as 2 kmers are consecutively going to proc 10)
LR1 min length: 8
LR2 max length: 24 #(as 3 kmers are consecutively going to proc 16)

There are 800 LR IDs like these, each with kmers going to different processes. My objective is to find, for each LR ID, the longest uninterrupted run of kmers going to the same destination ProcID. To check that consecutive kmers are contiguous, I need to compare the last (k-1) characters of one row with the first (k-1) characters of the next row, and so on.
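
As a minimal sketch of that row-to-row comparison in base R: the helper name overlaps_next is hypothetical (not from any package), and it assumes "comparing (k-1) characters" means matching the suffix of one kmer against the prefix of the next.

```r
# Hypothetical helper: for each pair of neighbouring kmers, TRUE when the
# last k-1 characters of the first equal the first k-1 characters of the second.
overlaps_next <- function(kmers) {
  n <- length(kmers)
  if (n < 2) return(logical(0))
  k <- nchar(kmers)
  suffix <- substr(kmers[-n], 2, k[-n])      # drop the first character
  prefix <- substr(kmers[-1], 1, k[-1] - 1)  # drop the last character
  suffix == prefix
}

# The four kmers of LR ID 1 from the table above
kmers <- c("GTACGTAT", "TACGTATC", "ACGTATCG", "GTATCGTT")
overlaps_next(kmers)
# [1]  TRUE  TRUE FALSE
```

The result has one element fewer than the input, because it describes the links between neighbouring rows rather than the rows themselves.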

I know there is a function called

str_detect()

in the stringr package for R, which checks whether a pattern is present in a string. Is there a better way to do this?

Solution

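We can count the length of each run of consecutive rows with the same ProcID within each LRID, then take the longest and shortest run per LRID and multiply by the kmer length (8):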

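library(dplyr)
df1 %>% 
    # grp increases by 1 each time ProcID differs from the previous row,
    # so each run of identical consecutive ProcID values gets its own id;
    # count() then gives the number of kmers (n) in each run
    count(LRID, grp = cumsum(ProcID != lag(ProcID, default = first(ProcID)))) %>%
    group_by(LRID) %>% 
    # longest and shortest run per LRID, in characters (8 per kmer)
    summarise(max = max(n) * 8, 
              min = min(n) * 8, .groups = 'drop')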
# A tibble: 2 x 3
#   LRID   max   min
#  <int> <dbl> <dbl>
#1     1    16     8
#2     2    24     8

Data

df1 <- structure(list(LRID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), Kmer = c("GTACGTAT", 
"TACGTATC", "ACGTATCG", "GTATCGTT", "GTTACGTA", "TTACGTAC", "TACGTACT", 
"ACGTACTT"), ProcID = c(10L, 10L, 2L, 3L, 16L, 16L, 16L, 11L)),
class = "data.frame", row.names = c(NA, 
-8L))
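
The solution above groups rows on ProcID only and does not look at the kmer strings. If the (k-1) overlap between neighbouring kmers should also be required, one possible variant in base R is sketched below. The variable names (new_run, run_id, res and so on) are illustrative, and the sketch assumes a run is broken whenever the LRID changes, the ProcID changes, or the overlap fails.

```r
df <- data.frame(
  LRID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  Kmer = c("GTACGTAT", "TACGTATC", "ACGTATCG", "GTATCGTT",
           "GTTACGTA", "TTACGTAC", "TACGTACT", "ACGTACTT"),
  ProcID = c(10L, 10L, 2L, 3L, 16L, 16L, 16L, 11L),
  stringsAsFactors = FALSE
)

n <- nrow(df)
k <- nchar(df$Kmer)

# Compare every row with the previous one (vectors of length n - 1).
same_lr   <- df$LRID[-1] == df$LRID[-n]
same_proc <- df$ProcID[-1] == df$ProcID[-n]
overlap   <- substr(df$Kmer[-n], 2, k[-n]) == substr(df$Kmer[-1], 1, k[-1] - 1)

# A new run starts on the first row and wherever any condition fails.
new_run <- c(TRUE, !(same_lr & same_proc & overlap))
run_id  <- cumsum(new_run)

# Number of kmers in each run and the LRID each run belongs to.
run_len <- as.vector(tapply(df$Kmer, run_id, length))
run_lr  <- as.vector(tapply(df$LRID, run_id, function(x) x[1]))

# Longest and shortest run per LRID, counted in kmers.
res <- data.frame(
  LRID      = sort(unique(df$LRID)),
  max_kmers = as.vector(tapply(run_len, run_lr, max)),
  min_kmers = as.vector(tapply(run_len, run_lr, min))
)
res
```

The result is expressed in numbers of kmers; multiplying by the kmer length (8 here) gives lengths in the convention used in the question.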