
Matching the start of a sequence in R


I have a series of strings in a vector and need to remove the shared starting pattern from each string. However, I don't know the pattern or how long it is.

stringa <- c("apple_tart", "apple_pie", "apple_fritter")
stringb <- c("baby breath", "baby oil", "baby doll", "baby name")

I need a function or method that works for both a and b and produces these results:

resultsa  <- c("tart", "pie", "fritter")
resultsb <- c("breath", "oil", "doll", "name")

I know I could do this with str_remove if I knew the pattern or how long the matching pattern was. Is there a way to do this? Perhaps first find the starting string pattern to then use str_remove?


Solution

  • Use Recursion:

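    remove_common <- function(x){
      # Distinct first characters across all strings
      a <- unique(substr(x, 1, 1))
      # Stop when they differ, or when a string is used up (avoids endless recursion)
      if(length(a) > 1 || any(!nzchar(x))) return(x)
      # Otherwise drop the shared first character and repeat
      Recall(substring(x, 2))
    }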
    remove_common(stringa)
    [1] "tart"    "pie"     "fritter"
    remove_common(stringb)
    [1] "breath" "oil"    "doll"   "name"  
    

    Another base R option:

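    fn <- function(x){
      n <- length(x) - 1
      # Join the strings, then capture the longest leading run that recurs n more times
      y <- paste0(x, collapse = " ")
      pat <- regmatches(y, regexec(sprintf("(.*)(?:.*?\\1){%d}\\K", n), y, perl = TRUE))
      # Remove the captured text (interpreted as a regex) from each string
      sub(unlist(pat)[2], "", x)
    }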
    
    fn(stringa)
    [1] "tart"    "pie"     "fritter"
    fn(stringb)
    [1] "breath" "oil"    "doll"   "name"  
    

    Another way:

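    fn <- function(x){
      # Keep the leading characters of x that match y position by position
      f <- function(x, y){
        n <- seq_len(min(length(x), length(y)))
        e <- cumsum(x[n] != y[n])   # stays 0 until the first mismatch
        x[n][e == 0]
      }
      # Fold over the split strings to get the prefix shared by all of them
      v <- Reduce(f, strsplit(x, ""))
      # Drop the prefix by position, so regex metacharacters in it are harmless
      substring(x, length(v) + 1)
    }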
    fn(stringa)
    [1] "tart"    "pie"     "fritter"
    fn(stringb)
    [1] "breath" "oil"    "doll"   "name"
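
The question asks whether one could first find the shared starting pattern and then strip it. A minimal base R sketch of that two-step idea follows. The helper names `common_prefix_length` and `remove_common_prefix` are made up for illustration, and the code assumes a character vector without `NA` values:

```r
# Hypothetical helper: number of leading characters shared by every string in x
common_prefix_length <- function(x) {
  n <- min(nchar(x))
  # For k = 1..n, check whether all strings agree with the first one on their first k characters
  same <- vapply(
    seq_len(n),
    function(k) all(substr(x, 1, k) == substr(x[1], 1, k)),
    logical(1)
  )
  # Agreement on k characters implies agreement on k - 1, so the TRUEs form a leading run
  sum(same)
}

# Hypothetical helper: strip the shared prefix by position rather than by pattern
remove_common_prefix <- function(x) {
  substring(x, common_prefix_length(x) + 1)
}

stringa <- c("apple_tart", "apple_pie", "apple_fritter")
stringb <- c("baby breath", "baby oil", "baby doll", "baby name")

remove_common_prefix(stringa)
# [1] "tart"    "pie"     "fritter"
remove_common_prefix(stringb)
# [1] "breath" "oil"    "doll"   "name"
```

Because the prefix is removed by position, a prefix containing regex metacharacters such as `(` or `.` needs no escaping. If the strings share no prefix, the input comes back unchanged.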