
Extract a substring according to a pattern


Suppose I have a vector of strings:

string = c("G1:E001", "G2:E002", "G3:E003")

Now I want a vector containing only the parts after the colon ":", i.e. substring = c("E001", "E002", "E003").

Is there a convenient way in R to do this? Using substr?


Solution

  • Here are a few ways:

    1) sub

    sub(".*:", "", string)
    ## [1] "E001" "E002" "E003"
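
    One point worth checking with (1) is what happens when an element contains more than one colon. The pattern ".*:" is greedy, so it removes everything up to the last colon. The sketch below uses a hypothetical vector x (not from the question) to compare it with "^[^:]*:", which stops at the first colon:

    ```r
    # Hypothetical input with two colons per element (not from the question)
    x <- c("G1:E001:a", "G2:E002:b")

    # Greedy ".*:" matches up to and including the LAST colon
    after_last <- sub(".*:", "", x)

    # "^[^:]*:" matches up to and including the FIRST colon only
    after_first <- sub("^[^:]*:", "", x)

    print(after_last)
    print(after_first)
    ```

    For the strings in the question, which contain exactly one colon each, both patterns give the same result.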
    

    2) strsplit

    sapply(strsplit(string, ":"), "[", 2)
    ## [1] "E001" "E002" "E003"
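
    A possible variation on (2), shown as a sketch: vapply declares the expected result type, and fixed = TRUE treats ":" as a literal string rather than a regular expression. The vector x, including its "nocolon" element, is a made-up input to show that an element without a colon yields NA:

    ```r
    # Hypothetical input; the last element has no colon
    x <- c("G1:E001", "G2:E002", "nocolon")

    # Split on a literal colon
    parts <- strsplit(x, ":", fixed = TRUE)

    # Take the second piece of each element; indexing past the end gives NA
    second <- vapply(parts, function(p) p[2], character(1))

    print(second)
    ```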
    

    3) read.table

    read.table(text = string, sep = ":", as.is = TRUE)$V2
    ## [1] "E001" "E002" "E003"
    

    4) substring

    This assumes the second portion always starts at the 4th character (which is the case in the example in the question):

    substring(string, 4)
    ## [1] "E001" "E002" "E003"
    

    4a) substring/regex

    If the colon were not always in a known position we could modify (4) by searching for it:

    substring(string, regexpr(":", string) + 1)
    ## [1] "E001" "E002" "E003"
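
    One caveat with (4a): regexpr returns -1 for an element that has no colon, so the start position becomes 0 and substring returns that element unchanged. If that is not wanted, the no-match case can be handled explicitly. A minimal sketch, assuming a made-up vector x with prefixes of different lengths and one element without a colon:

    ```r
    # Hypothetical input: varying prefix lengths, last element has no colon
    x <- c("G1:E001", "LONGPREFIX:E002", "nocolon")

    # Position of the first colon in each element; -1 where there is none.
    # as.integer() drops the extra attributes that regexpr() attaches.
    pos <- as.integer(regexpr(":", x, fixed = TRUE))

    # Everything after the colon
    after <- substring(x, pos + 1)

    # Mark elements without a colon as missing instead of returning them whole
    after[pos == -1] <- NA

    print(after)
    ```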
    

    5) strapplyc

    strapplyc returns the parenthesized portion:

    library(gsubfn)
    strapplyc(string, ":(.*)", simplify = TRUE)
    ## [1] "E001" "E002" "E003"
    

    6) read.dcf

    This one only works if the substrings before the colon are unique (which they are in the example in the question). It also requires that the separator be a colon (which it is in the question). If a different separator were used, we could first use sub to replace it with a colon. For example, if the separator were _, then string <- sub("_", ":", string)

    c(read.dcf(textConnection(string)))
    ## [1] "E001" "E002" "E003"
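
    The separator replacement described for (6) could look like the sketch below. The underscore-separated vector x is a made-up input, and the connection is stored in a variable only so that it can be closed afterwards:

    ```r
    # Hypothetical input that uses "_" instead of ":" as the separator
    x <- c("G1_E001", "G2_E002", "G3_E003")

    # Replace the first underscore in each element with a colon
    x_colon <- sub("_", ":", x, fixed = TRUE)

    # read.dcf treats each line as "field: value"; c() flattens the
    # resulting one-row matrix into a plain character vector
    con <- textConnection(x_colon)
    result <- c(read.dcf(con))
    close(con)

    print(result)
    ```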
    

    7) separate

    7a) Using tidyr::separate we create a data frame with two columns, one for the part before the colon and one for after, and then extract the latter.

    library(dplyr)
    library(tidyr)
    
    DF <- data.frame(string)
    DF %>% 
      separate(string, into = c("pre", "post")) %>% 
      pull("post")
    ## [1] "E001" "E002" "E003"
    

    7b) Alternatively, separate can be used to create only the post column, after which we unlist and unname the resulting data frame (DF is the data frame defined in 7a):

    library(dplyr)
    library(tidyr)
    
    DF %>% 
      separate(string, into = c(NA, "post")) %>% 
      unlist %>%
      unname
    ## [1] "E001" "E002" "E003"
    

    8) trimws

    We can use trimws to trim word characters off the left and then use it again to trim the colon. This relies on the whitespace argument of trimws, which requires R 3.6.0 or later.

    trimws(trimws(string, "left", "\\w"), "left", ":")
    ## [1] "E001" "E002" "E003"
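
    Note that (8) depends on the prefix consisting only of word characters (letters, digits and underscore). The sketch below uses a made-up element "G-1:E001" to show what happens otherwise: the first trimws call stops at the hyphen, so the colon is never at the left edge for the second call to remove:

    ```r
    # Hypothetical input; the second element has a non-word character in its prefix
    x <- c("G1:E001", "G-1:E001")

    # Step 1: strip leading word characters; step 2: strip a leading colon
    step1 <- trimws(x, "left", "\\w")
    step2 <- trimws(step1, "left", ":")

    print(step1)
    print(step2)
    ```

    For such input, one of the colon-based approaches above, such as (1) or (4a), may be a better fit.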
    

    Note

    The input string is assumed to be:

    string <- c("G1:E001", "G2:E002", "G3:E003")