r regex

Extracting part of string by position in R


I have a vector of strings, each of which looks like this:

ABC_EFG_HIG_ADF_AKF_MNB

From each element I want to extract the third underscore-delimited field (counting from the left), i.e. HIG in this case. How can I achieve this in R?


Solution

  • We can use sub. The pattern matches one or more characters that are not _ ([^_]+) followed by a _, wrapped in a capture group. Since we want the third set of non-_ characters, that group is repeated twice ({2}), followed by a second capture group of one or more non-_ characters, and then .* to consume the rest of the string. In the replacement, the backreference to the second capture group (\\2) keeps only that field.

    sub("^([^_]+_){2}([^_]+).*", "\\2", str1)
    #[1] "HIG"
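
    To see what each group captures, here is a small sketch using regexec and regmatches (base R functions that are not part of the original answer; the names pattern and m are only illustrative). It returns the full match followed by one entry per capture group:

    ```r
    str1 <- "ABC_EFG_HIG_ADF_AKF_MNB"
    pattern <- "^([^_]+_){2}([^_]+).*"

    # regexec() records the full match plus each capture group;
    # regmatches() extracts them as a character vector.
    m <- regmatches(str1, regexec(pattern, str1))[[1]]

    m[1]  # the whole match (here the entire string, because of the trailing .*)
    m[3]  # capture group 2, the field that "\\2" refers to
    #[1] "HIG"
    ```

    Because the pattern matches the whole string, sub replaces all of it with the contents of group 2.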
    

    Another option is scan, which splits a single string on _ so that the third field can be picked by index

    scan(text=str1, sep="_", what="", quiet=TRUE)[3]
    #[1] "HIG"
    

    A similar option, as mentioned by @RHertel, is to use read.table/read.csv on the string

    read.table(text=str1, sep="_", stringsAsFactors=FALSE)[,3]
    #[1] "HIG"
    

    data

    str1 <- "ABC_EFG_HIG_ADF_AKF_MNB"
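
    Since the question mentions a vector of strings, here is a minimal sketch of how the extraction might look on more than one element. The vector strs is hypothetical (only its first element comes from the question), and strsplit with vapply is an alternative that the answer above does not use. One difference worth noting: sub returns its input unchanged when the pattern does not match, whereas indexing the split pieces gives NA for strings with fewer than three fields.

    ```r
    # Hypothetical input; only the first element appears in the question.
    strs <- c("ABC_EFG_HIG_ADF_AKF_MNB", "AAA_BBB_CCC_DDD", "X_Y")

    # sub() is vectorised, but leaves non-matching strings as they are.
    by_regex <- sub("^([^_]+_){2}([^_]+).*", "\\2", strs)
    by_regex
    #[1] "HIG" "CCC" "X_Y"

    # strsplit() returns one character vector of fields per string;
    # asking for a missing third field yields NA.
    third <- vapply(strsplit(strs, "_", fixed = TRUE),
                    function(p) p[3], character(1))
    third
    #[1] "HIG" "CCC" NA
    ```

    Which behaviour is preferable depends on whether short strings should be passed through or flagged as missing.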