r regex gsub

How to return the results only when matching is true in R gsub()


I'm working with a character vector in R (test) where I need to extract specific parts of strings that match a pattern while discarding the original strings that don't match.

My current solution is shown below. However, the regular expression "^S.*\\." is used twice. Is there any way in R, similar to gsub("^S.*\\.", "", names_with_S), to return results only for the strings that match?

test <- c("Sample1.data", "Sample2.info", "S123.results", "Sabc.temp", "Other.data", "Sxyz.final", "Sample3", "xaaa")

# My current solution
names_with_S = unique(grep("^S.*\\.", test, value = TRUE))
output_desired = unique(gsub("^S.*\\.","",names_with_S))

# Desired Output:
# [1] "data"     "info"     "results"  "temp"  "final" 


Solution

  • Since you aren't interested in keeping placeholders for strings that don't match, we can let gsub() run over everything and then use setdiff() to drop the strings it left unchanged:

    setdiff(gsub("S.*\\.", "", test), test)
    # [1] "data"    "info"    "results" "temp"    "final"  
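
    Here is a small sketch of why this works, and of one case where it may not. The anchored pattern is taken from the question, and test2 is a hypothetical input made up for illustration. Strings that don't match come back from gsub() unchanged, so they are still members of test and setdiff() removes them. It also removes duplicates, which stands in for the unique() call in the question.

    ```r
    test <- c("Sample1.data", "Sample2.info", "S123.results", "Sabc.temp",
              "Other.data", "Sxyz.final", "Sample3", "xaaa")

    # Non-matching strings are returned unchanged by gsub(), so they still
    # appear in `test` and setdiff() filters them out.
    stripped <- gsub("^S.*\\.", "", test)
    result <- setdiff(stripped, test)
    print(result)

    # Hypothetical edge case: a stripped value that also occurs verbatim in
    # the input is dropped, even though it came from a real match.
    test2 <- c("Sample1.data", "data")
    result2 <- setdiff(gsub("^S.*\\.", "", test2), test2)
    print(result2)
    ```

    If the input can contain bare suffixes like "data", the grep()-then-gsub() approach from the question is the safer choice.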
    

    For fun, an alternative:

    strcapture("S.*\\.(.*)", test, list(a=""))
    #         a
    # 1    data
    # 2    info
    # 3 results
    # 4    temp
    # 5    <NA>
    # 6   final
    # 7    <NA>
    # 8    <NA>
    

    One can do strcapture(..) |> subset(!is.na(a)) and then pull the column out with [[ or $ to get just the matching substrings.
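
    Spelled out, that pipeline might look like the sketch below. It assumes R 4.1 or later for the native |> pipe, and it adds the ^ anchor from the question's pattern. The names captured and matched are only illustrative.

    ```r
    test <- c("Sample1.data", "Sample2.info", "S123.results", "Sabc.temp",
              "Other.data", "Sxyz.final", "Sample3", "xaaa")

    # strcapture() returns one row per input string; strings that don't
    # match get NA in the captured column `a`.
    captured <- strcapture("^S.*\\.(.*)", test, list(a = ""))

    # Keep only the rows that matched, then extract the column as a vector.
    matched <- (captured |> subset(!is.na(a)))$a
    print(unique(matched))
    ```

    The pattern appears only once here, and the NA rows serve as the "did it match?" flag.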

    I'll note that in general extraction (and filtering) of a pattern is often done with one of four tools in R. Unfortunately, only the first two allow us to do it in one sweep:

    The latter two could work except they don't support one of two things that would (greatly) facilitate it: