I'm working with a character vector in R (test) where I need to extract specific parts of the strings that match a pattern, while discarding the strings that don't match.
My current solution is shown below. However, the regular expression "^S.*\\." is used twice.
Is there any way in R, similar to gsub("^S.*\\.","",names_with_S), to return a result only for the strings that match?
test <- c("Sample1.data", "Sample2.info", "S123.results", "Sabc.temp", "Other.data", "Sxyz.final", "Sample3", "xaaa")
# My current solution
names_with_S = unique(grep("^S.*\\.", test, value = TRUE))
output_desired = unique(gsub("^S.*\\.","",names_with_S))
# Desired Output:
# [1] "data" "info" "results" "temp" "final"
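Since you aren't interested in keeping placeholders for strings that don't match anything, we can do this (keeping the ^ anchor from your pattern, so that only strings starting with "S" are stripped):
setdiff(gsub("^S.*\\.", "", test), test)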
# [1] "data" "info" "results" "temp" "final"
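To see why this works, and where it can bite, here is a minimal sketch that splits the one-liner into its two steps. It uses the question's anchored pattern; the vector tricky is a made-up input for illustration, not something from the question.

```r
test <- c("Sample1.data", "Sample2.info", "S123.results", "Sabc.temp",
          "Other.data", "Sxyz.final", "Sample3", "xaaa")

# Step 1: strip the prefix; strings that don't match come back unchanged
stripped <- gsub("^S.*\\.", "", test)

# Step 2: drop everything still identical to an input string
# (setdiff() also removes duplicates, so unique() isn't needed)
output_desired <- setdiff(stripped, test)

# Caveat (hypothetical input): a stripped value that also appears verbatim
# in the input is dropped as well, so "data" is lost here
tricky <- c("S1.data", "data")
lost <- setdiff(gsub("^S.*\\.", "", tricky), tricky)
```

If an extracted piece can also occur as a whole element of the input, as in tricky, this shortcut silently drops it, and the two-step grep/gsub from the question is the safer choice.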
For fun, an alternative:
strcapture("^S.*\\.(.*)", test, list(a=""))
#         a
# 1    data
# 2    info
# 3 results
# 4    temp
# 5    <NA>
# 6   final
# 7    <NA>
# 8    <NA>
One can do strcapture(..) |> subset(!is.na(a)) and then pull the column out with [[ or $ to get just the matching substrings.
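Spelled out end to end, that could look like the sketch below. It avoids the native pipe so it also runs on R versions before 4.1; the intermediate names captured and matched are only for illustration, and the pattern is the question's anchored one.

```r
test <- c("Sample1.data", "Sample2.info", "S123.results", "Sabc.temp",
          "Other.data", "Sxyz.final", "Sample3", "xaaa")

# One row per input string; the capture group fills column `a`,
# and strings that don't match the pattern give NA
captured <- strcapture("^S.*\\.(.*)", test, list(a = ""))

# Keep only the rows where the pattern matched
matched <- subset(captured, !is.na(a))

# Pull the column out as a plain character vector and de-duplicate
output_desired <- unique(matched[["a"]])
```

The pattern appears only once, and the NA rows are what tell matches from non-matches.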
I'll note that, in general, extracting (and filtering on) a pattern in R is usually done with one of four tools. Unfortunately, only the first two allow us to do it in one sweep:
1. gsub(), as shown above and in ThomasIsCoding's innovative thought to add |.* to the pattern and Filter(..) it
2. strcapture()
3. regmatches(test, gregexpr(.., test)), does not work
4. stringr::str_extract_all(), does not work
The latter two could work except they don't support one of two things that would (greatly) facilitate it:
- (?:S.*\\.), which would allow us to match on a substring but not capture/extract it; gregexpr captures it anyway, and stringr:: fails to match completely
- (?<=S.*\\.), which fails because a lookbehind (in R at least) needs to be of a known length, and the .* defeats that notion (a more general regex guru might inform whether this is true with regex in general as well); one "might" use (?<=\\.), but that misses the leading "S", for which we'd need another regex :-(
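For completeness, here is a rough sketch of how I read the |.* idea mentioned in the first item. It is my reconstruction, not necessarily ThomasIsCoding's exact code. It assumes sub() rather than gsub(), so that only the first match is replaced, and perl = TRUE, so that the alternatives are tried left to right.

```r
test <- c("Sample1.data", "Sample2.info", "S123.results", "Sabc.temp",
          "Other.data", "Sxyz.final", "Sample3", "xaaa")

# First alternative: strip the "S....": prefix when the string has one.
# Second alternative: otherwise match (and so blank out) the whole string.
# sub() replaces the first match only, so the remainder after the prefix survives.
blanked <- sub("^S.*\\.|.*", "", test, perl = TRUE)

# Non-matching strings are now "", so keep only the non-empty results
output_desired <- Filter(nzchar, unique(blanked))
```

The pattern is still written only once; the cost is that a string matching the prefix with nothing after the dot (for example "S1.") also ends up empty and is filtered out.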