Splitting a string that is delimited by multiple delimiters (regex)

My data is structured like this:

df <- data.frame(SampleID = c(
  "Akt.B.M_1867_1:100", "Akt.B.M_1869_1:100", "Akt.B.M_1871_1:100",
  "Akt.B.M_1878_1:100", "M.M.K_1259_1:100",   "M.M.K_1262_1:100",
  "M.M.K_1264_1:100",   "M.M.K_1272_1:100",   "SKO.Shard_30_1:100",
  "SKO.Shard_32_1:100", "SKO.Shard_45_1:100", "Zh.San_618_1:100",
  "Zh.San_624_1:100",   "Zh.San_629_1:100"
))

I would like to extract the characters in front of the first period and the characters surrounded by the underline and place them into two subsequent columns. I tried to use separate_wider_delim but there are some entries that have one period and some entries that have two periods. How would I do this in separate_wider_regex or is there a better way without using this function?

Thanks

Solution

1) separate_wider_regex Using the input in the Note at the end use separate_wider_regex as shown:

library(dplyr)
library(tidyr)

df %>%
  separate_wider_regex(SampleID, c(A = "[^.]*", "\\..*_", B = ".*", "_.*"), 
    cols_remove = FALSE)

giving

# A tibble: 14 × 3
   A     B     SampleID          
   <chr> <chr> <chr>             
 1 Akt   1867  Akt.B.M_1867_1:100
 2 Akt   1869  Akt.B.M_1869_1:100
 3 Akt   1871  Akt.B.M_1871_1:100
 4 Akt   1878  Akt.B.M_1878_1:100
 5 M     1259  M.M.K_1259_1:100  
 6 M     1262  M.M.K_1262_1:100  
 7 M     1264  M.M.K_1264_1:100  
 8 M     1272  M.M.K_1272_1:100  
 9 SKO   30    SKO.Shard_30_1:100
10 SKO   32    SKO.Shard_32_1:100
11 SKO   45    SKO.Shard_45_1:100
12 Zh    618   Zh.San_618_1:100  
13 Zh    624   Zh.San_624_1:100  
14 Zh    629   Zh.San_629_1:100

2) sub Using base R remove everything after the first dot giving A and then match everything to the underscore, capture to the next underscore and match the rest keeping the capture.

transform(df, 
  A = sub("\\..*", "", SampleID),
  B = sub(".*_(.*)_.*", "\\1", SampleID)
)

Variations of this are possible:

# 2a - this variation uses the same regex for both A and B
pat <- "([^.]*).*_(.*)_.*"
transform(df, 
  A = sub(pat, "\\1", SampleID),
  B = sub(pat, "\\2", SampleID)
)

# 2b - this variation uses trimws to simplify the regexes needed
transform(df, 
  A = sub("\\..*", "", SampleID),
  B = trimws(SampleID,, "[^_]") |> trimws(, "_")
)

# 2c - same but use Reduce to iteratively apply trimws
trimws2 <- function(x, y) trimws(x,, y)
transform(df, 
  A = sub("\\..*", "", SampleID),
  B = Reduce(trimws2, init = SampleID, c("[^_]", "_"))
)

# 2d - use scan to avoid needing regular expressions
Scan <- function(x, ...) scan(text = x, what = "", quiet = TRUE, ...)
transform(df,
  A = Scan(SampleID, comment.char = "."),
  B = Scan(SampleID, sep = "_")[c(FALSE, TRUE, FALSE)]
)

3) separate_wider_delim If we replace the first dot with underscore then we can use separate_wider_delim

library(dplyr)
library(tidyr)

df %>%
  mutate(tmp = sub("\\.", "_", SampleID)) %>%
  separate_wider_delim(tmp, names = c("A", NA, "B", NA), delim = "_")

4) strsplit Using Base R use strsplit to split the elements of SampleID by dot and underscore and then take the first and second last of each such vector.

L <- strsplit(df$SampleID, "[._]")
transform(df,
  A = sapply(L, head, 1),
  B = sapply(L, \(x) tail(x, 2)[1])
)

5) strcapture With Base R use strapture to extract the matches to the capture groups in the regular expression. pat is from (2).

transform(df, part = strcapture(pat, SampleID, list(A = "", B = "")))

Note

df <- data.frame(SampleID = c(
    "Akt.B.M_1867_1:100", "Akt.B.M_1869_1:100", "Akt.B.M_1871_1:100",
    "Akt.B.M_1878_1:100", "M.M.K_1259_1:100", "M.M.K_1262_1:100",
    "M.M.K_1264_1:100", "M.M.K_1272_1:100", "SKO.Shard_30_1:100",
    "SKO.Shard_32_1:100", "SKO.Shard_45_1:100", "Zh.San_618_1:100",
    "Zh.San_624_1:100", "Zh.San_629_1:100"
  )
)