My data is structured like this:
df <- data.frame(SampleID = c(
"Akt.B.M_1867_1:100", "Akt.B.M_1869_1:100", "Akt.B.M_1871_1:100",
"Akt.B.M_1878_1:100", "M.M.K_1259_1:100", "M.M.K_1262_1:100",
"M.M.K_1264_1:100", "M.M.K_1272_1:100", "SKO.Shard_30_1:100",
"SKO.Shard_32_1:100", "SKO.Shard_45_1:100", "Zh.San_618_1:100",
"Zh.San_624_1:100", "Zh.San_629_1:100"
))
I would like to extract the characters before the first period and the characters between the two underscores, and place them into two new columns. I tried to use separate_wider_delim, but some entries have one period and some have two. How would I do this with separate_wider_regex, or is there a better way without using this function?
Thanks
1) separate_wider_regex. Using the input in the Note at the end, use separate_wider_regex as shown:
library(dplyr)
library(tidyr)
# named pieces (A, B) become columns; unnamed pieces are matched but dropped
df %>%
  separate_wider_regex(SampleID, c(A = "[^.]*", "\\..*_", B = ".*", "_.*"),
    cols_remove = FALSE)
giving
# A tibble: 14 × 3
A B SampleID
<chr> <chr> <chr>
1 Akt 1867 Akt.B.M_1867_1:100
2 Akt 1869 Akt.B.M_1869_1:100
3 Akt 1871 Akt.B.M_1871_1:100
4 Akt 1878 Akt.B.M_1878_1:100
5 M 1259 M.M.K_1259_1:100
6 M 1262 M.M.K_1262_1:100
7 M 1264 M.M.K_1264_1:100
8 M 1272 M.M.K_1272_1:100
9 SKO 30 SKO.Shard_30_1:100
10 SKO 32 SKO.Shard_32_1:100
11 SKO 45 SKO.Shard_45_1:100
12 Zh 618 Zh.San_618_1:100
13 Zh 624 Zh.San_624_1:100
14 Zh 629 Zh.San_629_1:100
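To see why those four pieces work, it may help to read them as one concatenated regular expression in which the named pieces act as capture groups. The following is only a base R sketch of that idea, not tidyr's actual implementation; the names x, pat1 and m are made up for illustration, and the ^ and $ anchors are added here to reflect that the whole string has to match.

```r
x <- "Akt.B.M_1867_1:100"

# Assumed single-pattern equivalent of the four pieces in (1):
#   ([^.]*)  -> A: everything before the first dot
#   \\..*_   -> unnamed: the dot and everything up to an underscore
#   (.*)     -> B: the text between the underscores
#   _.*      -> unnamed: the last underscore and the rest
pat1 <- "^([^.]*)\\..*_(.*)_.*$"

# regexec/regmatches return the full match followed by the capture groups
m <- regmatches(x, regexec(pat1, x))[[1]]
m[2]  # the piece that would become column A
m[3]  # the piece that would become column B
```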
2) sub. Using base R, remove the first dot and everything after it, giving A. For B, match everything up to the first underscore, capture up to the next underscore, match the rest, and keep only the capture.
transform(df,
  A = sub("\\..*", "", SampleID),
  B = sub(".*_(.*)_.*", "\\1", SampleID)
)
Variations of this are possible:
# 2a - this variation uses the same regex for both A and B
pat <- "([^.]*).*_(.*)_.*"
transform(df,
  A = sub(pat, "\\1", SampleID),
  B = sub(pat, "\\2", SampleID)
)
# 2b - this variation uses trimws to simplify the regexes needed
transform(df,
  A = sub("\\..*", "", SampleID),
  B = trimws(SampleID, whitespace = "[^_]") |> trimws(whitespace = "_")
)
# 2c - same but use Reduce to iteratively apply trimws
trimws2 <- function(x, y) trimws(x, whitespace = y)
transform(df,
  A = sub("\\..*", "", SampleID),
  B = Reduce(trimws2, init = SampleID, c("[^_]", "_"))
)
# 2d - use scan to avoid needing regular expressions
Scan <- function(x, ...) scan(text = x, what = "", quiet = TRUE, ...)
transform(df,
  A = Scan(SampleID, comment.char = "."),
  B = Scan(SampleID, sep = "_")[c(FALSE, TRUE, FALSE)]
)
3) separate_wider_delim. If we replace the first dot with an underscore, then we can use separate_wider_delim:
library(dplyr)
library(tidyr)
df %>%
  mutate(tmp = sub("\\.", "_", SampleID)) %>%
  separate_wider_delim(tmp, names = c("A", NA, "B", NA), delim = "_")
4) strsplit. Using base R, use strsplit to split the elements of SampleID by dot and underscore, and then take the first and the second-to-last element of each resulting vector.
L <- strsplit(df$SampleID, "[._]")
transform(df,
  A = sapply(L, head, 1),
  B = sapply(L, \(x) tail(x, 2)[1])
)
5) strcapture. With base R, use strcapture to extract the matches to the capture groups in the regular expression. pat is from (2a).
transform(df, part = strcapture(pat, SampleID, list(A = "", B = "")))
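Because strcapture returns a data frame, passing it to transform under the name part should yield columns named part.A and part.B rather than plain A and B. If plain names are wanted, one possible variation (a sketch, not part of the approach above) is to cbind the result instead. The two-row df and the name res below are only for illustration, and the prototype is written as an empty data frame, which is the form the strcapture documentation describes.

```r
# Small illustrative input; the full data is in the Note
df <- data.frame(SampleID = c("Akt.B.M_1867_1:100", "Zh.San_618_1:100"))

# Same pattern as pat in (2a)
pat <- "([^.]*).*_(.*)_.*"

# strcapture returns a data frame with one column per capture group,
# named and typed after the prototype
parts <- strcapture(pat, df$SampleID, data.frame(A = character(), B = character()))

# cbind keeps the column names A and B as they are
res <- cbind(df, parts)
names(res)
```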
Note
df <- data.frame(SampleID = c(
  "Akt.B.M_1867_1:100", "Akt.B.M_1869_1:100", "Akt.B.M_1871_1:100",
  "Akt.B.M_1878_1:100", "M.M.K_1259_1:100", "M.M.K_1262_1:100",
  "M.M.K_1264_1:100", "M.M.K_1272_1:100", "SKO.Shard_30_1:100",
  "SKO.Shard_32_1:100", "SKO.Shard_45_1:100", "Zh.San_618_1:100",
  "Zh.San_624_1:100", "Zh.San_629_1:100"
))