I have a list of regular expressions:
regex_list <- list("First Name" = "^[A-Za-z]+$",
                   "Postal Code" = "^[0-9]{5}$",
                   "Email" = "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}")
And then I have a list of strings to be classified:
strings <- c(
  "John", "12345", "john.doe@email.com", "InvalidString",
  "Alice", "54321", "example.com", "Bob", "67890", "contact@example.org",
  "Charlie", "98765", "test.email@test.co.uk", "David", "13579", "invalid.email",
  "Eva", "24680", "eva.smith@example.com", "Frank", "11111", "frank@email"
)
Now, I would like to classify each and every string according to the regex_list. While this can be achieved using two nested loops:
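# Initialize the categories as NA so that unmatched strings can be detected below
# (character(n) would fill the vector with "", which is.na() never flags)
categories <- rep(NA_character_, length(strings))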
# Categorize the strings based on the regular expressions
for (i in seq_along(strings)) {
  for (j in seq_along(regex_list)) {
    if (grepl(regex_list[[j]], strings[i])) {
      categories[i] <- names(regex_list)[j]
      break
    }
  }
  # If it doesn't fit into any category, set it to "No Category"
  if (is.na(categories[i])) {
    categories[i] <- "No Category"
  }
}
...I was wondering whether there is a more elegant way of achieving this. What could it be? :)
Another simple approach:
found <- apply(sapply(regex_list, grepl, x = strings), 1, function(z) which(z)[1])
replace(names(regex_list)[found], is.na(found), "No Category")
# [1] "First Name" "Postal Code" "Email" "First Name" "First Name" "Postal Code" "No Category" "First Name" "Postal Code" "Email" "First Name" "Postal Code" "Email" "First Name"
# [15] "Postal Code" "No Category" "First Name" "Postal Code" "Email" "First Name" "Postal Code" "No Category"
This works by first creating a logical matrix that records whether each string (row) matches each pattern (column):
sapply(regex_list, grepl, x = strings)
#       First Name Postal Code Email
#  [1,]       TRUE       FALSE FALSE
#  [2,]      FALSE        TRUE FALSE
#  [3,]      FALSE       FALSE  TRUE
#  [4,]       TRUE       FALSE FALSE
#  [5,]       TRUE       FALSE FALSE
#  [6,]      FALSE        TRUE FALSE
#  [7,]      FALSE       FALSE FALSE
#  [8,]       TRUE       FALSE FALSE
#  [9,]      FALSE        TRUE FALSE
# [10,]      FALSE       FALSE  TRUE
# [11,]       TRUE       FALSE FALSE
# [12,]      FALSE        TRUE FALSE
# [13,]      FALSE       FALSE  TRUE
# [14,]       TRUE       FALSE FALSE
# [15,]      FALSE        TRUE FALSE
# [16,]      FALSE       FALSE FALSE
# [17,]       TRUE       FALSE FALSE
# [18,]      FALSE        TRUE FALSE
# [19,]      FALSE       FALSE  TRUE
# [20,]       TRUE       FALSE FALSE
# [21,]      FALSE        TRUE FALSE
# [22,]      FALSE       FALSE FALSE
Most rows have exactly one TRUE, but some have none, so we need to be a little careful here. I'll use apply to operate row-wise (MARGIN = 1 means to operate on each row):
sapply(regex_list, grepl, x = strings) |>
  apply(MARGIN = 1, function(z) which(z)[1])
# [1] 1 2 3 1 1 2 NA 1 2 3 1 2 3 1 2 NA 1 2 3 1 2 NA
The which(z) gives us the positions of the TRUE values within each row, but when nothing is found it returns an empty vector; the [1] forces it to return NA in that case (and returns the first match when there is one).
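To see that behaviour in isolation, here is a small sketch using two made-up logical vectors (no_match and two_matches are illustrative names, not part of the data above):

```r
# Hypothetical rows of the match matrix, for illustration only
no_match    <- c(FALSE, FALSE, FALSE)
two_matches <- c(FALSE, TRUE, TRUE)

which(no_match)         # integer(0): nothing matched
which(no_match)[1]      # NA: indexing past the end of an empty vector
which(two_matches)[1]   # 2: the first matching pattern wins
```

The last line also shows why the order of regex_list matters: if a string matches several patterns, the earliest one is used, just like the break in the nested loops.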
Those numbers are indices into regex_list, so we can next use them to index its names, replacing each NA with the no-category label.
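If you need this in more than one place, the steps could be wrapped in a small helper. This is only a sketch: the name classify_strings and its default argument are my own illustrative choices, and it assumes patterns is a named list (or vector) of regexes that are tried in order. It loops per string with vapply rather than building the matrix, so that it also behaves when given a single string (where sapply would return a plain vector and apply would fail).

```r
# Hypothetical helper: label each string with the name of the first pattern it matches
classify_strings <- function(x, patterns, default = "No Category") {
  found <- vapply(x, function(s) {
    # One TRUE/FALSE per pattern for this single string
    hits <- vapply(patterns, grepl, logical(1), x = s)
    which(hits)[1]   # NA when no pattern matches
  }, integer(1), USE.NAMES = FALSE)
  # Turn indices into pattern names, then fill in the unmatched ones
  replace(names(patterns)[found], is.na(found), default)
}

# Toy patterns, assumed for this example only
patterns <- list(Word = "^[A-Za-z]+$", Digits = "^[0-9]+$")
result <- classify_strings(c("abc", "123", "a1"), patterns)
# result holds "Word", "Digits", "No Category"
```

Calling classify_strings(strings, regex_list) should then give the same labels as the one-liner above.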