r loops classification text-classification

Deterministic classification in R using regular expressions?

I have a list of regular expressions:

regex_list <- list("First Name" = "^[A-Za-z]+$",
                   "Postal Code" = "^[0-9]{5}$",
                   "Email" = "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}")

And then I have a list of strings to be classified:

strings <- c(
  "John", "12345", "john.doe@email.com", "InvalidString", 
  "Alice", "54321", "example.com", "Bob", "67890", "contact@example.org",
  "Charlie", "98765", "test.email@test.co.uk", "David", "13579", "invalid.email",
  "Eva", "24680", "eva.smith@example.com", "Frank", "11111", "frank@email"
)

Now, I would like to classify each and every string according to the regex_list. While this can be achieved using two nested loops:

# Initialize the categories as NA so unmatched strings can be detected below
# (character(n) would create "" entries, for which is.na() is never TRUE)
categories <- rep(NA_character_, length(strings))

# Categorize the strings based on the regular expressions
for (i in seq_along(strings)) {
  for (j in seq_along(regex_list)) {
    if (grepl(regex_list[[j]], strings[i])) {
      categories[i] <- names(regex_list)[j]
      break
    }
  }
  # If it doesn't fit into any category, set it to "No Category"
  if (is.na(categories[i])) {
    categories[i] <- "No Category"
  }
}

...I was wondering whether there is a more elegant way of achieving this. What could it be? :)

Solution

Another simple approach:

found <- apply(sapply(regex_list, grepl, x = strings), 1, function(z) which(z)[1])
replace(names(regex_list)[found], is.na(found), "No Category")
#  [1] "First Name"  "Postal Code" "Email"       "First Name"  "First Name"  "Postal Code" "No Category" "First Name"  "Postal Code" "Email"       "First Name"  "Postal Code" "Email"       "First Name" 
# [15] "Postal Code" "No Category" "First Name"  "Postal Code" "Email"       "First Name"  "Postal Code" "No Category"

This works by first creating a logical matrix, with one row per string and one column per regex, that records whether each regex matched:

sapply(regex_list, grepl, x = strings)
#       First Name Postal Code Email
#  [1,]       TRUE       FALSE FALSE
#  [2,]      FALSE        TRUE FALSE
#  [3,]      FALSE       FALSE  TRUE
#  [4,]       TRUE       FALSE FALSE
#  [5,]       TRUE       FALSE FALSE
#  [6,]      FALSE        TRUE FALSE
#  [7,]      FALSE       FALSE FALSE
#  [8,]       TRUE       FALSE FALSE
#  [9,]      FALSE        TRUE FALSE
# [10,]      FALSE       FALSE  TRUE
# [11,]       TRUE       FALSE FALSE
# [12,]      FALSE        TRUE FALSE
# [13,]      FALSE       FALSE  TRUE
# [14,]       TRUE       FALSE FALSE
# [15,]      FALSE        TRUE FALSE
# [16,]      FALSE       FALSE FALSE
# [17,]       TRUE       FALSE FALSE
# [18,]      FALSE        TRUE FALSE
# [19,]      FALSE       FALSE  TRUE
# [20,]       TRUE       FALSE FALSE
# [21,]      FALSE        TRUE FALSE
# [22,]      FALSE       FALSE FALSE

Most rows have exactly one TRUE, but some have none, so we need to be a little careful here. I'll use apply to operate row-wise (MARGIN=1 means to operate on each row):

sapply(regex_list, grepl, x = strings) |>
  apply(MARGIN = 1, function(z) which(z)[1])
#  [1]  1  2  3  1  1  2 NA  1  2  3  1  2  3  1  2 NA  1  2  3  1  2 NA

The which(z) gives us the positions of the TRUE values within each row. When nothing is found it returns an empty vector, but indexing that with [1] forces it to return NA in this case (and returns the first match when there is at least one TRUE).
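
To see the effect of the trailing [1] in isolation, here is a small standalone check. The two logical vectors are made up for illustration and stand in for single rows of the matrix:

```r
# A row with no match: which() returns integer(0), and [1] turns that into NA
no_match <- c(FALSE, FALSE, FALSE)
length(which(no_match))   # 0
which(no_match)[1]        # NA

# A row with two matches: [1] keeps only the first (lowest-index) one
two_matches <- c(FALSE, TRUE, TRUE)
which(two_matches)[1]     # 2
```

This is also why the order of regex_list matters: if a string matched several patterns, the earliest one in the list would win, just as the break does in the nested-loop version.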

Those numbers are indices into regex_list, so we can use them to index names(regex_list) and then replace each NA with the no-category label.
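
If this is needed in more than one place, the steps could be wrapped in a small helper. The following is only a sketch: the name classify_strings and its default argument are my own choices for illustration (not part of base R), and it assumes patterns is a named list of regular expressions in which earlier entries take priority.

```r
# Hypothetical helper: label each string with the name of the first matching pattern
classify_strings <- function(x, patterns, default = "No Category") {
  # Logical results: one row per string, one column per pattern
  hits <- vapply(patterns, grepl, logical(length(x)), x = x)
  # Force a matrix shape, since vapply returns a plain vector when x has length 1
  dim(hits) <- c(length(x), length(patterns))
  # Index of the first matching pattern per row, NA if none matched
  found <- apply(hits, 1, function(z) which(z)[1])
  replace(names(patterns)[found], is.na(found), default)
}

# Illustrative inputs (a subset of the patterns and strings from the question)
demo_patterns <- list("First Name" = "^[A-Za-z]+$", "Postal Code" = "^[0-9]{5}$")
result <- classify_strings(c("John", "12345", "example.com"), demo_patterns)
result
# [1] "First Name"  "Postal Code" "No Category"
```

I used vapply rather than sapply here only so that the return type is checked. The logic is otherwise the same as the one-liner above.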