
Deterministic classification in R using regular expressions?


I have a list of regular expressions:

regex_list <- list("First Name" = "^[A-Za-z]+$",
                   "Postal Code" = "^[0-9]{5}$",
                   "Email" = "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}")

And then I have a vector of strings to be classified:

strings <- c(
  "John", "12345", "john.doe@email.com", "InvalidString", 
  "Alice", "54321", "example.com", "Bob", "67890", "contact@example.org",
  "Charlie", "98765", "test.email@test.co.uk", "David", "13579", "invalid.email",
  "Eva", "24680", "eva.smith@example.com", "Frank", "11111", "frank@email"
)

Now, I would like to classify each and every string according to the regex_list. While this can be achieved using two nested loops:

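# Initialize the categories as NA so that unmatched strings can be detected
# later with is.na() (character(n) would fill the vector with "" instead)
categories <- rep(NA_character_, length(strings))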

# Categorize the strings based on the regular expressions
for (i in seq_along(strings)) {
  for (j in seq_along(regex_list)) {
    if (grepl(regex_list[[j]], strings[i])) {
      categories[i] <- names(regex_list)[j]
      break
    }
  }
  # If it doesn't fit into any category, set it to "No Category"
  if (is.na(categories[i])) {
    categories[i] <- "No Category"
  }
}

...I was wondering whether there is a more elegant way of achieving this. What could it be? :)


Solution

  • Another simple approach:

    found <- apply(sapply(regex_list, grepl, x = strings), 1, function(z) which(z)[1])
    replace(names(regex_list)[found], is.na(found), "No Category")
    #  [1] "First Name"  "Postal Code" "Email"       "First Name"  "First Name"  "Postal Code" "No Category" "First Name"  "Postal Code" "Email"       "First Name"  "Postal Code" "Email"       "First Name" 
    # [15] "Postal Code" "No Category" "First Name"  "Postal Code" "Email"       "First Name"  "Postal Code" "No Category"
    

    This works by first building a logical matrix that records whether each string (row) matches each regex (column):

    sapply(regex_list, grepl, x = strings)
    #       First Name Postal Code Email
    #  [1,]       TRUE       FALSE FALSE
    #  [2,]      FALSE        TRUE FALSE
    #  [3,]      FALSE       FALSE  TRUE
    #  [4,]       TRUE       FALSE FALSE
    #  [5,]       TRUE       FALSE FALSE
    #  [6,]      FALSE        TRUE FALSE
    #  [7,]      FALSE       FALSE FALSE
    #  [8,]       TRUE       FALSE FALSE
    #  [9,]      FALSE        TRUE FALSE
    # [10,]      FALSE       FALSE  TRUE
    # [11,]       TRUE       FALSE FALSE
    # [12,]      FALSE        TRUE FALSE
    # [13,]      FALSE       FALSE  TRUE
    # [14,]       TRUE       FALSE FALSE
    # [15,]      FALSE        TRUE FALSE
    # [16,]      FALSE       FALSE FALSE
    # [17,]       TRUE       FALSE FALSE
    # [18,]      FALSE        TRUE FALSE
    # [19,]      FALSE       FALSE  TRUE
    # [20,]       TRUE       FALSE FALSE
    # [21,]      FALSE        TRUE FALSE
    # [22,]      FALSE       FALSE FALSE
    

    Most rows have exactly one TRUE, but some have none, so we need to be a little careful here. I'll use apply to operate row-wise (MARGIN = 1 means the function is applied to each row):

    sapply(regex_list, grepl, x = strings) |>
      apply(MARGIN = 1, function(z) which(z)[1])
    #  [1]  1  2  3  1  1  2 NA  1  2  3  1  2  3  1  2 NA  1  2  3  1  2 NA
    

    which(z) returns the positions of the TRUE values within each row. When a row has no match it returns an empty vector, integer(0), and indexing that with [1] yields NA. When a row has one or more matches, [1] keeps only the first, so earlier entries in regex_list take precedence.
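
    The behaviour of which(z)[1] can be checked in isolation. This is a small sketch; the logical vectors below are made up for illustration and are not taken from the data above:

    ```r
    # A row with no match: which() returns integer(0), and [1] turns that into NA
    no_match <- which(c(FALSE, FALSE, FALSE))[1]

    # A row with several matches: [1] keeps only the first matching index
    first_match <- which(c(FALSE, TRUE, TRUE))[1]

    print(no_match)     # NA
    print(first_match)  # 2
    ```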

    Those numbers are indices into regex_list, so the final step is to use them to index names(regex_list) and replace each NA with the "No Category" label.
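
    If this is needed in more than one place, the steps can be wrapped in a small function. The following is only a sketch: the name classify_strings and its default argument are hypothetical, and it uses vapply plus an explicit dim assignment in place of sapply so that a single input string still yields a one-row matrix.

    ```r
    # Hypothetical helper; the name and the `default` argument are illustrative
    classify_strings <- function(strings, regex_list, default = "No Category") {
      # One logical column per regex, one row per string
      hits <- vapply(regex_list, grepl, logical(length(strings)), x = strings)
      # Force a matrix shape even when there is only one string
      dim(hits) <- c(length(strings), length(regex_list))

      # Index of the first matching regex in each row, or NA if none match
      found <- apply(hits, 1, function(z) which(z)[1])

      # Map indices to category names and fill in the unmatched entries
      out <- names(regex_list)[found]
      out[is.na(found)] <- default
      out
    }

    regex_list <- list("First Name" = "^[A-Za-z]+$",
                       "Postal Code" = "^[0-9]{5}$",
                       "Email" = "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}")

    result <- classify_strings(c("John", "12345", "john.doe@email.com", "example.com"),
                               regex_list)
    print(result)
    # "First Name"  "Postal Code" "Email"       "No Category"
    ```

    Because the first match wins, the order of regex_list acts as a priority order whenever a string could match more than one pattern.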