rstringtidyversehexsubstitution

R: Efficient way to str_replace_all without recursively replacing conflicting substitutions?


Hello,

The problem
First, let me try to illustrate the problem. Assume I want to apply the following cipher to encode the string, "abc".

library(tidyverse)

cipher <- tibble(
  byte = c(128:153, 160:185, 246:255) %>% as.hexmode() %>% str_to_upper(),
  char = c(LETTERS, letters, 0:9)
)

"abc" %>% str_replace_all(set_names(cipher$byte, cipher$char))
# [1] "AFFCAFFDAFFE"

The result I'd like is "A0A1A2", not "AFFCAFFDAFFE". It looks like 0 in the first substitution A0 is replaced with its own substitution, which is FF, and so on. This is what I mean by recursive replacement of conflicting substitutions.

Related info
I've read this post. I've also read this issue. I've also looked into the vectorize_all argument of the stri_replace_all* function.

Working (but inefficient) solution
The only way I've managed successfully to make multiple string substitutions with replacement values that would otherwise be conflicting, is to split each string character for character, then make the substitutions, and finally paste it all back together. Like so:

library(tidyverse)

c("abc", "123") %>%
  map_chr(\(string) {
    str_split_1(string, "") %>%
      map_chr(\(char) {
        str_replace_all(char, set_names(cipher$byte, paste0("^", cipher$char, "$")))
      }) %>% paste(collapse = "")
  })
# [1] "A0A1A2" "F7F8F9"

Unfortunately, this way of encoding the strings takes a long time (on my 2020 Intel Macbook Pro at least) for large vectors. I prefer working within the tidyverse, but at this stage I'd consider other methods too.


Solution

  • As you are implementing a cipher, I would avoid regex in favour of the base function chartr():

    chartr translates each character in x that is specified in old to the corresponding character specified in new. Ranges are supported in the specifications, but character classes and repeated characters are not.

    The slight wrinkle in this plan is that chartr() only does one-for-one character substitution. You want to replace "a" with "A0" and so on. However, all your two character substitutions can be expressed in hex as one Unicode character.

    We can write a function to convert each of your one character substitutions to a single Unicode character so we can use it in chartr(), and then back again. For example, "A0" becomes intToUtf8("0xA0"), which is space, which is then fed to chartr() as a one character replacement for "a" and then at the end is converted back to "A0":

    do_cipher <- function(x, old = cipher$char, new = cipher$byte) {
        chartr(
            paste0(old, collapse = ""),
            intToUtf8(paste0("0x", new)),
            x
        ) |>
            vapply(
                \(x)
                x |>
                    utf8ToInt() |>
                    as.hexmode() |>
                    paste0(collapse = "") |>
                    toupper(),
                character(1)
            ) |>
            setNames(x)
    }
    

    This returns the desired output:

    c("abc", "123") |> do_cipher()
    #      abc      123
    # "A0A1A2" "F7F8F9"