r, dictionary, replace, tidyverse

str_replace_all behaves strangely when passed a named vector whose keys and values overlap; what to do?


I have the following problem, which I already brought up in another question: I have a dataframe column with strings expressing single numeric values or ranges of values, like "1:1496,3545:4785,7781" and so on. I also have a dictionary where each numeric value is paired with a sequential ID, such that "91"="91", but "91bis"="92" (the first "double" item). I need to replace each numeric value in the dataframe cells with its sequential ID.

Here is a sample of the dictionary:

dict <- structure(list(q.ID = c("1", "2", "3", "4", "5", "6", "7", "8", 
"9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", 
"20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30", 
"31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41", 
"42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52", 
"53", "54", "55", "56", "57", "58", "59", "60", "61", "62", "63", 
"64", "65", "66", "67", "68", "69", "70", "71", "72", "73", "74", 
"75", "76", "77", "78", "79", "80", "81", "82", "83", "84", "85", 
"86", "87", "88", "89", "90", "91", "92", "93", "94", "95", "96", 
"97", "98", "99", "100"), q.Voce = c("1", "2", "3", "4", "5", 
"6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", 
"17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", 
"28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", 
"39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", 
"50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", 
"61", "62", "63", "64", "65", "66", "67", "68", "69", "70", "71", 
"72", "73", "74", "75", "76", "77", "78", "79", "80", "81", "82", 
"83", "84", "85", "86", "87", "88", "89", "90", "91", "91bis", 
"91ter", "92", "93", "93bis", "94", "94bis", "95", "96")), row.names = c(NA, 
100L), class = "data.frame")

EDIT

And here is what I've tried:

library(stringr)

v <- "1,91bis:94" # example string

# setNames() builds a named vector: names are the keys (q.Voce), values the IDs (q.ID)
str_replace_all(v, setNames(dict$q.ID, dict$q.Voce))

Now, this should return [1] "1,92:97", but it actually returns [1] "1,97:97". As the user @stefan pointed out when I first asked the question, this happens because the dictionary keys and values share elements: the replacements are applied one after another, so "91bis" is first converted to "92"; but "92" is also a key in the dictionary, so it is then replaced by "94", which in turn is replaced by "97". "97" is not a key in the dictionary, so the chain stops there, but with a larger sample it goes on (I've tried). Is there any way to prevent this behaviour?
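The chaining described above can be reproduced on a toy mapping. This is only a minimal sketch, assuming stringr is installed; the names `toy` and `chained` are illustrative and not part of the real data:

```r
library(stringr)

# Toy mapping in which the first replacement ("b") is also the second key
toy <- c(a = "b", b = "c")

# The patterns are applied one after another over the whole string,
# so the "b" produced by the first rule is picked up by the second rule
chained <- str_replace_all("a", toy)
print(chained)
# [1] "c"
```

The same mechanism turns "91bis" into "92", then "94", then "97" with the real dictionary.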


Solution

  • Regarding why the issue you were facing occurs, see R: Efficient way to str_replace_all without recursively replacing conflicting substitutions? One answer can be adapted to this situation, but it needs substantial changes given the format of your data:

    v <- c("1,91bis:94", "1,91bis,94", "1,91bis,96")
    # Output should be c("1,92:97", "1,92,97", "1,92,100")
    str_replace_all(
        v,
        "(?<=,|^|:).*?(?=,|:|$)",
        \(x) setNames(dict$q.ID, dict$q.Voce)[x]
    )
    # [1] "1,92:97"  "1,92,97"  "1,92,100"
    

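    The regex has three parts: the lookbehind (?<=,|^|:) requires the match to be preceded by a comma, a colon, or the start of the string; .*? lazily matches as few characters as possible; and the lookahead (?=,|:|$) requires the match to be followed by a comma, a colon, or the end of the string.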

    This replaces every token that sits between separators (a comma, a colon, or the start or end of the string) with its value in the dictionary. Because each token is looked up exactly once, a replacement is never fed back in as a new key.

    Unlike an approach based on strsplit(), this keeps the information about whether each separator was a comma or a colon.
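For comparison, the same single-pass lookup can be sketched in base R with gregexpr() and regmatches<-. This is an alternative to the stringr call above, not the method from the linked answer. The names `voce`, `lookup`, `tokens` and `result` are illustrative, and the dictionary is rebuilt compactly so the snippet runs on its own. Tokens that are missing from the dictionary are not handled in this sketch.

```r
# Rebuild the sample dictionary: 100 keys, IDs 1..100 (as in the question)
voce <- c(as.character(1:91), "91bis", "91ter", "92", "93", "93bis",
          "94", "94bis", "95", "96")
dict <- data.frame(q.ID = as.character(seq_along(voce)), q.Voce = voce,
                   stringsAsFactors = FALSE)

# Named vector: names are the keys (q.Voce), values are the IDs (q.ID)
lookup <- setNames(dict$q.ID, dict$q.Voce)

v <- c("1,91bis:94", "1,91bis,94", "1,91bis,96")

# Locate every run of characters that is not a separator (comma or colon)
tokens <- gregexpr("[^,:]+", v)

# Replace each token with its ID in one pass; separators stay untouched
result <- v
regmatches(result, tokens) <- lapply(
    regmatches(v, tokens),
    function(tok) unname(lookup[tok])
)
print(result)
# [1] "1,92:97"  "1,92,97"  "1,92,100"
```

Since every token is taken from the original string and looked up once, an ID such as "92" is never treated as a key again.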