I have the following problem, which I already brought up in another question: I have a dataframe column with strings expressing single numeric values or ranges of values, like "1:1496,3545:4785,7781" and so on. I also have a dictionary where each numeric value is paired with a progressive ID, such that "91"="91", but "91bis"="92" (the first "double" item). I need to replace each numeric value in the dataframe cells with the progressive ID.
Here is a sample of the dictionary:
dict <- structure(list(q.ID = c("1", "2", "3", "4", "5", "6", "7", "8",
"9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19",
"20", "21", "22", "23", "24", "25", "26", "27", "28", "29", "30",
"31", "32", "33", "34", "35", "36", "37", "38", "39", "40", "41",
"42", "43", "44", "45", "46", "47", "48", "49", "50", "51", "52",
"53", "54", "55", "56", "57", "58", "59", "60", "61", "62", "63",
"64", "65", "66", "67", "68", "69", "70", "71", "72", "73", "74",
"75", "76", "77", "78", "79", "80", "81", "82", "83", "84", "85",
"86", "87", "88", "89", "90", "91", "92", "93", "94", "95", "96",
"97", "98", "99", "100"), q.Voce = c("1", "2", "3", "4", "5",
"6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16",
"17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27",
"28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38",
"39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49",
"50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60",
"61", "62", "63", "64", "65", "66", "67", "68", "69", "70", "71",
"72", "73", "74", "75", "76", "77", "78", "79", "80", "81", "82",
"83", "84", "85", "86", "87", "88", "89", "90", "91", "91bis",
"91ter", "92", "93", "93bis", "94", "94bis", "95", "96")), row.names = c(NA,
100L), class = "data.frame")
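To make the pairing concrete, here is a minimal sketch using a small, hypothetical excerpt of the dictionary (the names `excerpt`, `lookup` and `ids` are only for illustration). It shows how a named vector maps each original label to its progressive ID when indexed by name:

```r
# Hypothetical excerpt of the dictionary around the first "double" item
excerpt <- data.frame(
  q.ID   = c("90", "91", "92", "93", "94"),
  q.Voce = c("90", "91", "91bis", "91ter", "92"),
  stringsAsFactors = FALSE
)

# Named vector: names are the original labels, values are the progressive IDs
lookup <- setNames(excerpt$q.ID, excerpt$q.Voce)

# Indexing by name looks each label up exactly once
ids <- unname(lookup[c("91", "91bis", "92")])
print(ids)
```

Looked up this way, "91" stays "91", "91bis" becomes "92", and "92" becomes "94", which is the mapping I am after.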
And here is what I've tried:
library(stringr)
v <- "1,91bis:94" # example string
str_replace_all(v, setNames(dict$q.ID, dict$q.Voce))
Now, this should return [1] "1,92:97", but it actually returns [1] "1,97:97". As the user @stefan pointed out when I first asked the question, this happens because the dictionary keys and values share elements: the function first converts "91bis" to "92", but since "92" is also a key in the dictionary, it is replaced again by "94", and finally by "97". There is no key "97" in this sample of the dictionary, so the process stops there, but with a larger sample it goes on (I've tried).
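The chain can be reproduced in base R with a minimal sketch that applies the replacements one after another, which is my understanding of what happens when a named vector is passed. The three-entry `map` below is a hypothetical excerpt of the dictionary, not the full thing:

```r
# Hypothetical excerpt: only the entries involved in the chain
map <- c("91bis" = "92", "92" = "94", "94" = "97")

v <- "1,91bis:94"
out <- v
for (key in names(map)) {
  # Each pass works on the output of the previous pass,
  # so a freshly inserted "92" is later treated as a key
  out <- gsub(key, map[[key]], out, fixed = TRUE)
  cat("after", key, "->", out, "\n")
}
```

The intermediate strings are "1,92:94", then "1,94:94", then "1,97:97", which matches the wrong result above.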
Is there any way to prevent this strange behaviour?
For why this happens, see "R: Efficient way to str_replace_all without recursively replacing conflicting substitutions?". One answer there can be adapted to this situation, but it needs substantial changes given the format of your data:
v <- c("1,91bis:94", "1,91bis,94", "1,91bis,96")
# Output should be c("1,92:97", "1,92,97", "1,92,100")
str_replace_all(
  v,
  "(?<=,|^|:).*?(?=,|:|$)",
  \(x) setNames(dict$q.ID, dict$q.Voce)[x]
)
# [1] "1,92:97" "1,92,97" "1,92,100"
The regex is, I think, easiest to read piece by piece. It matches anything between two separators (a comma, a colon, or the start or end of the string), and each match is replaced with the corresponding value in the dictionary:
(?<=,|^|:): This part looks behind for either a comma, the start of the string, or a colon. The | inside this lookbehind is not wrapped in a non-capturing group because lookbehinds don't support alternation within non-capturing groups.
.*?: Non-greedy match of any character, as few times as possible.
(?=,|:|$): A lookahead for a comma, a colon, or the end of the string, without including them in the match.
Unlike using strsplit(), this means there is no information loss about whether the separator was a comma or a colon.
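If you would rather avoid lookarounds, here is a base R sketch of the same single-pass idea. It is a different technique from the stringr call above: it matches the tokens themselves with gregexpr() and swaps them in place with `regmatches<-`, so the separators are never touched. The `lookup` vector is a hypothetical excerpt of your dictionary, and `translate` is a helper name I made up that leaves tokens without a key unchanged:

```r
# Hypothetical excerpt of the dictionary as a named vector (label -> ID)
lookup <- c("1" = "1", "91bis" = "92", "92" = "94", "94" = "97", "96" = "100")

v <- c("1,91bis:94", "1,91bis,94", "1,91bis,96")

# Look each token up exactly once; keep tokens that have no key as they are
translate <- function(tok) {
  hit <- unname(lookup[tok])
  ifelse(is.na(hit), tok, hit)
}

# Match every run of characters that is not a comma or a colon
m <- gregexpr("[^,:]+", v)

# Replace the matched tokens in place; commas and colons stay where they were
res <- v
regmatches(res, m) <- lapply(regmatches(v, m), translate)
print(res)
```

Because every token is looked up once against the original string, a replaced "92" is never fed back in as a key.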