I'm working on automating toxicology read-in and import for unique case IDs from flat .csv files in R/RStudio. I've been asked to generate a spreadsheet that compares the differences between the manual toxicology process and the automated process.
I have a data.frame/tibble with a casenumber (the unique ID) column, a column containing the substances created during the automated read-in (auto.tox in the example), and a column containing the current, "production" toxicology substances that were created manually by someone reading and physically typing in substances that returned a result (production.tox in the example).
I need to compare the "auto.tox" column and the "production.tox" columns below to assess for precise string differences between each column in each row (for each casenumber). I believe example data will better help illustrate what I mean.
df <- tribble(
~casenumber, ~auto.tox, ~production.tox,
"2023-1", c("Gabapentin, Alprazolam, Ethanol"), c("Alprazolam, Gabapentin, Ethanol"),
"2023-2", c("Alprazolam, Fentanyl, Ethanol"), c("Fentanyl, Ethanol"),
"2023-3", c("Fentanyl, Norfentanyl, 4-ANPP, Acetyl Fentanyl"), c("Fentanyl, Norfentanyl"))
I would like to add a new column called "different_substances," that contains any substance(s) present in the "auto.tox" column that are not present in the "production.tox" column for each casenumber.
I've tried:
dplyr::anti_join()
with separate data.frames with substances by case and joining on casenumberlist(setdiff(auto.tox, production.tox))
mapply(vecsets::vsetdiff())
None of which gave me the precise output I'm looking for.
The desired output is:
casenumber different_substances
"2023-1" NA or "None" #(order doesn't matter)
"2023-2" "Alprazolam"
"2023-3" "4-ANPP, Acetyl Fentanyl"
I realize the last substance for 2023-3 should be spelled "acetylfentanyl" but our lookup table has a few misspellings, so the separate word is important, and it's important for "Fentanyl" alone to be recognized as different than "Acetyl Fentanyl".
I'm open to any general suggestions as well and also changing the format of the two substance columns to a more efficient format if needed.
If anyone can help, I'd greatly appreciate it. Thank you.
Here is one approach:
library(tidyverse)
# function to split the columns and remove whitespace
f_split = function(x) {
lapply(str_split(x, ","), str_trim)
}
## function to find chemicals not in one set
overlap = function(x, y) x[!x %in% y]
# apply f_split
auto_tox = f_split(df$auto.tox)
production_tox = f_split(df$production.tox)
df %>% mutate(diff = sapply(mapply(overlap, auto_tox, production_tox), paste0, collapse=", ")) %>%
mutate(diff = if_else(diff == "", "None", diff))
# A tibble: 3 × 5
casenumber auto.tox
<chr> <chr>
1 2023-1 Gabapentin, Alprazolam, Ethanol
2 2023-2 Alprazolam, Fentanyl, Ethanol
3 2023-3 Fentanyl, Norfentanyl, 4-ANPP, Acetyl Fentanyl
production.tox difff
<chr> <chr>
1 Alprazolam, Gabapentin, Ethanol ""
2 Fentanyl, Ethanol "Alprazolam"
3 Fentanyl, Norfentanyl "4-ANPP,Acetyl Fentanyl"
diff
<chr>
1 None
2 Alprazolam
3 4-ANPP,Acetyl Fentanyl