I have a dataframe df of patient visits structured in the following format:
ID | visit_number | variable | value |
---|---|---|---|
34 | 1 | height | short |
34 | 1 | weight | over |
34 | 1 | eye_color | brown |
34 | 1 | hair_color | brown |
89 | 1 | weight | normal |
89 | 1 | height | short |
89 | 2 | height | short |
89 | 2 | weight | over |
df <- data.frame(ID = c(34, 34, 34, 34, 89, 89, 89, 89), visit_number = c(1, 1, 1, 1, 1, 1, 2, 2), variable = c("height", "weight", "eye_color", "hair_color", "weight", "height", "height", "weight"), value = c("short", "over", "brown", "brown", "normal", "short", "short", "over"))
So for visit #1, patient 34 had their height, weight, eye, and hair color taken down. Patient 89 had two visits recording their height and weight.
I have a mapping file that shows a pairing of variables and values that results in a new value that I would like to pull for each relevant pairing:
var_1 | val_1 | var_2 | val_2 | new_var_name | new_val |
---|---|---|---|---|---|
height | short | weight | normal | health | average |
height | short | weight | over | health | warn |
eye_color | brown | hair_color | brown | complexion | monochrome |
eye_color | blue | hair_color | blonde | complexion | contrast |
mapping <- data.frame( var_1= c("height", "height", "eye_color", "eye_color"), val_1 = c("short", "short", "brown", "blue"), var_2 = c("weight", "weight", "hair_color", "hair_color"), val_2 = c("normal", "over", "brown", "blonde"), new_var_name = c("health", "health", "complexion", "complexion"), new_val = c("average", "warn", "monochrome", "contrast"))
So this mapping file shows that if a patient's height is short and their weight is normal, I would like to create a health variable set to "average". If their height is short and their weight is over, I would like this health indicator to be "warn", and so on as the other pairings in the subsequent rows show.
What would be the best way to go about doing this? We can assume that all pairings will be covered in the mapping dataframe and we can also assume that there will not be incomplete pairings in the data.
This is a sort of brute force method that might run into trouble if you have more than dozens of variables, but I expect should work fine in many situations.
First, I expand the data so every row is matched with every other row in the same patient's visit. That could get big if there are many variables. Then those pairings are matched with the mapping table, where only the labeled pairings are output.
This could be made twice as memory-efficient if we could ensure that the mapping variables were always sorted alphabetically, with var_1 < var_2. Then we could limit the first join to those pairings where variable.x < variable.y, resulting in half as many combinations to look up for the 2nd join.
library(dplyr)
df |>
inner_join(df, join_by(ID, visit_number), relationship = "many-to-many") |>
inner_join(mapping, join_by(variable.x == `var_1`, value.x == `val_1`,
variable.y == `var_2`, value.y == `val_2`))
Result
ID visit_number variable.x value.x variable.y value.y new_var_name new_val
1 34 1 height short weight over health warn
2 34 1 eye_color brown hair_color brown complexion monochrome
3 89 1 height short weight normal health average
4 89 2 height short weight over health warn