rdataframedplyrdatatablemapping

using a "lookup" dataframe to map value pairings in R


I have a dataframe df of patient visits structured in the following format:

ID visit_number variable value
34 1 height short
34 1 weight over
34 1 eye_color brown
34 1 hair_color brown
89 1 weight normal
89 1 height short
89 2 height short
89 2 weight over
df <- data.frame(ID = c(34, 34, 34, 34, 89, 89, 89, 89), visit_number = c(1, 1, 1, 1, 1, 1, 2, 2), variable = c("height", "weight", "eye_color", "hair_color", "weight", "height", "height", "weight"), value = c("short", "over", "brown", "brown", "normal", "short", "short", "over"))

So for visit #1, patient 34 had their height, weight, eye, and hair color taken down. Patient 89 had two visits recording their height and weight.

I have a mapping file that shows a pairing of variables and values that results in a new value that I would like to pull for each relevant pairing:

var_1 val_1 var_2 val_2 new_var_name new_val
height short weight normal health average
height short weight over health warn
eye_color brown hair_color brown complexion monochrome
eye_color blue hair_color blonde complexion contrast
mapping <- data.frame( var_1= c("height", "height", "eye_color", "eye_color"), val_1 = c("short", "short", "brown", "blue"), var_2 = c("weight", "weight", "hair_color", "hair_color"), val_2 = c("normal", "over", "brown", "blonde"), new_var_name = c("health", "health", "complexion", "complexion"), new_val = c("average", "warn", "monochrome", "contrast"))

So this mapping file shows that if a patient's height is short and their weight is normal, I would like to create a health variable set to "average". If their height is short and their weight is over, I would like this health indicator to be "warn", and so on as the other pairings in the subsequent rows show.

What would be the best way to go about doing this? We can assume that all pairings will be covered in the mapping dataframe and we can also assume that there will not be incomplete pairings in the data.


Solution

  • This is a sort of brute force method that might run into trouble if you have more than dozens of variables, but I expect should work fine in many situations.

    First, I expand the data so every row is matched with every other row in the same patient's visit. That could get big if there are many variables. Then those pairings are matched with the mapping table, where only the labeled pairings are output.

    This could be made twice as memory-efficient if we could ensure that the mapping variables were always sorted alphabetically, with var_1 < var_2. Then we could limit the first join to those pairings where variable.x < variable.y, resulting in half as many combinations to look up for the 2nd join.

    library(dplyr)
    df |>
      inner_join(df, join_by(ID, visit_number), relationship = "many-to-many") |>
      inner_join(mapping, join_by(variable.x == `var_1`, value.x == `val_1`,
                                 variable.y == `var_2`, value.y == `val_2`)) 
    

    Result

      ID visit_number variable.x value.x variable.y value.y new_var_name    new_val
    1 34            1     height   short     weight    over       health       warn
    2 34            1  eye_color   brown hair_color   brown   complexion monochrome
    3 89            1     height   short     weight  normal       health    average
    4 89            2     height   short     weight    over       health       warn