r dplyr across

Compare variable pairs across dataframe in R


I would like to compare variable pairs across the entire data frame and create a flag variable that shows if they have the same value or not. Within my real dataset I have hundreds of these variable pairs.

The pairs share a naming structure, e.g. var and sum_var; in the iris example below, one such pair is Sepal.Length and sum_Sepal.Length.

The end result should be one extra variable per pair, named var_flag (e.g. Sepal.Length_flag), with value 1 where the two variables match and 0 where they do not. In the example data, all Petal flags should be 1 and all Sepal flags should be 0.

Any help would be appreciated, especially using the tidyverse. Thanks.

Data:

library(tidyverse)

local_iris <- iris %>% 
  mutate(
    across(starts_with("Sepal"), ~ sum(.x, na.rm = TRUE), .names = "sum_{.col}"),
    across(starts_with("Petal"), ~ .x + 0, .names = "sum_{.col}")
  )


Solution

    local_iris |>
      mutate(across(starts_with("sum_"), 
                    \(x) +(x == pick(all_of(str_sub(cur_column(), 5)))[[1]]),
                    .names = "{str_sub(.col, 5)}_flag"))
    

    How it works

    1. We want to iterate over all columns that start with "sum_". That gives the most flexibility so you don't have to hardcode all the prefixes (e.g. "Sepal", "Petal", etc.).
    2. cur_column() gives the name, as a string, of the column currently being processed in across(). str_sub(cur_column(), 5) keeps everything from the fifth character to the end, which removes the four-character prefix sum_ and leaves, for example, "Sepal.Length".
    3. all_of() selects columns by a character vector of names. It is strict: it errors if a name is not found, whereas any_of() silently skips missing names.
    4. With that name, pick() returns a one-column tibble, and [[1]] extracts that column as a vector for the == comparison. The comparison returns a logical vector, which the unary + converts to integer (TRUE becomes 1, FALSE becomes 0).
    5. Lastly, across() stores that integer vector in a new column, named according to the glue specification passed to the .names argument.
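
    The steps above can be traced on a small toy tibble. This is only a sketch: the columns a, b, sum_a and sum_b are made up for illustration and are not part of the question's data. The .names spec here strips the prefix with str_sub(.col, 5) so that the new columns come out as a_flag and b_flag, following the var_flag naming the question asks for.

    ```r
    library(dplyr)
    library(stringr)

    # Hypothetical toy data: one pair that always matches, one that never does
    toy <- tibble(
      a     = c(1, 2, 3),
      sum_a = c(1, 2, 3),    # identical to a in every row
      b     = c(4, 5, 6),
      sum_b = c(15, 15, 15)  # column total of b, so it never equals b
    )

    # Step 2 by hand: drop the 4-character "sum_" prefix to get the partner name
    partner <- str_sub("sum_b", 5)

    # Steps 3-4 by hand for one pair: compare, then coerce TRUE/FALSE to 1/0
    manual_flag <- +(toy[["sum_b"]] == toy[[partner]])

    # The same steps inside across(), applied to every "sum_" column at once
    flagged <- toy |>
      mutate(across(starts_with("sum_"),
                    \(x) +(x == pick(all_of(str_sub(cur_column(), 5)))[[1]]),
                    .names = "{str_sub(.col, 5)}_flag"))
    ```

    Doing one pair by hand first makes it easier to see that across() is repeating the same comparison, with cur_column() standing in for the hardcoded "sum_b".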

    Output

    Only showing the structure of the first few rows:

    'data.frame':   6 obs. of  13 variables:
     $ Sepal.Length     : num  5.1 4.9 4.7 4.6 5 5.4
     $ Sepal.Width      : num  3.5 3 3.2 3.1 3.6 3.9
     $ Petal.Length     : num  1.4 1.4 1.3 1.5 1.4 1.7
     $ Petal.Width      : num  0.2 0.2 0.2 0.2 0.2 0.4
     $ Species          : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1
     $ sum_Sepal.Length : num  876 876 876 876 876 ...
     $ sum_Sepal.Width  : num  459 459 459 459 459 ...
     $ sum_Petal.Length : num  1.4 1.4 1.3 1.5 1.4 1.7
     $ sum_Petal.Width  : num  0.2 0.2 0.2 0.2 0.2 0.4
     $ Sepal.Length_flag: int  0 0 0 0 0 0
     $ Sepal.Width_flag : int  0 0 0 0 0 0
     $ Petal.Length_flag: int  1 1 1 1 1 1
     $ Petal.Width_flag : int  1 1 1 1 1 1
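
    A listing like the one above can be reproduced end to end with the sketch below. It assumes the flag columns are named by stripping the sum_ prefix inside the .names spec, and the object names flagged and flag_totals are introduced here only for illustration.

    ```r
    library(dplyr)
    library(stringr)

    # Rebuild the question's example data from the built-in iris dataset
    local_iris <- iris |>
      mutate(
        across(starts_with("Sepal"), \(x) sum(x, na.rm = TRUE), .names = "sum_{.col}"),
        across(starts_with("Petal"), \(x) x + 0, .names = "sum_{.col}")
      )

    # Add one 0/1 flag per var / sum_var pair
    flagged <- local_iris |>
      mutate(across(starts_with("sum_"),
                    \(x) +(x == pick(all_of(str_sub(cur_column(), 5)))[[1]]),
                    .names = "{str_sub(.col, 5)}_flag"))

    # Structure of the first six rows only
    flagged |> head() |> str()

    # Sanity check over all rows: number of matching rows per pair
    flag_totals <- flagged |>
      summarise(across(ends_with("_flag"), sum))
    ```

    Summing each flag column is a quick way to check every row rather than just the first six: a total equal to nrow(flagged) means the pair matches everywhere, and a total of 0 means it never matches.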