I would like to compare variable pairs across the entire data frame and create a flag variable that shows if they have the same value or not. Within my real dataset I have hundreds of these variable pairs.
The pairs share a name structure, e.g. var and sum_var; an example in the iris dataset is Sepal.Length and sum_Sepal.Length.
The end result should be an extra variable named var_flag (e.g. Sepal.Length_flag) for each pair, with value 1 where the two values match (all Petal flags) and 0 where they do not (all Sepal flags).
Any help would be appreciated, especially using the tidyverse. Thanks.
Data:
library(tidyverse)

# Build the example pairs: sum_Sepal.* never match, sum_Petal.* always match
local_iris <- iris %>%
  mutate(
    across(starts_with("Sepal"), ~ sum(.x, na.rm = TRUE), .names = "sum_{.col}"),
    across(starts_with("Petal"), ~ .x + 0, .names = "sum_{.col}")
  )
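local_iris |>
  mutate(across(starts_with("sum_"),
                # compare each sum_ column with its partner (name minus "sum_")
                \(x) +(x == pick(all_of(str_sub(cur_column(), 5)))[[1]]),
                # drop the "sum_" prefix so the result is e.g. Sepal.Length_flag
                .names = "{str_sub(.col, 5)}_flag"))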
How it works
- across() iterates over every column whose name starts with "sum_". That gives the most flexibility so you don't have to hardcode all the prefixes (e.g. "Sepal", "Petal", etc.).
- cur_column() gives you the string name of the current column you are iterating over in across(). We pass that to str_sub() to keep everything from the fifth character to the end. This effectively removes sum_, leaving just "Sepal.Length" for example.
- all_of() takes a vector of strings to look up a column name in the tidyverse. It is strict, so it will error if it cannot find that column name (as opposed to any_of()).
- We select the partner column with pick() and then extract ([[) that one column into a vector for the comparison (==). This comparison returns a logical vector that we convert into numeric with +().
- The new columns are named through the .names argument of across().
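To make the individual pieces above easier to see, here is a minimal sketch that runs each one on its own outside of across(). The object names (base_name, flags, wanted, lenient, strict_fails) and the made-up column name "not_a_column" are only for illustration and are not part of the answer.

```r
library(dplyr)
library(stringr)

# str_sub() with only a start position keeps everything from that character on
base_name <- str_sub("sum_Sepal.Length", 5)
base_name
# "Sepal.Length"

# Unary plus turns the logical result of == into integers
flags <- +(c(1, 2, 3) == c(1, 5, 3))
flags
# 1 0 1

# Hypothetical lookup vector: one real column, one that does not exist
wanted <- c("Sepal.Length", "not_a_column")

# any_of() quietly skips the missing name
lenient <- iris |> select(any_of(wanted)) |> names()

# all_of() is strict and raises an error for the missing name
strict_fails <- inherits(
  try(select(iris, all_of(wanted)), silent = TRUE),
  "try-error"
)
```

The strict behaviour is probably what you want with hundreds of pairs: a sum_ column without a partner stops the pipeline instead of silently producing nothing.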
Output
Only showing the structure of the first few rows:
'data.frame': 6 obs. of 13 variables:
$ Sepal.Length : num 5.1 4.9 4.7 4.6 5 5.4
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9
$ Petal.Length : num 1.4 1.4 1.3 1.5 1.4 1.7
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1
$ sum_Sepal.Length : num 876 876 876 876 876 ...
$ sum_Sepal.Width : num 459 459 459 459 459 ...
$ sum_Petal.Length : num 1.4 1.4 1.3 1.5 1.4 1.7
$ sum_Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4
$ Sepal.Length_flag: int 0 0 0 0 0 0
$ Sepal.Width_flag : int 0 0 0 0 0 0
$ Petal.Length_flag: int 1 1 1 1 1 1
$ Petal.Width_flag : int 1 1 1 1 1 1
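If you want to confirm the flags on the full data rather than eyeballing the first rows, something like the following sketch may help. It assumes dplyr 1.1.0 or later (for pick()) and uses a glue expression in .names to strip the "sum_" prefix so the columns come out as Sepal.Length_flag and so on. The names flagged and flag_means are my own.

```r
library(dplyr)
library(stringr)

# Rebuild the example data: Sepal sums never match, Petal copies always match
local_iris <- iris |>
  mutate(
    across(starts_with("Sepal"), \(x) sum(x, na.rm = TRUE), .names = "sum_{.col}"),
    across(starts_with("Petal"), \(x) x + 0, .names = "sum_{.col}")
  )

# Add one flag per pair, named after the original column
flagged <- local_iris |>
  mutate(across(starts_with("sum_"),
                \(x) +(x == pick(all_of(str_sub(cur_column(), 5)))[[1]]),
                .names = "{str_sub(.col, 5)}_flag"))

# Mean of each flag: 1 means every row matched, 0 means no row matched
flag_means <- flagged |>
  summarise(across(ends_with("_flag"), mean))

# Structure of the first few rows
str(head(flagged))
```

On the iris example, flag_means should be 0 for both Sepal flags and 1 for both Petal flags. Note that == gives NA when either value is NA, so on real data the flag would be NA for those rows unless you handle them separately.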