I have a large data frame, where I use combn to get the unique combinations between each pair of variables. So the output looks as follows ( with a small example of only 7 variables). The problem I have is that I wish to identify which features have more than say 50% of their correlations exceeding 0.5; however, the output from combn has the variable occuring in both columns. E.g. from the below Var2 occurs 5 times in total ( 4 in Variable_list1 and once in Variable_list2), but Var1 occurs 7 times in Variable_list1.
structure(list(Variable_list1 = c("Var1", "Var1", "Var1", "Var1",
"Var1", "Var1", "Var1", "Var2", "Var2", "Var2", "Var2", "Var3",
"Var3", "Var3", "Var3", "Var3", "Var4", "Var4", "Var4", "Var4",
"Var5", "Var5", "Var5", "Var6", "Var6", "Var7"), Variable_list2 = c("Var2",
"Var3", "Var4", "Var5", "Var6", "Var7", "Var8", "Var4", "Var5",
"Var6", "Var7", "Var4", "Var5", "Var6", "Var7", "Var8", "Var5",
"Var6", "Var7", "Var8", "Var6", "Var7", "Var8", "Var7", "Var8",
"Var8"), Coefficient = c("0.771428571428571", "0.839285714285714",
"0.839285714285714", "0.807142857142857", "0.775", "0.807142857142857",
"0.739285714285714", "0.964285714285714", "0.982142857142857",
"0.935714285714286", "0.953571428571429", "0.925", "0.946428571428571",
"0.957142857142857", "0.975", "0.921428571428571", "0.985714285714286",
"0.921428571428571", "0.935714285714286", "0.964285714285714",
"0.932142857142857", "0.957142857142857", "0.978571428571428",
"0.982142857142857", "0.95", "0.960714285714286")), row.names = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 9L, 10L, 11L, 12L, 14L, 15L, 16L, 17L,
18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L), class = "data.frame")
I'm not sure how to approach this ( I understand you can use permutations to get the full list of combinations where the feature is occuring in the same number in Variable_list1, but I eliminate some rows anyway, so regardless the feature is occuring in different numbers in list 1 and 2).
I can use table(data$Variable_list1) and table(data$Variable_list2) and look at the number of times each feature is occurring, considering those variables unique to each column and also where the variable occurs in both. But how would I then introduce this as a new column into the results data frame above, to then calculate which features exceed a correlation coefficient threshold in more than 50% of their correlations, where each feature has different numbers of total correlations?
Any advice would be appreciated.
EDIT: an potential (inefficient) approach.
You can make a character vector of unique vars
, and then use lapply()
over that vector (i.e. for each v
in vars
, returning a boolean value indicating whether or not the set of rows with v
in either Variable_list1
or Variable_list2
column has the number of coefficients that meet your criteria:
library(data.table)
setDT(df)
vars = unique(union(df$Variable_list1, df$Variable_list2))
vars[sapply(vars, \(v) df[Variable_list1==v | Variable_list2==v, sum(Coefficient>.5)/.N > .5])]
Output:
[1] "Var1" "Var2" "Var3" "Var4" "Var5" "Var6" "Var7" "Var8"
(In this case, all variables meet your criteria, so they are all returned)
Another way to set this up is through a helper function that returns a one-row data frame of the information your need, given the df, and variable name; then use bind_rows()
to bind the result together over each value of vars
:
library(dplyr)
f <- function(df,v) {
df %>%
filter(Variable_list1 == v | Variable_list2==v) %>%
reframe(correlations = n(), prop_gt_threshold = sum(Coefficient>0.5)/n()) %>%
mutate(variable = v)
}
bind_rows(lapply(vars, \(v) f(df, v)))
Output:
correlations prop_gt_threshold variable
1 7 1 Var1
2 5 1 Var2
3 6 1 Var3
4 7 1 Var4
5 7 1 Var5
6 7 1 Var6
7 7 1 Var7
8 6 1 Var8