rdplyrdata.tablecorrelationcombn

How would I find the number of correlations exceeding a threshold in more than 50% of that variables correlations using the output from combn?


I have a large data frame, where I use combn to get the unique combinations between each pair of variables. So the output looks as follows ( with a small example of only 7 variables). The problem I have is that I wish to identify which features have more than say 50% of their correlations exceeding 0.5; however, the output from combn has the variable occuring in both columns. E.g. from the below Var2 occurs 5 times in total ( 4 in Variable_list1 and once in Variable_list2), but Var1 occurs 7 times in Variable_list1.

structure(list(Variable_list1 = c("Var1", "Var1", "Var1", "Var1", 
 "Var1", "Var1", "Var1", "Var2", "Var2", "Var2", "Var2", "Var3", 
"Var3", "Var3", "Var3", "Var3", "Var4", "Var4", "Var4", "Var4", 
"Var5", "Var5", "Var5", "Var6", "Var6", "Var7"), Variable_list2 =     c("Var2", 
  "Var3", "Var4", "Var5", "Var6", "Var7", "Var8", "Var4", "Var5", 
 "Var6", "Var7", "Var4", "Var5", "Var6", "Var7", "Var8", "Var5", 
  "Var6", "Var7", "Var8", "Var6", "Var7", "Var8", "Var7", "Var8", 
 "Var8"), Coefficient = c("0.771428571428571", "0.839285714285714", 
 "0.839285714285714", "0.807142857142857", "0.775",      "0.807142857142857", 
 "0.739285714285714", "0.964285714285714", "0.982142857142857", 
 "0.935714285714286", "0.953571428571429", "0.925",     "0.946428571428571", 
"0.957142857142857", "0.975", "0.921428571428571", "0.985714285714286", 
"0.921428571428571", "0.935714285714286", "0.964285714285714", 
 "0.932142857142857", "0.957142857142857", "0.978571428571428", 
 "0.982142857142857", "0.95", "0.960714285714286")), row.names = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 9L, 10L, 11L, 12L, 14L, 15L, 16L, 17L, 
18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L), class =     "data.frame")

I'm not sure how to approach this ( I understand you can use permutations to get the full list of combinations where the feature is occuring in the same number in Variable_list1, but I eliminate some rows anyway, so regardless the feature is occuring in different numbers in list 1 and 2).

I can use table(data$Variable_list1) and table(data$Variable_list2) and look at the number of times each feature is occurring, considering those variables unique to each column and also where the variable occurs in both. But how would I then introduce this as a new column into the results data frame above, to then calculate which features exceed a correlation coefficient threshold in more than 50% of their correlations, where each feature has different numbers of total correlations?

Any advice would be appreciated.

EDIT: an potential (inefficient) approach.

  1. subsection out the correlations exceeding threshold, 2)count the number of times the variable occurs across both columns of variables and save, 3)merge this with the number of total correlations these features have and get percentage, 4)keep the features > 50%

Solution

  • You can make a character vector of unique vars, and then use lapply() over that vector (i.e. for each v in vars, returning a boolean value indicating whether or not the set of rows with v in either Variable_list1 or Variable_list2 column has the number of coefficients that meet your criteria:

    library(data.table)
    setDT(df)
    vars = unique(union(df$Variable_list1, df$Variable_list2))
    vars[sapply(vars, \(v) df[Variable_list1==v | Variable_list2==v, sum(Coefficient>.5)/.N > .5])]
    

    Output:

    [1] "Var1" "Var2" "Var3" "Var4" "Var5" "Var6" "Var7" "Var8"
    

    (In this case, all variables meet your criteria, so they are all returned)

    Another way to set this up is through a helper function that returns a one-row data frame of the information your need, given the df, and variable name; then use bind_rows() to bind the result together over each value of vars:

    library(dplyr)
    
    f <- function(df,v) {
      df %>%
        filter(Variable_list1 == v | Variable_list2==v) %>% 
        reframe(correlations = n(), prop_gt_threshold = sum(Coefficient>0.5)/n()) %>% 
        mutate(variable = v)
    }
    
    bind_rows(lapply(vars, \(v) f(df, v)))
    

    Output:

      correlations prop_gt_threshold variable
    1            7                 1     Var1
    2            5                 1     Var2
    3            6                 1     Var3
    4            7                 1     Var4
    5            7                 1     Var5
    6            7                 1     Var6
    7            7                 1     Var7
    8            6                 1     Var8