Assume the following data frame with 20 rows and 3 columns:
set.seed(123)
df1 <- data.frame(
  x = sample(1:6, 20, replace = TRUE),
  y = sample(1:6, 20, replace = TRUE),
  z = sample(1:6, 20, replace = TRUE)
)
df1
x y z
1 3 1 4
2 6 5 5
3 3 3 5
4 2 2 3
5 2 2 6
6 6 1 1
7 3 6 2
8 5 3 5
9 4 4 5
10 6 6 4
11 6 1 5
12 1 3 2
13 2 5 1
14 3 4 1
15 5 2 3
16 3 5 1
17 3 1 6
18 1 1 5
19 4 2 1
20 1 3 2
I want to write a function that identifies columns in which a specified scalar or vector occurs at or above a specified threshold. I then hope to use that function with select(where()) from the dplyr package to remove the columns that satisfy the criterion.
For the given data frame, the scalar 6 appears four times in column "x" (a proportion of 0.2) and twice in each of columns "y" and "z" (0.1). If the specified scalar is 6 and the threshold is at least 0.2, the function will return TRUE for "x" and FALSE for "y" and "z". Column "x" will then be removed, leaving:
y z
1 1 4
2 5 5
3 3 5
4 2 3
5 2 6
6 1 1
7 6 2
8 3 5
9 4 5
10 6 4
11 1 5
12 3 2
13 5 1
14 4 1
15 2 3
16 5 1
17 1 6
18 1 5
19 2 1
20 3 2
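The predicate described above can be sketched as a small helper. The name occurs_at_least is my own, and treating a vector of values as pooled via %in% (rather than testing each value separately) is an assumption about the requirement:

```r
# Sketch of the predicate described above. The name `occurs_at_least`
# is mine; for a vector of `values` it pools their occurrences with
# %in%, which is one reading of the requirement.
occurs_at_least <- function(x, values, threshold) {
  mean(x %in% values) >= threshold
}

occurs_at_least(c(6, 6, 1, 2, 3), 6, 0.4)  # TRUE:  6 is 2/5 = 0.4
occurs_at_least(c(6, 1, 1, 2, 3), 6, 0.4)  # FALSE: 6 is 1/5 = 0.2
```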
I couldn't find an R-specific thread on this, but there are a couple of Python threads dealing with a similar issue, namely How to drop a column if there is more than 55% repeated values in the column?
I attempted to convert the code, but it doesn't seem to yield the desired result: df_count returns a strange count that I don't understand.
drop_columns <- c()
drop_threshold <- 0.2

for (cols in colnames(df1)) {
  df_count <- as.data.frame(table(df1[[cols]]))
  colnames(df_count) <- c("value", "count")
  df_count$drop_percentage <- df_count$count / nrow(df1)
  df_count$drop_criterion <- df_count$drop_percentage >= drop_threshold
  if (any(df_count$drop_criterion)) {
    drop_columns <- c(drop_columns, cols)
  }
}
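One reading of what goes wrong (my interpretation, not from the original thread): table() tabulates every distinct value, so df_count holds per-value frequencies and drop_criterion fires whenever any value reaches the threshold, not just the scalar of interest. A version restricted to the target value is sketched below; the data frame is hard-coded from the printout above so the snippet runs standalone:

```r
# df1 hard-coded from the printed output above, so this runs standalone
df1 <- data.frame(
  x = c(3, 6, 3, 2, 2, 6, 3, 5, 4, 6, 6, 1, 2, 3, 5, 3, 3, 1, 4, 1),
  y = c(1, 5, 3, 2, 2, 1, 6, 3, 4, 6, 1, 3, 5, 4, 2, 5, 1, 1, 2, 3),
  z = c(4, 5, 5, 3, 6, 1, 2, 5, 5, 4, 5, 2, 1, 1, 3, 1, 6, 5, 1, 2)
)

drop_columns <- c()
drop_threshold <- 0.2
target <- 6  # only this value's frequency should trigger a drop

for (cols in colnames(df1)) {
  # proportion of the target value in this column only
  if (mean(df1[[cols]] == target) >= drop_threshold) {
    drop_columns <- c(drop_columns, cols)
  }
}
drop_columns  # "x"
```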
Based on all previous answers, the dplyr option is:
library(dplyr)
df1 %>%
  select(where(~ mean(.x == 6) < 0.2))
y z
1 1 4
2 5 5
3 3 5
4 2 3
5 2 6
6 1 1
7 6 2
8 3 5
9 4 5
10 6 4
11 1 5
12 3 2
13 5 1
14 4 1
15 2 3
16 5 1
17 1 6
18 1 5
19 2 1
20 3 2
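The one-liner above can also be wrapped into the reusable function the question asks for. The name drop_frequent is my own, and extending it to a vector of values by pooling their occurrences with %in% is an assumption; the data frame is again hard-coded from the printout so the snippet stands alone:

```r
library(dplyr)

# Name and vector semantics (values pooled via %in%) are my own choices
drop_frequent <- function(data, values, threshold) {
  select(data, where(~ mean(.x %in% values) < threshold))
}

# df1 hard-coded from the printed output above
df1 <- data.frame(
  x = c(3, 6, 3, 2, 2, 6, 3, 5, 4, 6, 6, 1, 2, 3, 5, 3, 3, 1, 4, 1),
  y = c(1, 5, 3, 2, 2, 1, 6, 3, 4, 6, 1, 3, 5, 4, 2, 5, 1, 1, 2, 3),
  z = c(4, 5, 5, 3, 6, 1, 2, 5, 5, 4, 5, 2, 1, 1, 3, 1, 6, 5, 1, 2)
)

res <- drop_frequent(df1, 6, 0.2)
names(res)  # "y" "z"
```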