riteration

R data frame remove columns where a specified value(s) is at or above a threshold


Assume the following data frame with 20 rows and 3 columns:

set.seed(123)
df1 <- 
data.frame(
  x = sample(1:6, 20, replace = T),
  y = sample(1:6, 20, replace = T),
  z = sample(1:6, 20, replace = T)
)

df1

   x y z
1  3 1 4
2  6 5 5
3  3 3 5
4  2 2 3
5  2 2 6
6  6 1 1
7  3 6 2
8  5 3 5
9  4 4 5
10 6 6 4
11 6 1 5
12 1 3 2
13 2 5 1
14 3 4 1
15 5 2 3
16 3 5 1
17 3 1 6
18 1 1 5
19 4 2 1
20 1 3 2

I want to write a function that will identify columns where a specified scalar or vector occurs at or above a specified threshold. Then, I hope to use that function with select(where()) from the dplyr package to remove columns that satisfy the criteria.

For the given data frame, the scalar 6 appears four times in column "x" (or 0.2) and twice in columns "y" and "z" (0.1). If the specified scalar is 6 and the threshold is at least 0.2, it will return TRUE for "x" and FALSE for "y" and "z". Column "x" will then be removed, leaving:

    y z
1   1 4
2   5 5
3   3 5
4   2 3
5   2 6
6   1 1
7   6 2
8   3 5
9   4 5
10  6 4
11  1 5
12  3 2
13  5 1
14  4 1
15  2 3
16  5 1
17  1 6
18  1 5
19  2 1
20  3 2

I couldn't find an R-specific thread on this, but there are a couple of Python threads dealing with a similar issue, namely How to drop a column if there is more than 55% repeated values in the column?

I attempted to convert the code, but it doesn't seem to yield the desired result, with df_count returning a strange count that I don't understand.

drop_columns    <- c()
drop_threshold  <- 0.2

for (cols in colnames(df1)) {
  df_count                 <- as.data.frame(table(df1[[cols]]))
  colnames(df_count)       <- c("value", "count")
  df_count$drop_percentage <- df_count$count / nrow(df1)
  df_count$drop_criterion  <- df_count$drop_percentage >= drop_threshold
  
  if (any(df_count$drop_criterion)) {
    drop_columns <- c(drop_columns, cols)
  }
}

Solution

  • Based on all previous answers, the dplyr option is:

    library(dplyr)
    
    df1 %>%
      select(where(~ mean(.x == 6) < 0.2))
    
       y z
    1  1 4
    2  5 5
    3  3 5
    4  2 3
    5  2 6
    6  1 1
    7  6 2
    8  3 5
    9  4 5
    10 6 4
    11 1 5
    12 3 2
    13 5 1
    14 4 1
    15 2 3
    16 5 1
    17 1 6
    18 1 5
    19 2 1
    20 3 2