Assume the following data frame with 20 rows and 3 columns:
set.seed(123)
df1 <- data.frame(
  x = sample(1:6, 20, replace = TRUE),
  y = sample(1:6, 20, replace = TRUE),
  z = sample(1:6, 20, replace = TRUE)
)
df1
x y z
1 3 1 4
2 6 5 5
3 3 3 5
4 2 2 3
5 2 2 6
6 6 1 1
7 3 6 2
8 5 3 5
9 4 4 5
10 6 6 4
11 6 1 5
12 1 3 2
13 2 5 1
14 3 4 1
15 5 2 3
16 3 5 1
17 3 1 6
18 1 1 5
19 4 2 1
20 1 3 2
I want to write a function that identifies columns in which a specified scalar or vector occurs at or above a specified threshold. I then hope to use that function with select(where()) from the dplyr package to remove the columns that satisfy the criterion.
For the given data frame, the scalar 6 appears four times in column "x" (a proportion of 0.2) and twice in each of columns "y" and "z" (0.1). If the specified scalar is 6 and the threshold is at least 0.2, the function will return TRUE for "x" and FALSE for "y" and "z". Column "x" will then be removed, leaving:
y z
1 1 4
2 5 5
3 3 5
4 2 3
5 2 6
6 1 1
7 6 2
8 3 5
9 4 5
10 6 4
11 1 5
12 3 2
13 5 1
14 4 1
15 2 3
16 5 1
17 1 6
18 1 5
19 2 1
20 3 2
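The predicate described above can be sketched as a small helper. The name occurs_at_least is my own, and treating a vector of values as pooled via %in% (rather than testing each value separately) is an assumption about the requirement:

```r
# Sketch of the predicate described above. The name `occurs_at_least`
# is mine; for a vector of `values` it pools their occurrences with
# %in%, which is one reading of the requirement.
occurs_at_least <- function(x, values, threshold) {
  mean(x %in% values) >= threshold
}

occurs_at_least(c(6, 6, 1, 2, 3), 6, 0.4)  # TRUE:  6 is 2/5 = 0.4
occurs_at_least(c(6, 1, 1, 2, 3), 6, 0.4)  # FALSE: 6 is 1/5 = 0.2
```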
I couldn't find an R-specific thread on this, but there are a couple of Python threads dealing with a similar issue, namely How to drop a column if there is more than 55% repeated values in the column?
I attempted to convert the code, but it doesn't seem to yield the desired result: df_count returns a strange count that I don't understand.
drop_columns <- c()
drop_threshold <- 0.2

for (cols in colnames(df1)) {
  df_count <- as.data.frame(table(df1[[cols]]))
  colnames(df_count) <- c("value", "count")
  df_count$drop_percentage <- df_count$count / nrow(df1)
  df_count$drop_criterion <- df_count$drop_percentage >= drop_threshold
  if (any(df_count$drop_criterion)) {
    drop_columns <- c(drop_columns, cols)
  }
}
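One reading of what goes wrong (my interpretation, not from the original thread): table() tabulates every distinct value, so df_count holds per-value frequencies and drop_criterion fires whenever any value reaches the threshold, not just the scalar of interest. A version restricted to the target value is sketched below; the data frame is hard-coded from the printout above so the snippet runs standalone:

```r
# df1 hard-coded from the printed output above, so this runs standalone
df1 <- data.frame(
  x = c(3, 6, 3, 2, 2, 6, 3, 5, 4, 6, 6, 1, 2, 3, 5, 3, 3, 1, 4, 1),
  y = c(1, 5, 3, 2, 2, 1, 6, 3, 4, 6, 1, 3, 5, 4, 2, 5, 1, 1, 2, 3),
  z = c(4, 5, 5, 3, 6, 1, 2, 5, 5, 4, 5, 2, 1, 1, 3, 1, 6, 5, 1, 2)
)

drop_columns <- c()
drop_threshold <- 0.2
target <- 6  # only this value's frequency should trigger a drop

for (cols in colnames(df1)) {
  # proportion of the target value in this column only
  if (mean(df1[[cols]] == target) >= drop_threshold) {
    drop_columns <- c(drop_columns, cols)
  }
}
drop_columns  # "x"
```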
Based on all previous answers, the dplyr option is:
library(dplyr)
df1 %>%
  select(where(~ mean(.x == 6) < 0.2))
y z
1 1 4
2 5 5
3 3 5
4 2 3
5 2 6
6 1 1
7 6 2
8 3 5
9 4 5
10 6 4
11 1 5
12 3 2
13 5 1
14 4 1
15 2 3
16 5 1
17 1 6
18 1 5
19 2 1
20 3 2
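The one-liner above can also be wrapped into the reusable function the question asks for. The name drop_frequent is my own, and extending it to a vector of values by pooling their occurrences with %in% is an assumption; the data frame is again hard-coded from the printout so the snippet stands alone:

```r
library(dplyr)

# Name and vector semantics (values pooled via %in%) are my own choices
drop_frequent <- function(data, values, threshold) {
  select(data, where(~ mean(.x %in% values) < threshold))
}

# df1 hard-coded from the printed output above
df1 <- data.frame(
  x = c(3, 6, 3, 2, 2, 6, 3, 5, 4, 6, 6, 1, 2, 3, 5, 3, 3, 1, 4, 1),
  y = c(1, 5, 3, 2, 2, 1, 6, 3, 4, 6, 1, 3, 5, 4, 2, 5, 1, 1, 2, 3),
  z = c(4, 5, 5, 3, 6, 1, 2, 5, 5, 4, 5, 2, 1, 1, 3, 1, 6, 5, 1, 2)
)

res <- drop_frequent(df1, 6, 0.2)
names(res)  # "y" "z"
```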