I have a list of data.frames. Each data.frame within the list has different dimensions (different number of rows and columns). I would list to select the data.frame from the list that has the most rows AND the most columns (i.e. the biggest dimensions overall).
I would like a tidyverse solution and in particular am trying to use purrr::keep()
- however I am getting stuck as I can't figure out how to make the index corresponding to the maximum dimensions into a logical condition.
I can do it in two steps, but it seems a bit repetitive and wondering if there is any way to do this in a single line, all within the purrr::keep()
function. Speed/efficiency is important too as in my real data there would be almost a hundred different data.frames to select from.
Here is what I tried (2-step version):
# First get the size of the data.frame with the biggest dimensions in the list:
maxdim = 1:length(dflist) %>%
map_vec(~ reduce(dim(dflist[[.x]]), `*`)) %>%
max()
# Now select the data.frame from the list that matches maxdim:
dfinal = purrr::keep(dflist, ~ reduce(dim(.), `*`) == maxdim)[[1]]
Note I had to use reduce()
with dim()
as I couldn't find a function that would give me the dimensions of a data.frame as a single number (i.e. number of rows multiplied by number of columns). It seems reasonably fast but would also like to know if there is a dedicated function for this. I also tried length(unlist(df)
but don't know if that is any faster.
Here is some code for making the example list of data.frames:
# Create example data:
df <- data.frame(
id = c(1,2,3,4,5,6,7,8,9,10),
c1 = c("a", "a", "c", NA, "c", "d", "c", NA, "a", "b"),
c2 = c(25, NA, 17, 5, 50, 43, 21, 2, 1, NA),
c3 = c(NA, "s", "r", NA, "r", "i", NA, "r", NA, NA),
c4 = c(1.0, 5.3, 2.9, NA, 6.1, NA, 2.5, 4.3, 9.1, 2.4),
c5 = c(5, 6, NA, 3, 1, 6, 7, 8, 2, 1)
)
# Make a vector of columns to iteratively drop:
cols2drop <- c("c2", "c3", "c4", "c5")
# Create the list of data.frames (subsets of the original) of different sizes:
dflist = cols2drop %>%
map(~ df %>% select(1:.x) %>% drop_na())
dflist[[which.max(map_dbl(dflist, ~prod(dim(.x))))]]
id c1 c2
1 1 a 25
2 3 c 17
3 5 c 50
4 6 d 43
5 7 c 21
6 9 a 1
You could invoke log
and do summation instead of multiplication:
dflist[[which.max(colSums(log(sapply(dflist, dim))))]]