rdplyrsubsettbl

Subsetting rows and columns for both data.frame and tbl_df


Data.frame allows operations on column subsets using [ , dropping single column/row outputs to vectors by default. Dplyr does not allow this, deliberately (and seemingly because coding was an absolute nightmare).

df <- data.frame(a = c(1:5,NA), b = c(1,1,1,2,2,2))
mean(df[,"a"], na.rm = T) # 3

dftbl <- as.tbl(df)
mean(dftbl[,"a"], na.rm = T) # NA

Advice is therefore to subset with [[ as this will deliver uniform outputs for both dfs and tbl_dfs. But: that's fine for columns or rows only, but not for rows+columns, and concerningly this difference can be missed if you don't check the warnings (which is my own fault admittedly), e.g.:

dfresult <- mean(df[df$b == 2, "a"], na.rm = T) # 4.5
tblresult <- mean(dftbl[dftbl$b == 2, "a"], na.rm = T) # NA_real_

Does anyone have any 'best practice' suggestions for performing column operations on row subsets? Is this where I should improve my dplyr game using filter & select? My attempts thus far keep hitting walls. Grateful for any golden rules. Thanks in advance.

dftbl %>% filter(b == 2) %>% select(a) %>% mean(na.rm = T) #NA

This fails in the same way, with the filtered & selected data STILL being an N*1 tibble which refuses to play with mean.

dftbl %>% filter(b == 2) %>% select(a) %>% as.data.frame() %>% .$a
# [1]  4  5 NA

But

dftbl %>% filter(b == 2) %>% select(a) %>% as.data.frame() %>% mean(.$a, na.rm = T)
# [1] NA

Solution

  • The reason is because we need [[ instead of [ as with [ it is still a tibble with one column. The mean expects the input as a vector

    mean(dftbl[["a"]], na.rm = TRUE) 
    #[1] 3
    

    Or use $

    mean(dftb$a, na.rm = TRUE) 
    

    Regarding the second case, select also returns a tibble with the selected columns. Instead, we can use pull to extract as a vector

    dftbl[dftbl$b == 2, "a"] %>% 
        pull(1)
    #[1]  4  5 NA
    

    Or if we don't want to load any libraries, use unlist

    mean(unlist(dftbl[dftbl$b == 2, "a"]), na.rm = TRUE)
    #[1] 4.5
    

    For the code mentioned in the OP's post

    dftbl %>% 
        filter(b == 2) %>% 
        select(a)  %>%
         .$a %>%
         mean(., na.rm = TRUE)
    #[1] 4.5
    

    Or with pull

    dftbl %>%
        filter(b == 2) %>% 
        pull(a) %>%
        mean(na.rm = TRUE)
    #[1] 4.5