rdplyrbinningdiscretization

How to bin an integer column with group_by() |> tally() and show all bins, even if no values in them?


I want to bin an integer column and count how many values in each bin. I have a working solution, but it's limited as it wouldn't show bins that have no values in them.

Consider the hp column in mtcars. I want to bin it by intervals of 40, to get the desired output:

# desired output
# +────────+────────+────────+
# | lower  | upper  | count  |
# +────────+────────+────────+
# | 40     | 80     | 5      |
# | 80     | 120    | 10     |
# | 120    | 160    | 4      |
# | 160    | 200    | 6      |
# | 200    | 240    | 3      |
# | 240    | 280    | 3      |
# | 280    | 320    | 0      |
# | 320    | 360    | 1      |
# +────────+────────+────────+

So far, I have the current code:

library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)

mtcars |> 
  group_by(hp_interval = ggplot2::cut_interval(hp , length = 40, dig.lab = 10)) |> 
  tally() |> 
  separate(hp_interval, into = c("lower", "upper"), sep = ",") |> 
  mutate(across(c(lower, upper), ~str_extract(.x, "\\d+") |> as.numeric()))

#> # A tibble: 7 x 3
#>   lower upper     n
#>   <dbl> <dbl> <int>
#> 1    40    80     5
#> 2    80   120    10
#> 3   120   160     4
#> 4   160   200     6
#> 5   200   240     3
#> 6   240   280     3  \ ❌ there's one bin missing here! ❌
#> 7   320   360     1  / the bin of 280-320 has 0 values and was implicitly removed

Since ggplot2::cut_interval() is just a wrapper around cut(), I can pass arguments to base::cut.default via .... My question is whether there's a simple trick to force showing bins with 0 values.


Solution

  • You miss .drop = FALSE in group_by(). It determines whether the groups formed by factor levels that don't appear in the data should be dropped or not.

    mtcars |> 
      group_by(hp_interval = ggplot2::cut_interval(hp , length = 40, dig.lab = 10), .drop = FALSE) |> 
      tally() |>
      extract(hp_interval, into = c("lower", "upper"), regex = "(\\d+),(\\d+)", convert = TRUE)
    
    # A tibble: 8 × 3
      lower upper     n
      <int> <int> <int>
    1    40    80     5
    2    80   120    10
    3   120   160     4
    4   160   200     6
    5   200   240     3
    6   240   280     3
    7   280   320     0
    8   320   360     1