I want to bin an integer column and count how many values in each bin. I have a working solution, but it's limited as it wouldn't show bins that have no values in them.
Consider the hp
column in mtcars
. I want to bin it by intervals of 40
, to get the desired output:
# desired output
# +────────+────────+────────+
# | lower | upper | count |
# +────────+────────+────────+
# | 40 | 80 | 5 |
# | 80 | 120 | 10 |
# | 120 | 160 | 4 |
# | 160 | 200 | 6 |
# | 200 | 240 | 3 |
# | 240 | 280 | 3 |
# | 280 | 320 | 0 |
# | 320 | 360 | 1 |
# +────────+────────+────────+
So far, I have the current code:
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
mtcars |>
group_by(hp_interval = ggplot2::cut_interval(hp , length = 40, dig.lab = 10)) |>
tally() |>
separate(hp_interval, into = c("lower", "upper"), sep = ",") |>
mutate(across(c(lower, upper), ~str_extract(.x, "\\d+") |> as.numeric()))
#> # A tibble: 7 x 3
#> lower upper n
#> <dbl> <dbl> <int>
#> 1 40 80 5
#> 2 80 120 10
#> 3 120 160 4
#> 4 160 200 6
#> 5 200 240 3
#> 6 240 280 3 \ ❌ there's one bin missing here! ❌
#> 7 320 360 1 / the bin of 280-320 has 0 values and was implicitly removed
Since ggplot2::cut_interval()
is just a wrapper around cut()
, I can pass arguments to base::cut.default
via ...
. My question is whether there's a simple trick to force showing bins with 0
values.
You miss .drop = FALSE
in group_by()
. It determines whether the groups formed by factor levels that don't appear in the data should be dropped or not.
mtcars |>
group_by(hp_interval = ggplot2::cut_interval(hp , length = 40, dig.lab = 10), .drop = FALSE) |>
tally() |>
extract(hp_interval, into = c("lower", "upper"), regex = "(\\d+),(\\d+)", convert = TRUE)
# A tibble: 8 × 3
lower upper n
<int> <int> <int>
1 40 80 5
2 80 120 10
3 120 160 4
4 160 200 6
5 200 240 3
6 240 280 3
7 280 320 0
8 320 360 1