
How do the statistical transformations differ between these two geom_histogram calls?


I don't understand how the statistical transformations differ between plot1 and plot2.

library(ggplot2)

plot1 <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(stat = "density")

plot2 <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(aes(y = after_stat(density)))

I tried to compare what the y axes mean in the two plots. I think plot1 is similar to geom_density, while plot2 first computes the density of hwy and then counts within the bins. Is my understanding correct?


Solution

  • Right, plot2 keeps the default behavior of geom_histogram: the data is split into 30 bins and counted per bin, and after_stat(density) then rescales those counts. We can use ggplot2::layer_data() to see the calculations it performs.

    For this data, the default bins come out 1.103 units wide. Since hwy takes only integer values, most bins cover a single hwy value, but a few, like the 14th, cover two.

    The 14th bin spans from an xmin of 25.93 to an xmax of 27.03, so it includes every observation with hwy equal to 26 or 27. That's 46 (19.7%) of the 234 observations, but since each bin is 1.103 wide, the calculated height of that bin is 46 / 234 / 1.103 = 0.178. Scaling every bin this way makes the total area of the bars equal to 1.
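
    The arithmetic for the 14th bin can be reproduced by hand. This is only a sketch: the xmin and xmax values are copied from the layer_data() output in this answer rather than recomputed, and the variable names are made up for illustration.

    ```r
    library(ggplot2)  # only needed here for the mpg data set

    # Bin edges of the 14th bin, copied from the layer_data() output (assumption)
    xmin <- 25.93103
    xmax <- 27.03448

    n_total  <- nrow(mpg)                               # all observations
    n_in_bin <- sum(mpg$hwy > xmin & mpg$hwy <= xmax)   # hwy of 26 or 27
    width    <- xmax - xmin                             # bin width

    # density = share of observations in the bin, divided by the bin width
    density_14 <- n_in_bin / n_total / width

    round(c(count = n_in_bin, share = n_in_bin / n_total,
            width = width, density = density_14), 3)
    ```

    The printed count and density should line up with the 46 and roughly 0.178 quoted above.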

    library(ggplot2)
    # Show the data frame that the histogram layer computes for each bin
    layer_data(ggplot(mpg, aes(x = hwy)) +
                 geom_histogram(aes(y = after_stat(density))))
    
    
    `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
                 y count        x     xmin     xmax     density     ncount   ndensity flipped_aes PANEL group ymin        ymax colour   fill linewidth linetype
    1  0.019364316     5 12.13793 11.58621 12.68966 0.019364316 0.10869565 0.10869565       FALSE     1    -1    0 0.019364316     NA grey35       0.5        1
    2  0.000000000     0 13.24138 12.68966 13.79310 0.000000000 0.00000000 0.00000000       FALSE     1    -1    0 0.000000000     NA grey35       0.5        1
    3  0.007745726     2 14.34483 13.79310 14.89655 0.007745726 0.04347826 0.04347826       FALSE     1    -1    0 0.007745726     NA grey35       0.5        1
    ...
    14 0.178151709    46 26.48276 25.93103 27.03448 0.178151709 1.00000000 1.00000000       FALSE     1    -1    0 0.178151709     NA grey35       0.5        1
    ...
    
    
    library(dplyr)  # for summarize()
    # Sum width * height over all bins: the total area should be 1
    layer_data(ggplot(mpg, aes(x = hwy)) +
                 geom_histogram(aes(y = after_stat(density)))) |>
      summarize(area = sum((xmax - xmin) * density))
    
    `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
      area
    1    1
    

    Here's a comparison of a few variations. stat = "density" is completely different: it shows a kernel density estimate rather than binned counts. stat = "count" counts each distinct value of hwy, so the bars align with those integer values. The default geom_histogram behavior splits the data into 30 bins and shows the count in each. (We can see the 14th bin spiking up to 46.) after_stat(density) rescales those bin counts so that the total area of the bars is 1. Alternatively, we could set binwidth = 1 to get a histogram with area 1 whose shape matches the pattern we saw with stat = "count".


    library(patchwork)  # provides `|` for placing plots side by side
    library(tidyverse)  # loads ggplot2, which also supplies the mpg data
    ggplot(mpg, aes(x = hwy)) +
      geom_histogram(stat = "density") +
      labs(title = "stat = density") |
    
    ggplot(mpg, aes(x = hwy)) +
      geom_histogram(stat = "count") +
      labs(title = "stat = count") |
    
    ggplot(mpg, aes(x = hwy)) +
      geom_histogram() +
      labs(title = "default") |
      
    ggplot(mpg, aes(x = hwy)) +
      geom_histogram(aes(y = after_stat(density))) +
      labs(title = "after_stat(density)") |
      
    ggplot(mpg, aes(x = hwy)) +
      geom_histogram(aes(y = after_stat(density)), binwidth = 1) +
      labs(title = "after_stat(density),\nbinwidth = 1")
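
    Two of the claims above can also be checked numerically instead of by eye. The following is a rough sketch, not part of the original comparison: the object names are hypothetical, and base R's density() is used as a stand-in for the kernel density estimate, so its defaults are assumed (not verified) to be close to what ggplot2 draws.

    ```r
    library(ggplot2)

    p0 <- ggplot(mpg, aes(x = hwy))

    # Check 1: with binwidth = 1 each bin holds a single integer hwy value,
    # so the density heights should equal the share of observations per value.
    bins <- layer_data(
      p0 + geom_histogram(aes(y = after_stat(density)), binwidth = 1)
    )
    nonzero <- bins[bins$count > 0, ]                  # drop empty bins
    shares  <- as.vector(table(mpg$hwy)) / nrow(mpg)   # share per hwy value
    same_heights <- isTRUE(all.equal(nonzero$density, shares))

    # Check 2: a kernel density estimate is a smooth curve with area close to 1.
    # stats::density() is a stand-in here (assumption, see the note above).
    kde <- density(mpg$hwy)
    kde_area <- sum(kde$y) * diff(kde$x[1:2])          # simple Riemann sum

    same_heights
    round(kde_area, 2)
    ```

    If same_heights is TRUE, the binwidth = 1 histogram is just the stat = "count" pattern divided by the number of observations, which is why the two panels have the same shape.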