I don't understand the difference between the statistical transformations used in plot1 and plot2.
plot1 <- ggplot(mpg, aes(x = hwy)) +
geom_histogram(stat = "density")
plot2 <-ggplot(mpg, aes(x = hwy)) +
geom_histogram(aes(y = after_stat(density)))
I tried to compare what the y axes mean in the two plots. I think plot1 is similar to geom_density, while in plot2 the density of hwy is calculated first and then the counts in each bin. Is my understanding correct?
Right, plot2 takes the default behavior of geom_histogram, where the data is binned into 30 bins and counted by those bins. We can use ggplot2::layer_data to see the calculations it's doing.
In this data, the default bins come out 1.103 units wide. Since the hwy data is integers, this means most bins reflect one hwy value, but a few, like the 14th one, reflect two hwy values.
The 14th bin spans from an xmin of 25.93 to an xmax of 27.03, so it includes all the observations where hwy is 26 or 27. That's 46 (19.7%) of the 234 observations, but since each bin is 1.103 wide, the calculated height of that bin is 46 / 234 / 1.103 = 0.178. In other words, density = count / total count / bin width, so the total area of the bins will be 1.
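To make that arithmetic concrete, here is a minimal sketch that reproduces the height of the 14th bin by hand, without calling layer_data. It assumes the default bin width equals the range of hwy divided by (bins - 1), which matches the xmin/xmax values in the output for this data; the variable names (n, width, in_bin, dens) are just for illustration.

```r
library(ggplot2)  # only needed for the mpg dataset

hwy <- mpg$hwy
n <- length(hwy)                      # 234 observations

# Assumption: with bins = 30, the bin width is range / (bins - 1).
# This agrees with xmax - xmin in the layer_data output (about 1.103).
width <- diff(range(hwy)) / (30 - 1)

# The 14th bin runs from about 25.93 to 27.03, so it holds hwy 26 and 27.
in_bin <- sum(hwy %in% c(26, 27))     # 46

# density = count / total count / bin width
dens <- in_bin / n / width
round(dens, 3)                        # 0.178
```

The result should line up with the density column for row 14 in the layer_data output.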
library(ggplot2)
# Inspect the values stat_bin computes for each bin
layer_data(ggplot(mpg, aes(x = hwy)) +
  geom_histogram(aes(y = after_stat(density))))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
y count x xmin xmax density ncount ndensity flipped_aes PANEL group ymin ymax colour fill linewidth linetype
1 0.019364316 5 12.13793 11.58621 12.68966 0.019364316 0.10869565 0.10869565 FALSE 1 -1 0 0.019364316 NA grey35 0.5 1
2 0.000000000 0 13.24138 12.68966 13.79310 0.000000000 0.00000000 0.00000000 FALSE 1 -1 0 0.000000000 NA grey35 0.5 1
3 0.007745726 2 14.34483 13.79310 14.89655 0.007745726 0.04347826 0.04347826 FALSE 1 -1 0 0.007745726 NA grey35 0.5 1
...
14 0.178151709 46 26.48276 25.93103 27.03448 0.178151709 1.00000000 1.00000000 FALSE 1 -1 0 0.178151709 NA grey35 0.5 1
...
library(dplyr)  # for summarize()
# Sum of width * height over all bins should be 1
layer_data(ggplot(mpg, aes(x = hwy)) +
  geom_histogram(aes(y = after_stat(density)))) |>
  summarize(area = sum((xmax - xmin) * density))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
area
1 1
Here's a comparison of a few variations. stat = "density" is completely different, showing the calculated kernel density estimate. stat = "count" counts each value of hwy, aligning to those integer values. The default geom_histogram behavior bins into 30 bins and gives the count. (We can see the 14th bin spiking up high to 46.) after_stat(density) converts those bin values to have a total area of 1. We could alternatively set the bin width to 1 to get a histogram with area 1 that corresponds to the pattern we saw with stat = "count".
library(tidyverse)  # loads ggplot2 and dplyr
library(patchwork)  # provides `|` for placing plots side by side
ggplot(mpg, aes(x = hwy)) +
geom_histogram(stat = "density") +
labs(title = "stat = density") |
ggplot(mpg, aes(x = hwy)) +
geom_histogram(stat = "count") +
labs(title = "stat = count") |
ggplot(mpg, aes(x = hwy)) +
geom_histogram() +
labs(title = "default") |
ggplot(mpg, aes(x = hwy)) +
geom_histogram(aes(y = after_stat(density))) +
labs(title = "after_stat(density)") |
ggplot(mpg, aes(x = hwy)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 1) +
labs(title = "after_stat(density),\nbinwidth = 1")
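As a rough check on the last variation, the sketch below pulls the computed values for the binwidth = 1 histogram and compares them with a plain tabulation of hwy. It assumes each integer hwy value ends up in its own bin when binwidth = 1 (which is what the plot suggests); the names p, d, area and per_value are only illustrative.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1)
d <- layer_data(p)

# Total area of the bars: sum of width * height
area <- sum((d$xmax - d$xmin) * d$density)

# Counts per distinct hwy value, like stat = "count" would show
per_value <- as.numeric(table(mpg$hwy))

# Non-empty bins should carry the same counts as the tabulation
all.equal(d$count[d$count > 0], per_value)

# With a bin width of 1, density reduces to count / n
all.equal(d$density, d$count / nrow(mpg))
```

If both comparisons return TRUE, the binwidth = 1 histogram is the stat = "count" pattern rescaled by 1 / 234, with empty bins filled in for hwy values that never occur.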