My example: I am trying to create a graph that captures how class pass rate changes by incoming GPA. Ideally, this will be in histogram form, where I can edit the binwidth quickly to see how pass rate changes at varying bins of GPA and can incorporate information about the density within each bin. In the simulated data, there are 200 observations, each with a GPA and pass value (0, 1).
library(ggplot2)

set.seed(435)
GPA <- round(rnorm(n = 200, mean = 3.2, sd = .7), 2)
Pass <- rep(c(0, 1), 100)
data <- data.frame(GPA, Pass)
I think the graph I'm looking for is a combination of the following two options:
Option 1:
# after_stat() replaces the older ..count.. notation (ggplot2 >= 3.3.0)
ggplot(data, aes(x = GPA, fill = factor(Pass))) +
  geom_histogram(position = "fill", binwidth = .2, aes(y = after_stat(count))) +
  scale_fill_manual(name = "Class Outcome",
                    labels = c("Did not Pass", "Passed"),
                    values = c("#FFFFFF", "#333999")) +
  labs(title = "Pass Rate by Incoming GPA", x = "Incoming GPA", y = "Proportion Passed")
In this option, I can see the proportion of students that passed at each bin of GPA (using white to erase out the proportion that did not pass), but I don't have any information about how many students are in each bin.
Option 2:
# after_stat() replaces ..count.. (ggplot2 >= 3.3.0); linewidth replaces size for outlines (>= 3.4.0)
ggplot(data, aes(x = GPA, fill = after_stat(count), group = factor(Pass))) +
  geom_histogram(position = "fill", binwidth = .2, aes(y = after_stat(count)),
                 color = "white", linewidth = 1) +
  scale_fill_gradient(name = "Number of Students",
                      low = "#99CCFF", high = "#000099") +
  labs(title = "Pass Rate by Incoming GPA", x = "Incoming GPA", y = "Proportion Passed")
In this graph, I can get the scale gradient and the proportions to include information about the number of students within each bin, but you can't tell the difference between people who passed and did not pass; they're all filled with the same gradient scale. Coloring the bars to try to differentiate by group doesn't help.
Is there a way to subset scale_fill_gradient so that it applies to different levels of a factor, letting me use different gradients to distinguish the proportion that passed from the proportion that did not? Or is there a workaround somewhere?
Here's a (possibly sub-optimal) workaround. If we know the number (n) and proportion (p) of students passing in a bin, we also know the number ((n/p) * (1-p)) and proportion (1-p) of students not passing. Perhaps displaying both is a bit redundant. Maybe it's not, but that's my justification for "hiding" one set of bars.
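To make that arithmetic concrete, here is a small worked example. The numbers are made up for illustration (15 passing students who make up 75% of a bin) and are not taken from the simulated data; the variable names are likewise hypothetical.

```r
# Hypothetical bin: 15 students passed, and they are 75% of the bin
n_pass <- 15      # number passing (n)
p_pass <- 0.75    # proportion passing (p)

n_total <- n_pass / p_pass          # total students in the bin: n / p
n_fail  <- n_total * (1 - p_pass)   # students not passing: (n / p) * (1 - p)
p_fail  <- 1 - p_pass               # proportion not passing: 1 - p

print(c(n_total = n_total, n_fail = n_fail, p_fail = p_fail))
```

One caveat: in a bin where nobody passed, n and p are both 0, so n/p is undefined and the number of students not passing cannot be recovered this way.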
Why not just remove the top bar by controlling alpha? We can use scale_alpha_manual to make the top bars transparent and also hide the alpha legend.
# after_stat() replaces ..count.. (ggplot2 >= 3.3.0); linewidth replaces size for outlines (>= 3.4.0)
ggplot(data, aes(x = GPA, fill = after_stat(count), group = factor(Pass),
                 alpha = factor(Pass))) +
  geom_histogram(position = "fill", binwidth = .2, aes(y = after_stat(count)),
                 color = "white", linewidth = 1) +
  scale_fill_gradient(name = "Number of Students",
                      low = "#99CCFF", high = "#000099") +
  labs(title = "Pass Rate by Incoming GPA", x = "Incoming GPA", y = "Proportion Passed") +
  # Pass == 0 bars become fully transparent; guide = "none" drops the alpha legend
  scale_alpha_manual(values = c('0' = 0, '1' = 1),
                     guide = "none")
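As a sanity check on what the plot encodes, the per-bin counts and pass rates can also be tabulated directly. This is only a minimal base-R sketch and not part of the ggplot code above. The names gpa_cents, bw_cents, bin_start and summary_by_bin are made up for illustration, and the bin edges (multiples of the bin width) are an assumption that need not match where geom_histogram() places its bins.

```r
# Recreate the simulated data from the question
set.seed(435)
GPA  <- round(rnorm(n = 200, mean = 3.2, sd = .7), 2)
Pass <- rep(c(0, 1), 100)

binwidth <- 0.2   # change this to try other bin widths

# Work in integer hundredths to avoid floating-point surprises at bin edges
gpa_cents <- round(GPA * 100)
bw_cents  <- round(binwidth * 100)

# Lower edge of each bin (assumed edges: multiples of binwidth)
bin_start <- (gpa_cents %/% bw_cents) * bw_cents / 100

n_total <- tapply(Pass, bin_start, length)  # students per bin
n_pass  <- tapply(Pass, bin_start, sum)     # passing students per bin

summary_by_bin <- data.frame(
  bin_start = as.numeric(names(n_total)),
  n_total   = as.vector(n_total),
  n_pass    = as.vector(n_pass),
  p_pass    = as.vector(n_pass / n_total)
)
print(summary_by_bin)
```

Here n_pass and p_pass correspond to the n and p discussed earlier, and n_total is the bin size that the fill gradient is meant to convey.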
End note: I would have preferred to post this as a comment, but I couldn't express it concisely enough.