I have data that looks like this:
expected_data
resp_migration_status | kmcluster | percentage | expected |
---|---|---|---|
Non-migrant | 1 | 21.9 | 30.5 |
Non-migrant | 2 | 30.1 | 27.4 |
Non-migrant | 3 | 24.7 | 19.9 |
Non-migrant | 4 | 23.3 | 22.3 |
Migrant | 1 | 41.9 | 30.5 |
Migrant | 2 | 22.6 | 27.4 |
Migrant | 3 | 19.4 | 19.9 |
Migrant | 4 | 16.1 | 22.3 |
Displaced | 1 | 36.9 | 30.5 |
Displaced | 2 | 26.2 | 27.4 |
Displaced | 3 | 11.9 | 19.9 |
Displaced | 4 | 25 | 22.3 |
I'd like to construct a bar graph that shows percentage for each kmcluster, grouped by resp_migration_status. I've done this successfully using this code:
ggplot(expected_data, aes(x = resp_migration_status, y = percentage, fill = kmcluster)) +
  geom_bar(stat = "identity", position = "dodge") + # Use stat = "identity" for pre-computed values
  labs(
    title = "Percentage distribution of network cluster by migration status",
    x = "Migration Status",
    y = "Percentage",
    fill = "Cluster"
  ) +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) + # Format y-axis as percentages
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
Overlaid on this bar graph, I'd like a second set of bars with black outlines and no fill, showing the expected percentage for each kmcluster within each resp_migration_status. Essentially, it's a graphical representation of a chi-square test: it shows what the distribution of clusters would be for each migration type if cluster membership were independent of migration status, compared to the 'actual' distribution, where some migration types are disproportionately in one cluster.
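For context, expected percentages of this kind can be derived from a contingency table of counts. The following is only a minimal sketch using made-up counts (they are not the counts behind the data above, and the names `counts`, `observed_pct` and `expected_pct` are illustrative); it relies on the `expected` component returned by base R's `chisq.test()`.

```r
# Hypothetical counts (NOT the data from this question):
# rows are migration statuses, columns are clusters.
counts <- matrix(
  c(20, 30, 25, 25,
    40, 20, 20, 20,
    35, 25, 15, 25),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    resp_migration_status = c("Non-migrant", "Migrant", "Displaced"),
    kmcluster = as.character(1:4)
  )
)

# Observed row percentages: share of each cluster within a status.
observed_pct <- prop.table(counts, margin = 1) * 100

# Expected counts under independence, converted to row percentages.
expected_counts <- chisq.test(counts)$expected
expected_pct <- prop.table(expected_counts, margin = 1) * 100

round(observed_pct, 1)
round(expected_pct, 1)
```

Under independence every row of `expected_pct` equals the overall cluster shares, which is why the expected column in the data above repeats the same four values for each migration status.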
How do I overlay a very basic (black outlined) bar graph on the original graph to represent this? I have this code:
ggplot(expected_data, aes(x = resp_migration_status, y = expected, fill = kmcluster)) +
  geom_bar(stat = "identity", position = "dodge", color = "black", fill = NA) + # Use stat = "identity" for pre-computed values, bars with black outlines
  labs(
    title = "Expected percentage distribution of network cluster by migration status",
    x = "Migration Status",
    y = "Percentage"
  ) +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) + # Format y-axis as percentages
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
But setting fill = NA inside geom_bar overrides the fill = kmcluster mapping in aes, so the bars are no longer split by cluster and the result looks like a strange stacked bar (see image).
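To draw the second set of bars on top of the first, explicitly map kmcluster to the group aesthetic in the second layer. Setting fill = NA as a constant replaces the mapped fill in that layer, so without an explicit group there is no longer a variable to dodge by. With group mapped, you still get a dodged bar chart.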
library(ggplot2)

ggplot(expected_data, aes(
  x = resp_migration_status,
  y = percentage, fill = factor(kmcluster)
)) +
  geom_col(position = "dodge") +
  geom_col(aes(y = expected, group = factor(kmcluster)),
    color = "black", fill = NA, position = "dodge"
  ) +
  labs(
    title = "Percentage distribution of network cluster by migration status",
    x = "Migration Status",
    y = "Percentage",
    fill = "Cluster"
  ) +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) + # Format y-axis as percentages
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
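One way to check that the outline layer is really being dodged is to inspect the computed bar positions with `layer_data()`. This is a rough sketch on a small made-up data frame (`toy` and its values are hypothetical stand-ins for `expected_data`), comparing the second layer with and without the explicit group mapping.

```r
library(ggplot2)

# Hypothetical stand-in for expected_data: two statuses, two clusters.
toy <- data.frame(
  resp_migration_status = c("Migrant", "Migrant", "Displaced", "Displaced"),
  kmcluster = c(1L, 2L, 1L, 2L),
  percentage = c(60, 40, 55, 45),
  expected = c(58, 42, 58, 42)
)

# Filled, dodged bars for the observed percentages.
base <- ggplot(toy, aes(
  x = resp_migration_status,
  y = percentage, fill = factor(kmcluster)
)) +
  geom_col(position = "dodge")

# Outline layer with the group aesthetic mapped explicitly.
with_group <- base +
  geom_col(aes(y = expected, group = factor(kmcluster)),
    color = "black", fill = NA, position = "dodge"
  )

# Outline layer relying on the (overridden) fill mapping only.
without_group <- base +
  geom_col(aes(y = expected),
    color = "black", fill = NA, position = "dodge"
  )

# Count distinct left edges of the bars in the second layer of each plot.
n_with <- length(unique(layer_data(with_group, 2)$xmin))
n_without <- length(unique(layer_data(without_group, 2)$xmin))
c(with_group = n_with, without_group = n_without)
```

If the grouping works as described, the grouped version should report one distinct position per status and cluster combination, while the ungrouped version should report fewer because its bars share the same x range.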
DATA
expected_data <- data.frame(
  resp_migration_status = c(
    "Non-migrant", "Non-migrant", "Non-migrant", "Non-migrant",
    "Migrant", "Migrant", "Migrant", "Migrant",
    "Displaced", "Displaced", "Displaced", "Displaced"
  ),
  kmcluster = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L),
  percentage = c(
    21.9, 30.1, 24.7, 23.3,
    41.9, 22.6, 19.4, 16.1,
    36.9, 26.2, 11.9, 25
  ),
  expected = c(
    30.5, 27.4, 19.9, 22.3,
    30.5, 27.4, 19.9, 22.3,
    30.5, 27.4, 19.9, 22.3
  )
)
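In this data frame kmcluster is an integer and resp_migration_status is a character vector, which is why the plot code wraps kmcluster in factor(). As an optional alternative, both columns can be converted once up front. The sketch below uses a hypothetical stand-in, `df`, with the same column types; the level order chosen for the status is an assumption about the desired x-axis order, not something stated in the question.

```r
# Hypothetical stand-in with the same column types as expected_data.
df <- data.frame(
  resp_migration_status = rep(c("Non-migrant", "Migrant", "Displaced"), each = 4),
  kmcluster = rep(1:4, times = 3)
)

# Treat the cluster id as a discrete variable so it can drive fill and group.
df$kmcluster <- factor(df$kmcluster)

# Fix the x-axis order explicitly; character columns are otherwise
# ordered alphabetically on a discrete axis.
df$resp_migration_status <- factor(
  df$resp_migration_status,
  levels = c("Non-migrant", "Migrant", "Displaced")
)

str(df)
```

After this conversion, `fill = kmcluster` and `group = kmcluster` can be used directly in `aes()` without repeating `factor()` in each layer.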