r ggplot2

How can I layer an outlined bar graph on top of a colored bar graph in ggplot?


I have data that looks like this:

expected_data

   resp_migration_status kmcluster percentage expected
1            Non-migrant         1       21.9     30.5
2            Non-migrant         2       30.1     27.4
3            Non-migrant         3       24.7     19.9
4            Non-migrant         4       23.3     22.3
5                Migrant         1       41.9     30.5
6                Migrant         2       22.6     27.4
7                Migrant         3       19.4     19.9
8                Migrant         4       16.1     22.3
9              Displaced         1       36.9     30.5
10             Displaced         2       26.2     27.4
11             Displaced         3       11.9     19.9
12             Displaced         4       25.0     22.3

I'd like to construct a bar graph that shows percentage for each kmcluster, grouped by resp_migration_status. I've done this successfully using this code:

ggplot(expected_data, aes(x = resp_migration_status, y = percentage, fill = kmcluster)) +
  geom_bar(stat = "identity", position = "dodge") +  # Use stat = "identity" for pre-computed values
  labs(
    title = "Percentage distribution of network cluster by migration status",
    x = "Migration Status",
    y = "Percentage",
    fill = "Cluster"
  ) +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +  # Format y-axis as percentages
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Overlaid on this bar graph, I'd like a second set of bars, drawn as black outlines only, showing the expected percentage for each kmcluster within each resp_migration_status. Essentially, it's a graphical representation of a chi-square test: it shows what the distribution of clusters by migration type would be if cluster membership were unrelated to migration type, compared with the 'actual' distribution, where some migration types are disproportionately in one cluster.

How do I overlay a very basic (black outlined) bar graph on the original graph to represent this? I have this code:

ggplot(expected_data, aes(x = resp_migration_status, y = expected, fill = kmcluster)) +
  geom_bar(stat = "identity", position = "dodge", color = "black", fill = NA) +  # Use stat = "identity" for pre-computed values, bars with black outlines
  labs(
    title = "Expected percentage distribution of network cluster by migration status",
    x = "Migration Status",
    y = "Percentage"
  ) +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +  # Format y-axis as percentages
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

But adding fill = NA inside geom_bar overrides the fill = kmcluster mapping in aes, so the data is no longer split by cluster and the bars for each migration status end up drawn on top of one another, like some strange stacked bar.

So the first question is:

  1. How do I divide the data by migration type and cluster, without coloring in each bar and instead just outlining them in black?

Secondly:

  2. How do I overlay this bar graph on top of the original one?

Solution

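  • To draw the second set of bars on top of the first, add a second geom_col layer that maps y to expected. Because fill = NA is set as a constant in that layer, the fill mapping no longer splits the data by cluster, so you have to map the group aesthetic explicitly to still get a dodged bar chart.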

    library(ggplot2)
    
    ggplot(expected_data, aes(
      x = resp_migration_status,
      y = percentage, fill = factor(kmcluster)
    )) +
      geom_col(position = "dodge") +
      geom_col(aes(y = expected, group = factor(kmcluster)),
        color = "black", fill = NA, position = "dodge"
      ) +
      labs(
        title = "Percentage distribution of network cluster by migration status",
        x = "Migration Status",
        y = "Percentage",
        fill = "Cluster"
      ) +
      scale_y_continuous(labels = scales::percent_format(scale = 1)) + # Format y-axis as percentages
      theme_minimal() +
      theme(
        axis.text.x = element_text(angle = 45, hjust = 1)
      )
    


    DATA

    expected_data <- data.frame(
      resp_migration_status = c(
        "Non-migrant", "Non-migrant", "Non-migrant", "Non-migrant",
        "Migrant", "Migrant", "Migrant", "Migrant",
        "Displaced", "Displaced", "Displaced", "Displaced"
      ),
      kmcluster = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L),
      percentage = c(
        21.9, 30.1, 24.7,
        23.3, 41.9, 22.6, 19.4, 16.1, 36.9, 26.2, 11.9, 25
      ),
      expected = c(
        30.5, 27.4, 19.9,
        22.3, 30.5, 27.4, 19.9, 22.3, 30.5, 27.4, 19.9, 22.3
      )
    )
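To address the first question on its own, here is a minimal sketch of the outline-only plot, together with a quick check that the explicit group mapping really does split the bars by cluster. It assumes the same expected_data as in the DATA block; the names p_outline, outline_bars, n_groups and n_positions are only illustrative. layer_data() is the ggplot2 helper that returns the data computed for a given layer.

```r
library(ggplot2)

# Same values as the DATA block, written compactly with rep().
expected_data <- data.frame(
  resp_migration_status = rep(c("Non-migrant", "Migrant", "Displaced"), each = 4),
  kmcluster = rep(1:4, times = 3),
  percentage = c(21.9, 30.1, 24.7, 23.3, 41.9, 22.6, 19.4, 16.1, 36.9, 26.2, 11.9, 25),
  expected = rep(c(30.5, 27.4, 19.9, 22.3), times = 3)
)

# Outline-only bars: no fill, black border, dodged by cluster via `group`.
p_outline <- ggplot(expected_data, aes(x = resp_migration_status, y = expected)) +
  geom_col(aes(group = factor(kmcluster)),
    color = "black", fill = NA, position = "dodge"
  )

# Inspect what ggplot2 computed for the (only) layer.
outline_bars <- layer_data(p_outline, 1)

# Number of distinct groups and of distinct horizontal bar positions.
n_groups <- length(unique(outline_bars$group))
n_positions <- nrow(unique(outline_bars[, c("xmin", "xmax")]))
```

If the grouping works as intended, n_groups should be the number of clusters and n_positions the number of rows in expected_data, meaning every bar has its own slot instead of being drawn over its neighbours.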