rggplot2tidyversegeom-col

Get row columns by group for geom_col in ggplot


I am trying to calculate row percentages by demographics of various score levels--in my data, that would be what % of white people (or % of black people, or % male, or % who have education level 2, and so on) have a score of 0 (or 1, 2, or 3)--and then use that to create a big plot.

So in my example data below, 8.33% of race == 1 (which is white) have a score of 0, 25% have a score of 1, 25% have a score of 2, and 41.67% have a score of 3.

Then the ultimate end goal would be to get some type of bar plot where the 4 levels of 'score' are across the x axis, and the various comparisons of demographics run down the y axis. Something that looks visually sort of like this, but with the levels of 'score' across the top instead of education levels: enter image description here .

I already have some code to make the actual figure, which I've done in other instances but with externally/already-calculated percentages:

ggplot(data, aes(x = percent, y = category, fill = group)) +
  geom_col(orientation = "y", width = .9) +
  facet_grid(group~score_var, 
             scales = "free_y", space = "free_y") +
  labs(title = "Demographic breakdown of 'Score'") +
  theme_bw()

I am just struggling to figure out the best way to calculate these row percentages, presumably using group_by() and summarize and then storing or configuring them in a way that they can be plotted. Thank you.

d <- structure(list(race = c(1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 1, 
1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 
3, 3), gender = c(0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 
0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1
), education = c(1, 3, 3, 2, 1, 3, 2, 3, 4, 4, 2, 3, 3, 2, 3, 
4, 1, 3, 1, 3, 3, 2, 1, 3, 2, 3, 4, 4, 2, 3, 3, 2, 3, 4, 1, 3
), score = c(1, 2, 2, 1, 2, 3, 3, 2, 0, 0, 1, 2, 1, 3, 0, 0, 
3, 3, 3, 3, 3, 3, 3, 3, 2, 1, 2, 3, 1, 3, 3, 0, 1, 2, 2, 0)), row.names = c(NA, 
-36L), spec = structure(list(cols = list(race = structure(list(), class = c("collector_double", 
"collector")), gender = structure(list(), class = c("collector_double", 
"collector")), education = structure(list(), class = c("collector_double", 
"collector")), score = structure(list(), class = c("collector_double", 
"collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), delim = ","), class = "col_spec"), problems = <pointer: 0x000001bd978b0df0>, class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"))

Solution

  • Taking inspiration from @Zhiqiang Wang's excellent first pass at this, I finally figured out a solution. I still need to change the order of the labels (to put the education levels in order, and move the race variables to the top of the figure) but this is basically what I was envisioning.

    d_test <- d %>% mutate(
            gender = factor(2-gender), 
            race = factor(race), 
            education = factor(education)) %>%
        pivot_longer(!score, names_to = "group", values_to = "levels")
    
    
    d_test <- d_test %>% group_by(group, levels) %>% 
        mutate(group_n  = n()) %>% 
        group_by(group, levels, score) %>% 
        summarise(percent = round(100*n()/group_n[1], 1))
    
    d_test <- d_test %>% 
      mutate(var = case_when(group == "gender" & levels == 1 ~ "female",
                             group == "gender" & levels == 2 ~ "male",
                             group == "race" & levels == 1 ~ "white",
                             group == "race" & levels == 2 ~ "black",
                             group == "race" & levels == 3 ~ "hispanic",
                             group == "education" & levels == 1 ~ "dropout HS",
                             group == "education" & levels == 2 ~ "grad HS",
                             group == "education" & levels == 3 ~ "some coll",
                             group == "education" & levels == 4 ~ "grad coll"))
    
    ggplot(d_test, aes(x = percent, y = var, fill = group)) +
      geom_col(orientation = "y", width = .9) +
      facet_grid(group ~ score,
                   scales = "free_y", space = "free_y") +
      labs(title = "Demographic breakdown of 'Score'",
             y = "",
             x = "Percent") +
      theme_minimal() +
      theme(legend.position = "none",
            strip.text.y = element_blank())
    
    

    enter image description here