rggplot2tapply

Emtpy factor level with tapply in after_stat causes hodgepodge


I would like to draw a plot with percentage labels per x-axis group. This works fine without empty groups:

# library
library(ggplot2)
library(reshape2)

# example data from reshape2
str(tips)
#> 'data.frame':    244 obs. of  7 variables:
#>  $ total_bill: num  17 10.3 21 23.7 24.6 ...
#>  $ tip       : num  1.01 1.66 3.5 3.31 3.61 4.71 2 3.12 1.96 3.23 ...
#>  $ sex       : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 2 2 2 2 2 ...
#>  $ smoker    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ day       : Factor w/ 4 levels "Fri","Sat","Sun",..: 3 3 3 3 3 3 3 3 3 3 ...
#>  $ time      : Factor w/ 2 levels "Dinner","Lunch": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ size      : int  2 3 3 2 4 4 2 4 2 2 ...

# function to count percentage per day
comp_pct <- function(count, day) {
  count / tapply(count, day, sum)[day]
}

# correct plot

ggplot(tips, aes(x = day, group = sex)) +
  geom_bar(aes(y = ..prop.., fill = factor(..group..)), stat = "count") +
  geom_text(aes(
    label = after_stat(scales::percent(comp_pct(count, x))),
    y = ..prop..
  ), stat = "count", position = position_stack(vjust = 0.5)) +
  labs(y = "Percent", fill = "sex") +
  scale_x_discrete(drop=FALSE) +
  scale_y_continuous(labels = scales::percent)

However, after adding an empty level, the labelling with after_stat does not work anymore as expected. I am not sure if this is caused by the ordering of the output from the tapply() in comp_pct. However, I am unable to solve it.


# additional empty level
tips -> tips1

tips1$day <- factor(tips$day, levels=c("NewDay",levels(tips$day)))
levels(tips1$day)
#> [1] "NewDay" "Fri"    "Sat"    "Sun"    "Thur"

# bars OK, labels not OK
ggplot(tips1, aes(x = day, group = sex)) +
  geom_bar(aes(y = ..prop.., fill = factor(..group..)), stat = "count") +
  geom_text(aes(
    label = after_stat(scales::percent(comp_pct(count, x))),
    y = ..prop..
  ), stat = "count", position = position_stack(vjust = 0.5)) +
  labs(y = "Percent", fill = "sex") +
  scale_x_discrete(drop=FALSE) +
  scale_y_continuous(labels = scales::percent)
#> Warning: Removed 2 rows containing missing values (geom_text).

Created on 2022-04-02 by the reprex package (v2.0.1)


Solution

  • The issue is that using count / tapply(count, day, sum)[day] extracts the computed percentages by position (Sorry. My fault. (;) While this works fine for the original dataset, it does not work in the more general case, i.e. an NA is returned for the fourth bar label:

    print(day)
    #> [1] 2 3 4 5 2 3 4 5
    #>         3         4         5      <NA>         3         4         5      <NA> 
    print(count / tapply(count, day, sum)[day])
    #> 0.1034483 0.3684211 0.2903226        NA 0.1149425 0.7763158 0.9354839        NA 
    

    To fix that we have to convert day to a character to extract the percentage values by name:

    library(ggplot2)
    library(reshape2)
    
    # additional empty level
    tips -> tips1
    
    # function to count percentage per day
    comp_pct <- function(count, day) {
      count / tapply(count, day, sum)[as.character(day)]
    }
    
    tips1$day <- factor(tips$day, levels = c("NewDay", levels(tips$day)))
    
    ggplot(tips1, aes(x = day, group = sex)) +
      geom_bar(aes(y = ..prop.., fill = factor(..group..)), stat = "count") +
      geom_text(aes(
        label = after_stat(scales::percent(comp_pct(count, x))),
        y = ..prop..
      ), stat = "count", position = position_stack(vjust = 0.5)) +
      labs(y = "Percent", fill = "sex") +
      scale_x_discrete(drop = FALSE) +
      scale_y_continuous(labels = scales::percent)