r ggplot2 ggalluvial

How to calculate percentages in geom_flow() based on variable size and not stratum size


I am trying to create an alluvial plot with geom_flow() and display the percentages of the flows. This works, but when I use more than two variables, the percentages in the middle columns are smaller than I expected: they are exactly half of the value I would expect. In the following example, going from V1 to V2, the outgoing percentage of observations that stayed in A is 26%, but the incoming percentage at V2 is 13%. I think the percentages are calculated based on the size of the stratum, which sums the values of both the incoming and the outgoing flows and is therefore 200 instead of 100. How can I display percentages based on the variable size rather than the stratum size?

library(ggplot2)
library(ggalluvial)

set.seed(123)
df <- data.frame(id = rep(1:100, 3), 
                 value = sample(c("A", "B"), replace = TRUE, 300),
                 variable = rep(c("V1", "V2", "V3"), each = 100),
                 N = 1
)

ggplot(df, aes(x = variable, stratum = value, alluvium = id, y = N, fill = value)) +
  geom_lode() + 
  geom_flow() +
  geom_stratum(alpha = 0) + 
  geom_text(stat = "flow", aes(
    label = scales::percent(after_stat(prop), accuracy = 1),
    hjust = after_stat(flow) == "to"
  ))

Created on 2023-07-28 with reprex v2.0.2

I know that I could get the expected percentages by hard-coding the group size, e.g. label = scales::percent(after_stat(count)/100, accuracy = 1). However, later on I want to use facets with different group sizes, so I need a solution that calculates the percentages from the given data.


Solution

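  • You are right. The prop computed under the hood does not take the flow direction into account, so at a middle axis the incoming ("to") and outgoing ("from") flows share one denominator and each percentage comes out halved.

    One option to fix that is to compute the percentages manually inside after_stat(), here using ave() to get the totals per axis and flow direction: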

    library(ggplot2)
    library(ggalluvial)
    
    set.seed(123)
    df <- data.frame(
      id = rep(1:100, 3),
      value = sample(c("A", "B"), replace = TRUE, 300),
      variable = rep(c("V1", "V2", "V3"), each = 100),
      N = 1
    )
    
    ggplot(df, aes(
      x = variable, stratum = value, alluvium = id,
      y = N, fill = value
    )) +
      geom_lode() +
      geom_flow() +
      geom_stratum(alpha = 0) +
      geom_text(stat = "flow", aes(
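        label = after_stat(
          scales::percent(
            # numerator: count of each flow within its axis and direction
            ave(count, x, flow, group, FUN = sum) /
              # denominator: total count per axis and direction ("from"/"to")
              ave(count, x, flow, FUN = sum),
            accuracy = 1
          )
        ),
        # right-align labels of incoming flows, left-align outgoing ones
        hjust = after_stat(flow) == "to"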
      ))
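
To see why adding flow to the denominator removes the halving, here is a minimal sketch using base R only. The data frame is a hypothetical stand-in for the data computed by the flow stat: the column names x, flow, group and count mirror those used in the after_stat() call above, but the numbers are made up and this is not the actual ggalluvial internals.

```r
# Hypothetical flow-level data: axis 1 only has outgoing flows,
# axis 2 (a middle axis) has both incoming and outgoing flows.
flows <- data.frame(
  x     = c(1, 1, 2, 2, 2, 2),
  flow  = c("from", "from", "to", "to", "from", "from"),
  group = 1:6,
  count = c(30, 70, 30, 70, 45, 55)
)

# Denominator per axis only: at x = 2 it pools "to" and "from" (200),
# so every share at the middle axis is halved.
flows$share_by_axis <- flows$count / ave(flows$count, flows$x, FUN = sum)

# Denominator per axis AND flow direction (100 each): shares are as expected.
flows$share_by_side <-
  ave(flows$count, flows$x, flows$flow, flows$group, FUN = sum) /
  ave(flows$count, flows$x, flows$flow, FUN = sum)

print(flows)
```

In this toy data the first flow shows 30% on both its outgoing side (x = 1) and its incoming side (x = 2) with share_by_side, whereas share_by_axis gives 30% and 15%. Since the denominator comes from the data, no group size is hard-coded. With facets, the grouping would presumably also need to include the panel (for example the PANEL column in both ave() calls); that part is an assumption worth checking against your plot.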