r ggplot2 pivot bar-chart fill

How to fill a barplot, not according to the occurrences of an element in a dataset, but according to its value in another variable? See example


Imagine having this dataset:

Country Energy_Source Twh   Tot
Italy   Biofuel        24.5 100
Italy   Nuclear        15.4 100
Italy   Gas            40.1 100
Italy   Hydro          20.0 100
France  Biofuel        20.0 120
France  Nuclear        75.0 120
France  Gas            10.0 120
France  Hydro          4.3  120
France  Wind           10.7 120   

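Note: Tot is the sum of Twh by Country. The data frame below uses whole-number approximations of the Twh values, chosen so that each country still sums to Tot.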

dataset1 <- data.frame(
  "Country" = c(rep(x = "Italy", times = 4), rep(x = "France", times = 5)),
  "Energy_Source" = c("Biofuel", "Nuclear", "Gas", "Hydro", "Biofuel", "Nuclear", "Gas", "Hydro", "Wind"),
  "Twh" = c(25, 15, 40, 20, 20, 75, 10, 5, 10),
  "Tot" = c(rep(x = 100, times = 4), rep(x = 120, times = 5))
)

Now, we want ggplot2 to interpret dataset1 as if it were the following dataset2, without performing a pivot_longer on dataset1.

Here is the new dataset2. It represents exactly the same information as dataset1, but with duplicated rows, so that ggplot2 interprets the number of occurrences of each element as a proportion:

Country Energy_Source Twh Tot
Italy   Biofuel        25 100
Italy   Biofuel        25 100
Italy   Biofuel        25 100
.
.
. (22 more rows)
Italy   Nuclear        15 100
. (14 more rows)
Italy   Gas            40 100
. (etcetera)

dataset2 <- data.frame(
  "Country" = c(rep(x = "Italy", times = 100), rep(x = "France", times = 120)),
  "Energy_Source" = c(
    rep(x = "Biofuel", times = 25), rep(x = "Nuclear", times = 15),
    rep(x = "Gas", times = 40), rep(x = "Hydro", times = 20),
    rep(x = "Biofuel", times = 20), rep(x = "Nuclear", times = 75),
    rep(x = "Gas", times = 10), rep(x = "Hydro", times = 5),
    rep(x = "Wind", times = 10)
  ),
  "Tot" = c(rep(x = 100, times = 100), rep(x = 120, times = 120))
)

Normally, we would use the following code to draw the barplot:

ggplot(data = dataset2, mapping = aes(
  x = Tot,
  y = reorder(Country, Tot),
  fill = Energy_Source
)) +
  geom_col()

See here:

[image: output_plot]

But is it possible to use dataset1 and not dataset2 to create the same graph with ggplot2?

In other terms:

How to fill a barplot, not according to the occurrences of an element in a dataset, but according to its value in another variable?

Thanks!

I tried performing a pivot_longer from the tidyr package but it was too costly for my Shiny App.


Solution

  • Here are two ways to recreate your plot using dataset1.

    1. Scale in proportion to Twh. This seems simplest and most efficient, provided you don't need the visible bars to be composed of many stacked smaller bars.

    ggplot(dataset1, aes(
      x = Tot*Twh, 
      y = reorder(Country, Tot), 
      fill = Energy_Source
    )) +
      geom_col()
    


    2. tidyr::uncount is the tool to use if you want to make copies of each observation; it replicates your dataset2 approach. I have added borders to show that this produces many small bars stacked together. This approach is fine here, but I've had cases where it plots very slowly (e.g. with >100k observations), messily (e.g. if the borders overwhelm the areas or create moiré effects), or inefficiently (e.g. a vector format like PDF saves a separate object for each bar plotted, even if it is <1 pixel).

    ggplot(dataset1 |> tidyr::uncount(Twh), aes(
      x = Tot, 
      y = reorder(Country, Tot), 
      fill = Energy_Source
    )) +
      geom_col(color = "gray50")
    

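As a quick check on why the two approaches draw the same bars, here is a minimal base-R sketch. The names `weighted`, `expanded`, `stacked`, `comparison`, `Length` and `Stacked` are only illustrative, and the `rep()` indexing is a base-R stand-in for `tidyr::uncount`, not what tidyr does internally. It compares the segment length from approach 1 (`Tot * Twh`) with the length that builds up when each row is repeated `Twh` times and the copies are stacked.

```r
dataset1 <- data.frame(
  Country = c(rep("Italy", 4), rep("France", 5)),
  Energy_Source = c("Biofuel", "Nuclear", "Gas", "Hydro",
                    "Biofuel", "Nuclear", "Gas", "Hydro", "Wind"),
  Twh = c(25, 15, 40, 20, 20, 75, 10, 5, 10),
  Tot = c(rep(100, 4), rep(120, 5))
)

# Approach 1: one segment per row, with length Tot * Twh
weighted <- transform(dataset1, Length = Tot * Twh)

# Approach 2 (base-R stand-in for uncount): repeat each row Twh times
expanded <- dataset1[rep(seq_len(nrow(dataset1)), times = dataset1$Twh), ]

# geom_col() stacks the copies, so each segment's length is the sum of Tot
stacked <- aggregate(Tot ~ Country + Energy_Source, data = expanded, FUN = sum)
names(stacked)[names(stacked) == "Tot"] <- "Stacked"

# Line the two results up by country and energy source
comparison <- merge(weighted, stacked, by = c("Country", "Energy_Source"))

all(comparison$Length == comparison$Stacked)  # do the segment lengths agree?
nrow(dataset1)  # rows plotted by approach 1
nrow(expanded)  # rows plotted by approach 2
```

The row counts at the end show the trade-off: the picture is the same, but the expanded data carries one row per unit of Twh.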


    Edit: For the OP's description of "sequential" data, I think it's computationally more efficient, and cleaner to plot, to calculate the aggregations (here, each country's average usage of each energy source across years) with a dplyr step.

    Compare a version of what's in the OP's suggested answer:

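    library(dplyr)
    library(tidyr)
    library(ggplot2)

    # dataset1b is a second data frame with the same columns as dataset1
    # (presumably from the OP's answer; it is not defined in this post)
    rbind(dataset1, dataset1b) |>
      mutate(across(Twh, ~.x * 10)) |>
      uncount(Twh) |>
      ggplot(aes(
        x = Tot / (10 * 2), 
        y = reorder(Country, Tot), 
        fill = Energy_Source
      )) +
      geom_col()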
    

    ...to a version using dplyr. The two look the same on the surface, but the uncount version plots 4,400 observations, whereas the dplyr version plots only the 9 contiguous bars we can see.

    # .by in summarize() requires dplyr >= 1.1.0
    rbind(dataset1, dataset1b) |>
      summarize(Total = mean(Tot * Twh), .by = c(Country, Energy_Source)) |> 
      ggplot(aes(
        x = Total,
        y = reorder(Country, Total), 
        fill = Energy_Source
      )) +
      geom_col()
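Since `dataset1b` is not shown in this post, the arithmetic behind that comparison can be sketched with a stand-in. In the base-R snippet below, `dataset1b` is a HYPOTHETICAL second year with invented Twh values (each country still sums to its Tot); `both` and `avg` are illustrative names. It uses `aggregate()` in place of the dplyr step, so it is a cross-check under those assumptions, not the code from either answer.

```r
dataset1 <- data.frame(
  Country = c(rep("Italy", 4), rep("France", 5)),
  Energy_Source = c("Biofuel", "Nuclear", "Gas", "Hydro",
                    "Biofuel", "Nuclear", "Gas", "Hydro", "Wind"),
  Twh = c(25, 15, 40, 20, 20, 75, 10, 5, 10),
  Tot = c(rep(100, 4), rep(120, 5))
)

# HYPOTHETICAL second year: same structure, invented Twh values
dataset1b <- transform(dataset1, Twh = c(30, 10, 40, 20, 25, 70, 10, 5, 10))

both <- rbind(dataset1, dataset1b)
both$Total <- both$Tot * both$Twh

# One row per country and energy source: the mean of Tot * Twh across years
avg <- aggregate(Total ~ Country + Energy_Source, data = both, FUN = mean)

nrow(avg)            # rows handed to ggplot2 by the aggregated version
sum(both$Twh * 10)   # rows the uncount version creates after Twh * 10
```

With these assumed values the aggregated data has one row per visible bar segment, while the uncount route creates a row for every tenth of a Twh.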
    

    For reference, your example plot with the same dimensions:

    [image: the OP's example plot]
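A variation on approach 1, if mapping an expression to `x` feels indirect: `geom_bar()` counts rows by default, but its `weight` aesthetic makes each row count for its weight instead, which is close to what the question asks for. This is a sketch rather than something taken from the answers above, and it assumes a ggplot2 version that accepts a `y`-only mapping for horizontal bars (3.3.0 or later). The name `p` is only illustrative.

```r
library(ggplot2)

dataset1 <- data.frame(
  Country = c(rep("Italy", 4), rep("France", 5)),
  Energy_Source = c("Biofuel", "Nuclear", "Gas", "Hydro",
                    "Biofuel", "Nuclear", "Gas", "Hydro", "Wind"),
  Twh = c(25, 15, 40, 20, 20, 75, 10, 5, 10),
  Tot = c(rep(100, 4), rep(120, 5))
)

# Each row contributes Tot * Twh to its bar instead of a count of 1
p <- ggplot(dataset1, aes(
  y = reorder(Country, Tot),
  fill = Energy_Source,
  weight = Tot * Twh
)) +
  geom_bar()

p
```

The bars should match approach 1; the x-axis is labelled "count" by default, so an explicit label via `labs(x = ...)` is probably worth adding.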