Imagine having this dataset:
Country Energy_Source Twh Tot
Italy Biofuel 24.5 100
Italy Nuclear 15.4 100
Italy Gas 40.1 100
Italy Hydro 20.0 100
France Biofuel 20.0 120
France Nuclear 75.0 120
France Gas 10.0 120
France Hydro 4.3 120
France Wind 10.7 120
Note: Tot is the sum of Twh by Country.
dataset1 <- data.frame(
  "Country" = c(rep(x = "Italy", times = 4), rep(x = "France", times = 5)),
  "Energy_Source" = c("Biofuel", "Nuclear", "Gas", "Hydro", "Biofuel", "Nuclear", "Gas", "Hydro", "Wind"),
  "Twh" = c(25, 15, 40, 20, 20, 75, 10, 5, 10),
  "Tot" = c(rep(x = 100, times = 4), rep(x = 120, times = 5))
)
Now, we want ggplot2 to interpret dataset1 as if it were the following dataset2, without performing a pivot_longer on dataset1.
Here is the new dataset2, which represents exactly the same information as dataset1 but with duplicated rows, so that ggplot2 interprets the number of occurrences of each element as a proportion:
Country Energy_Source Twh Tot
Italy Biofuel 25 100
Italy Biofuel 25 100
Italy Biofuel 25 100
.
.
. (22 more rows)
Italy Nuclear 15 100
. (14 more rows)
Italy Gas 40 100
. (etcetera)
dataset2 <- data.frame(
  "Country" = c(rep(x = "Italy", times = 100), rep(x = "France", times = 120)),
  "Energy_Source" = c(
    rep(x = "Biofuel", times = 25), rep(x = "Nuclear", times = 15),
    rep(x = "Gas", times = 40), rep(x = "Hydro", times = 20), rep(x = "Biofuel", times = 20),
    rep(x = "Nuclear", times = 75), rep(x = "Gas", times = 10), rep(x = "Hydro", times = 5),
    rep(x = "Wind", times = 10)
  ),
  "Tot" = c(rep(x = 100, times = 100), rep(x = 120, times = 120))
)
Normally, we would use the following code to draw the bar plot:
ggplot(data = dataset2, mapping = aes(
x = Tot,
y = reorder(Country, Tot),
fill = Energy_Source
)) +
geom_col()
See here:
But is it possible to use dataset1, and not dataset2, to create the same graph with ggplot2?
In other terms:
How can I fill a bar plot not according to the number of occurrences of an element in a dataset, but according to its value in another variable?
Thanks!
I tried performing a pivot_longer from the tidyr package, but it was too costly for my Shiny app.
Here are two ways to recreate your plot using dataset1.
Multiply Tot by Twh. This seems simplest and most efficient, provided you don't need the visible bars to be composed of many stacked smaller bars.
ggplot(dataset1, aes(
x = Tot*Twh,
y = reorder(Country, Tot),
fill = Energy_Source
)) +
geom_col()
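As a quick sanity check on why multiplying Tot by Twh reproduces the dataset2 picture, here is a minimal base-R sketch. It rebuilds dataset1 from the question, expands it the way dataset2 does (each row repeated Twh times), and compares the stacked segment lengths with Tot * Twh. The names expanded, from_expanded, weighted, Stacked and Segment are made up for this illustration and are not part of the original thread.

```r
# dataset1 as given in the question
dataset1 <- data.frame(
  Country = c(rep("Italy", 4), rep("France", 5)),
  Energy_Source = c("Biofuel", "Nuclear", "Gas", "Hydro",
                    "Biofuel", "Nuclear", "Gas", "Hydro", "Wind"),
  Twh = c(25, 15, 40, 20, 20, 75, 10, 5, 10),
  Tot = c(rep(100, 4), rep(120, 5))
)

# Repeat each row Twh times (a base-R stand-in for the dataset2 layout)
expanded <- dataset1[rep(seq_len(nrow(dataset1)), times = dataset1$Twh), ]

# geom_col() stacks x = Tot once per row, so each segment is the sum of Tot
from_expanded <- aggregate(Tot ~ Country + Energy_Source, data = expanded, FUN = sum)
names(from_expanded)[names(from_expanded) == "Tot"] <- "Stacked"

# The weighted approach draws one bar per row with length Tot * Twh
weighted <- transform(dataset1, Segment = Tot * Twh)

# Line the two up by group and compare the segment lengths
merged <- merge(from_expanded, weighted, by = c("Country", "Energy_Source"))
all_equal <- isTRUE(all.equal(merged$Stacked, merged$Segment))
print(all_equal)
```

If the two columns agree, the single-row bars should have the same lengths as the stacks built from duplicated rows, which is why the plots look alike.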
tidyr::uncount is what you want if you want to make copies of each observation. This replicates your dataset2 approach. I have added borders to show how this makes many small bars that are stacked together. This approach is fine here, but I've had cases where it plots very slowly (e.g. with >100k observations), messily (e.g. if the borders overwhelm the areas or create moiré effects), or inefficiently (e.g. a vector format like PDF would save a separate object for each bar plotted, even if it is <1 pixel wide).
ggplot(dataset1 |> tidyr::uncount(Twh), aes(
x = Tot,
y = reorder(Country, Tot),
fill = Energy_Source
)) +
geom_col(color = "gray50")
Edit: For the OP's description of "sequential" data, I think it's more efficient computationally and cleaner to plot if you calculate the aggregations (here, the average usage across years of energy source per country) with a dplyr step.
Compare a version of what's in the OP's suggested answer:
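library(ggplot2)
library(dplyr)
library(tidyr)

# dataset1b comes from the OP's suggested answer and must have the
# same columns as dataset1 for rbind() to work.
rbind(dataset1, dataset1b) |>
  mutate(across(Twh, ~.x * 10)) |>
  uncount(Twh) |>
  ggplot(aes(
    x = Tot / (10 * 2),
    y = reorder(Country, Tot),
    fill = Energy_Source
  )) +
  geom_col()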
...to a version using dplyr. The two look the same on the surface, but the uncount version plots 4,400 observations, whereas the dplyr version plots only the 9 contiguous bars we can see.
rbind(dataset1, dataset1b) |>
summarize(Total = mean(Tot * Twh), .by = c(Country, Energy_Source)) |>
ggplot(aes(
x = Total,
y = reorder(Country, Total),
fill = Energy_Source
)) +
geom_col()
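To make the difference in plotted rows concrete, here is a rough base-R sketch that counts rows instead of drawing anything. Note the assumption: the real dataset1b comes from the OP's suggested answer and is not reproduced here, so this sketch uses a hypothetical stand-in that simply copies dataset1. The names both, n_uncount, summarised and n_summary are also made up for the illustration.

```r
# dataset1 as given in the question
dataset1 <- data.frame(
  Country = c(rep("Italy", 4), rep("France", 5)),
  Energy_Source = c("Biofuel", "Nuclear", "Gas", "Hydro",
                    "Biofuel", "Nuclear", "Gas", "Hydro", "Wind"),
  Twh = c(25, 15, 40, 20, 20, 75, 10, 5, 10),
  Tot = c(rep(100, 4), rep(120, 5))
)

# ASSUMPTION: hypothetical stand-in for the OP's second dataset
dataset1b <- dataset1

both <- rbind(dataset1, dataset1b)

# uncount(Twh) after multiplying Twh by 10 yields one row per unit of Twh * 10
n_uncount <- sum(both$Twh * 10)

# The aggregated version keeps one row per Country / Energy_Source pair
summarised <- aggregate(
  Segment ~ Country + Energy_Source,
  data = transform(both, Segment = Tot * Twh),
  FUN = mean
)
n_summary <- nrow(summarised)

print(c(uncount_rows = n_uncount, summary_rows = n_summary))  # 4400 and 9
```

Under that stand-in, the counts match the 4,400 versus 9 figures above; with a different dataset1b the first number would change, but the aggregated version would still draw only one bar per country and energy source.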
For reference, your example plot with the same dimensions: