[SOLVED] Creating bar charts with binary data

I have the following data, which I am trying to use to create a bar chart showing how fruit preference varies with country:

see data table here

I want to create a bar chart that shows preference for apples, oranges, grapes and bananas by survey location (i.e. x = surveyloc and y = preference frequency of oranges, apples, bananas). I am not sure how to do this with binary data and am hoping to get some assistance.

Solution

If you want to show preference for multiple variables (e.g. fruits) across multiple locations when you only have binary data ("yes" or "no", or 0 vs. 1), a bar chart is probably not the best option. My recommendation would be something like a tile plot, so that you can convey preferences across the locations at a glance. Here's an example using some dummy data. I'll first show a bar plot (column plot), then my recommendation, which is a tile plot.

Example Dataset

library(ggplot2)
library(dplyr)
library(tidyr)

set.seed(8675309)
df <- data.frame(
  location = state.name[1:10],
  apples = rbinom(10,1,0.3),
  oranges = rbinom(10,1,0.1),
  pears = rbinom(10,1,0.25),
  grapes = rbinom(10,1,0.6),
  mangos = rbinom(10,1,0.65)
)

# tidy data
df <- df %>% pivot_longer(cols = -location) %>%
  mutate(value = factor(value))

I initially created df above in the same format as your dataset (location | pref1 | pref2 | pref3 | ...). Data in this format is difficult to plot with ggplot2, since ggplot2 is designed to handle what is referred to as Tidy Data. Tidy data is a better overall strategy for data management and adapts to whatever output you wish - I'd recommend reading that vignette for more info. Suffice it to say that after the code above, df is formatted as a "tidy" table with one row per location and fruit.
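To make the reshaping step concrete, here is a minimal sketch on a hypothetical two-row table (the `wide` and `long` names and the values are made up for illustration). It assumes the default behavior of pivot_longer(), which names the new columns "name" and "value":

```r
library(tidyr)

# Hypothetical wide table: one row per location, one column per fruit
wide <- data.frame(
  location = c("Alabama", "Alaska"),
  apples   = c(1, 0),
  oranges  = c(0, 1)
)

# Reshape to one row per location/fruit pair; every column except
# `location` is stacked into the default "name" and "value" columns
long <- pivot_longer(wide, cols = -location)
long
```

The two fruit columns become four rows, which is the layout the plots below expect.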

Note I've also turned the binary "value" column into a factor (since it only holds "0" or "1", and values of "0.5" and the like don't make sense here with this data).

"Bar Chart"

I put "bar chart" in quotes because, with the value (0 or 1) on the y axis and location on the x axis, we are really creating a "column chart". Formally, a "bar chart" only needs a list of values and computes the count, density, or probability for the y axis itself; in ggplot2 terms that is geom_bar(), whereas geom_col() plots the y values you supply. Regardless, here's an example:

bar_plot <-
  df %>%
  ggplot(aes(x=location, y=value, fill=name)) +
  geom_col(position="dodge", color='gray50', width=0.7) +
  scale_fill_viridis_d()
bar_plot

We could think about just showing where value==1, but that's probably not going to make things clearer.
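For reference, that alternative might look roughly like the sketch below. It uses a small hypothetical data frame (`tidy`, not the df above) in the same tidy shape, keeps only the rows where value is "1", and lets geom_bar() count them, so each bar's height is the number of fruits preferred at that location:

```r
library(ggplot2)
library(dplyr)

# Hypothetical tidy data in the same shape as df (location | name | value)
tidy <- data.frame(
  location = rep(c("Alabama", "Alaska", "Arizona"), each = 2),
  name     = rep(c("apples", "grapes"), times = 3),
  value    = factor(c(1, 0, 1, 1, 0, 1))
)

# Keep only the "yes" rows; value is a factor, so compare against "1"
liked <- tidy %>% filter(value == "1")

# geom_bar() counts rows per location; bars are stacked by fruit
liked_plot <- ggplot(liked, aes(x = location, fill = name)) +
  geom_bar(color = "gray50", width = 0.7) +
  scale_fill_viridis_d()
liked_plot
```

This answers "how many fruits are liked here", but it still hides which fruits are not preferred, which is part of why the tile plot is easier to read.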

Example of Tile Plot

What I think works better here is a tile plot. The idea is to spread location along the x axis and name (of the fruit) along the y axis, and then show the value field as the color of the resulting tiles. I think it makes things a bit easier to view, and it should work pretty much the same whether your data is binary or probabilistic. For probability data, you just don't need to convert to a factor first.

tile_plot <-
  df %>%
  ggplot(aes(x=location, y=name, fill=value)) +
  geom_tile(color='black') +
  scale_fill_manual(values=c(`0`="gray90", `1`="skyblue")) +
  coord_fixed() +
  scale_x_discrete(expand=expansion(0)) +
  scale_y_discrete(expand=expansion(0))
tile_plot

To explain what's going on here: we set up the aesthetics in ggplot(...) as described above. Then we draw the tiles with geom_tile(), where color= sets the outline around each tile. The actual fill colors are specified in scale_fill_manual(). The tiles are forced to be square via coord_fixed(), and the scale_x_*() and scale_y_*() calls remove the excess area around the tiles.
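For the probabilistic case mentioned earlier, a sketch along the following lines should work. The data here (`prob_df`) is hypothetical, randomly generated proportions, and the choice of scale_fill_gradient() with these colors is just one reasonable option; the key difference is that value stays numeric and gets a continuous fill scale instead of scale_fill_manual():

```r
library(ggplot2)

set.seed(1)
# Hypothetical preference proportions between 0 and 1 (not real survey data)
prob_df <- expand.grid(
  location = state.name[1:4],
  name     = c("apples", "oranges", "pears"),
  stringsAsFactors = FALSE
)
prob_df$value <- round(runif(nrow(prob_df)), 2)

# Same tile layout as before, but value is left numeric and mapped
# to a continuous gradient fixed to the 0-1 range
prob_plot <- ggplot(prob_df, aes(x = location, y = name, fill = value)) +
  geom_tile(color = "black") +
  scale_fill_gradient(low = "gray90", high = "skyblue", limits = c(0, 1)) +
  coord_fixed() +
  scale_x_discrete(expand = expansion(0)) +
  scale_y_discrete(expand = expansion(0))
prob_plot
```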