r ggplot2

Creating bar charts with binary data


I have the following data, which I am trying to use to create a bar chart showing how fruit preference varies by country:

see data table here

I want to create a bar chart that shows preference for apples, oranges, grapes and bananas by survey location (i.e. x = surveyloc and y = preference frequency for each fruit). I am not sure how to do this with binary data and am hoping to get some assistance.


Solution

  • If you want to show preference for multiple variables (e.g. fruits) across multiple locations, and all you have is binary data ("yes" or "no", or 0 vs 1), a bar chart is probably not the best option. I'd recommend something like a tile plot, which conveys preferences across the locations at a glance. Here's an example using some dummy data: first a bar plot (column plot), then my recommendation, a tile plot.

    Example Dataset

    library(ggplot2)
    library(dplyr)
    library(tidyr)
    
    set.seed(8675309)
    df <- data.frame(
      location = state.name[1:10],
      apples = rbinom(10,1,0.3),
      oranges = rbinom(10,1,0.1),
      pears = rbinom(10,1,0.25),
      grapes = rbinom(10,1,0.6),
      mangos = rbinom(10,1,0.65)
    )
    
    # tidy data
    df <- df %>% pivot_longer(cols = -location) %>%
      mutate(value = factor(value))
    

    I created df above initially in the same format you have for your dataset (location | pref1 | pref2 | pref3 | ...). Data in this wide format is awkward to plot with ggplot2, which is designed to handle what is referred to as Tidy Data. Tidy data is overall a better strategy for data management and is adaptable to whatever output you wish; I'd recommend reading that vignette for more info. After the code above, df is formatted as a "tidy" table with one row per location and fruit, in the columns location, name and value.

    Note I've also turned the binary "value" column into a factor (since it only holds "0" or "1", and values of "0.5" and the like don't make sense here with this data).
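
    To make the reshaping step concrete, here is a minimal sketch on a hypothetical three-row table (the objects wide and long and their values are made up for illustration, not taken from your data). It spells out names_to and values_to, which default to "name" and "value" in pivot_longer(), so the result has the same columns as df above.

    ```r
    library(dplyr)
    library(tidyr)

    # Hypothetical survey table in the wide layout: one column per fruit
    wide <- data.frame(
      location = c("Alabama", "Alaska", "Arizona"),
      apples   = c(1, 0, 0),
      oranges  = c(0, 0, 1)
    )

    # One row per location/fruit pair; "name" and "value" are the defaults,
    # written out here only to make the mapping explicit
    long <- wide %>%
      pivot_longer(cols = -location, names_to = "name", values_to = "value") %>%
      mutate(value = factor(value))  # 0/1 become the factor levels "0" and "1"

    long
    ```

    Three locations and two fruits give six rows, each holding a single 0/1 answer.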

    "Bar Chart"

    I put "bar chart" in quotes because, with the value (0 or 1) on the y axis and location on the x axis, what we are creating is strictly a "column chart". "Bar charts" formally only need a list of values and plot count, density, or probability on the y axis. In ggplot2 terms, that is the difference between geom_bar(), which counts rows for you, and geom_col(), which plots the values you supply. Regardless, here's an example:

    bar_plot <-
      df %>%
      ggplot(aes(x=location, y=value, fill=name)) +
      geom_col(position="dodge", color='gray50', width=0.7) +
      scale_fill_viridis_d()
    bar_plot
    


    We could think about just showing where value==1, but that's probably not going to make things clearer.
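
    For reference, a rough sketch of that alternative, using a small hypothetical table (long, liked and liked_plot are illustrative names, and the values are made up). It keeps only the rows where value is 1 and lets geom_bar() count them, so every bar has height 1:

    ```r
    library(ggplot2)
    library(dplyr)

    # Hypothetical tidy table: one row per location/fruit pair
    long <- data.frame(
      location = rep(c("Alabama", "Alaska", "Arizona"), each = 2),
      name     = rep(c("apples", "grapes"), times = 3),
      value    = factor(c(1, 0, 1, 1, 0, 1))
    )

    # Keep only the fruits that were preferred
    liked <- long %>% filter(value == 1)

    # geom_bar() counts rows, so no y aesthetic is needed
    liked_plot <-
      liked %>%
      ggplot(aes(x = location, fill = name)) +
      geom_bar(position = "dodge", color = "gray50", width = 0.7) +
      scale_fill_viridis_d()
    liked_plot
    ```

    Because every remaining bar has the same height, the only information left is which bars are present, and that is what the tile plot below shows more directly.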

    Example of Tile Plot

    What I think works better here is a tile plot. The idea is that you spread location along the x axis and name (of the fruit) along the y axis, and then show the value field as the color of the resulting tiles. I think it makes things a bit easier to read, and it should work pretty much the same whether your data is binary or probabilistic. For probability data, you just don't need to convert value to a factor first.

    tile_plot <-
      df %>%
      ggplot(aes(x=location, y=name, fill=value)) +
      geom_tile(color='black') +
      scale_fill_manual(values=c(`0`="gray90", `1`="skyblue")) +
      coord_fixed() +
      scale_x_discrete(expand=expansion(0)) +
      scale_y_discrete(expand=expansion(0))
    tile_plot
    


    To explain a little of what's going on here: we set up the aesthetics as indicated above in ggplot(...). Then we draw the tiles with geom_tile(), where color= sets the outline drawn around each tile. The actual fill colors are defined in scale_fill_manual(). The tiles are forced to be square via coord_fixed(), and the scale_x_*() and scale_y_*() calls remove the excess area around the tiles.
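
    For the probabilistic case mentioned earlier, a sketch might look like the following. The probs table and its random values are purely hypothetical, and scale_fill_gradient() is my choice of continuous scale here, swapped in for scale_fill_manual() because value stays numeric:

    ```r
    library(ggplot2)

    # Hypothetical preference probabilities, one row per location/fruit pair
    set.seed(1)
    probs <- expand.grid(
      location = state.name[1:4],
      name     = c("apples", "oranges", "pears")
    )
    probs$value <- runif(nrow(probs))  # stays numeric, no factor() step

    prob_tiles <-
      ggplot(probs, aes(x = location, y = name, fill = value)) +
      geom_tile(color = "black") +
      # continuous fill from gray (0) to blue (1)
      scale_fill_gradient(low = "gray90", high = "skyblue", limits = c(0, 1)) +
      coord_fixed() +
      scale_x_discrete(expand = expansion(0)) +
      scale_y_discrete(expand = expansion(0))
    prob_tiles
    ```

    Everything else is the same as in the binary version; only the fill scale changes.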