r ggplot2 distribution geom-bar

R: ggplot distribution diagram with 'more than limit' bar and geom_vline


colleagues.

I am trying to build a distribution diagram that will satisfy the following conditions:

  1. Shows percentage of the values lying in each bin.
  2. Bin size is specified by user.
  3. All the bars are physically the same width (in pixels). The last bar shows the percentage of values that lie beyond the specified limit.
  4. There is a vertical line that marks the median value.

Problem: In order to visualize everything that is 'more than limit' I have to make the x-axis discrete; otherwise the last bar would have to span everything from the limit to the maximum and could be arbitrarily wide. But in order to put a vertical line at a specific value, the x-axis should be continuous.

Any ideas on how I can work around this?

Code: here is a code example:

library(dplyr)
library(ggplot2)

data <- data.frame(value = runif(1000, min = 0, max = 1000))
data$value <- round(data$value, digits = 0)

median_elapsed <- median(data$value)

bin_breaks <- c(seq(0, 
                    median_elapsed, 
                    length.out = 11), 
                Inf)

bin_labels <- c(seq(0, 
                    median_elapsed - (median_elapsed / 10), 
                    length.out = 10), 
                paste0("> ", median_elapsed)) 

data$bins <- cut(data$value, 
                 breaks = bin_breaks, 
                 labels = bin_labels, 
                 include.lowest = TRUE, 
                 right = FALSE) 

get_home_data_percent <- data %>%
  group_by(bins) %>%
  summarize(count = n()) %>%
  mutate(percentage = count / sum(count) * 100)

ggplot(get_home_data_percent, aes(x = bins, y = percentage)) +
  geom_bar(stat = "identity", just = 0) +
  scale_x_discrete(drop = FALSE) +
  labs(x = "Elapsed Time", 
       y = "Percentage", 
       title = "Histogram of Elapsed Time") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Illustration: So here I have almost everything needed, but not the vertical line with median value, as the x-axis is discrete.



Solution

  • It's not clear to me that you couldn't put bins on a continuous scale. (Although maybe we'd want some custom axis labeling here for the last bin to clarify its meaning...)

    Here I calculate the median for geom_vline, and separately set a cutoff (upper) for the top category, which collects all values at or above that cutoff.

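    library(dplyr); library(ggplot2)

    med = median(data$value)   # position of the vertical line
    upper = 600                # values at or above this go into the last bin
    bin_size = upper / 11

    data |>
      # floor each value to the left edge of its bin; lump everything >= upper together
      mutate(bin = if_else(value < upper,
                           value %/% bin_size * bin_size, upper)) |>
      summarize(n = n(), .by = bin) |>   # .by needs dplyr >= 1.1.0
      # shift by half a bin so each bar is centered over its interval
      ggplot(aes(bin + bin_size/2, n / sum(n))) +
      geom_col() +
      geom_vline(xintercept = med) +
      scale_x_continuous(breaks = scales::breaks_width(bin_size),
                         labels = scales::number_format(accuracy = 0.1))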
    

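As for the custom axis labeling mentioned above, here is a minimal sketch of one way it might be done; it is not the only option. It assumes a cutoff of 600 split into 10 regular bins (n_bins is my own illustrative parameter, chosen so the bin edges are whole numbers), uses a fixed seed so it is reproducible, and defines a hypothetical helper label_bins() that is not part of ggplot2 or scales. Each break is labeled with the left edge of its bin, the overflow bin is labeled "> 600", and the y-axis shows percentages as the question asks.

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # assumption: fixed seed, only to make the sketch reproducible
data <- data.frame(value = round(runif(1000, min = 0, max = 1000)))

med      <- median(data$value)
upper    <- 600             # illustrative cutoff for the "more than limit" bar
n_bins   <- 10              # assumption: user-chosen number of regular bins
bin_size <- upper / n_bins

binned <- data |>
  # left edge of each regular bin; everything >= upper goes into one overflow bin
  mutate(bin = if_else(value < upper,
                       value %/% bin_size * bin_size, upper)) |>
  summarize(n = n(), .by = bin) |>
  mutate(percentage = n / sum(n) * 100)

# Hypothetical helper: plain number for regular bins, "> upper" for the overflow bin
label_bins <- function(x) {
  ifelse(x >= upper, paste0("> ", upper), as.character(x))
}

p <- ggplot(binned, aes(bin + bin_size / 2, percentage)) +
  geom_col() +
  geom_vline(xintercept = med, linetype = "dashed") +
  scale_x_continuous(breaks = seq(0, upper, by = bin_size),
                     labels = label_bins) +
  labs(x = "Elapsed Time", y = "Percentage")

print(p)
```

Because the axis stays continuous, the median line lands at its true position as long as the median is below the cutoff. If the median were above the cutoff, the line would fall inside or beyond the overflow bar, where x positions no longer correspond to actual values.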