rmedian

How can I find median bin of data binned by column in R?


I am seeing several related answers which discuss finding medians or the median bin from data frames where frequencies are available by row, such as:

I am working with Census data where observations are geographies (e.g. states, counties), and columns are numbers of households falling into 16 income buckets. I can find the median income bucket by manually creating cumulative sums of the number of households in those 16 columns (example below with 5 columns), but it's pretty ugly. The suggested solutions for calculating grouped means/medians over rows of observations don't really work, and I can't come up with a simpler way.

library(dplyr)
library(magrittr)

geoid = seq(100)
df_example = data.frame(
  geoid = geoid,
  inc1 = sample(0:200, 100),
  inc2 = sample(0:200, 100),
  inc3 = sample(0:200, 100),
  inc4 = sample(0:200, 100),
  inc5 = sample(0:200, 100)
)
df_example$total = df_example$inc1 + df_example$inc2 + df_example$inc3 + df_example$inc4 + df_example$inc5

df_example = df_example %>% mutate(
  inc1_cumsum = inc1,
  inc2_cumsum = inc1 + inc2,
  inc3_cumsum = inc1 + inc2 + inc3,
  inc4_cumsum = inc1 + inc2 + inc3 + inc4,
  inc5_cumsum = inc1 + inc2 + inc3 + inc4 + inc5
)

df_example = df_example %>% mutate(
  median_inc_bin = case_when(
    total/2 < inc1_cumsum ~ "inc1",
    total/2 < inc2_cumsum ~ "inc2",
    total/2 < inc3_cumsum ~ "inc3",
    total/2 < inc4_cumsum ~ "inc4",
    total/2 < inc5_cumsum ~ "inc5"
  )
)

Solution

  • If we pivot longer, then we can select the first bin per geoid where the cumulative sum exceeds the median.

    df_example |>
      tidyr::pivot_longer(-geoid) |>
      filter(cumsum(value) > sum(value)/2, .by = geoid) |>
      slice_head(n = 1, by = geoid)
    

    When I use this fake data:

    set.seed(42)
    geoid = seq(10)
    df_example = data.frame(
      geoid = geoid,
      inc1 = sample(0:200, 10),
      inc2 = sample(0:200, 10),
      inc3 = sample(0:200, 10),
      inc4 = sample(0:200, 10),
      inc5 = sample(0:200, 10)
    )
    
    #   geoid inc1 inc2 inc3 inc4 inc5
    #1      1   48   70   40   57   67
    #2      2   64   99   88   41  145
    #3      3  152   88   26   23  108
    #4      4   73  164  163  157   91
    #5      5  145  109  108   42  196
    #6      6  121   19    4  142    3
    #7      7  200  153  161  149  177
    #8      8  127  113   91  200   98
    #9      9   46  110  103  135  176
    #10    10   23  130    2   35  153
    

    ...I get this output:

    # A tibble: 10 × 3
       geoid name  value
       <int> <chr> <int>
     1     1 inc3     40
     2     2 inc3     88
     3     3 inc2     88
     4     4 inc3    163
     5     5 inc3    108
     6     6 inc4    142
     7     7 inc3    161
     8     8 inc3     91
     9     9 inc4    135
    10    10 inc4     35
    

    ...which corresponds with the output using your code:

       geoid inc1 inc2 inc3 inc4 inc5 total inc1_cumsum inc2_cumsum inc3_cumsum inc4_cumsum inc5_cumsum median_inc_bin
    1      1   48   70   40   57   67   282          48         118         158         215         282           inc3
    2      2   64   99   88   41  145   437          64         163         251         292         437           inc3
    3      3  152   88   26   23  108   397         152         240         266         289         397           inc2
    4      4   73  164  163  157   91   648          73         237         400         557         648           inc3
    5      5  145  109  108   42  196   600         145         254         362         404         600           inc3
    6      6  121   19    4  142    3   289         121         140         144         286         289           inc4
    7      7  200  153  161  149  177   840         200         353         514         663         840           inc3
    8      8  127  113   91  200   98   629         127         240         331         531         629           inc3
    9      9   46  110  103  135  176   570          46         156         259         394         570           inc4
    10    10   23  130    2   35  153   343          23         153         155         190         343           inc4