I am seeing several related answers which discuss finding medians or the median bin from data frames where frequencies are available by row, such as:
I am working with Census data where observations are geographies (e.g. states, counties), and columns are numbers of households falling into 16 income buckets. I can find the median income bucket by manually creating cumulative sums of the number of households in those 16 columns (example below with 5 columns), but it's pretty ugly. The suggested solutions for calculating grouped means/medians over rows of observations don't really work, and I can't come up with a simpler way.
library(dplyr)
library(magrittr)
geoid = seq(100)
df_example = data.frame(
geoid = geoid,
inc1 = sample(0:200, 100),
inc2 = sample(0:200, 100),
inc3 = sample(0:200, 100),
inc4 = sample(0:200, 100),
inc5 = sample(0:200, 100)
)
df_example$total = df_example$inc1 + df_example$inc2 + df_example$inc3 + df_example$inc4 + df_example$inc5
df_example = df_example %>% mutate(
inc1_cumsum = inc1,
inc2_cumsum = inc1 + inc2,
inc3_cumsum = inc1 + inc2 + inc3,
inc4_cumsum = inc1 + inc2 + inc3 + inc4,
inc5_cumsum = inc1 + inc2 + inc3 + inc4 + inc5
)
df_example = df_example %>% mutate(
median_inc_bin = case_when(
total/2 < inc1_cumsum ~ "inc1",
total/2 < inc2_cumsum ~ "inc2",
total/2 < inc3_cumsum ~ "inc3",
total/2 < inc4_cumsum ~ "inc4",
total/2 < inc5_cumsum ~ "inc5"
)
)
If we pivot longer, then we can select the first bin per geoid
where the cumulative sum exceeds the median.
df_example |>
tidyr::pivot_longer(-geoid) |>
filter(cumsum(value) > sum(value)/2, .by = geoid) |>
slice_head(n = 1, by = geoid)
When I use this fake data:
set.seed(42)
geoid = seq(10)
df_example = data.frame(
geoid = geoid,
inc1 = sample(0:200, 10),
inc2 = sample(0:200, 10),
inc3 = sample(0:200, 10),
inc4 = sample(0:200, 10),
inc5 = sample(0:200, 10)
)
# geoid inc1 inc2 inc3 inc4 inc5
#1 1 48 70 40 57 67
#2 2 64 99 88 41 145
#3 3 152 88 26 23 108
#4 4 73 164 163 157 91
#5 5 145 109 108 42 196
#6 6 121 19 4 142 3
#7 7 200 153 161 149 177
#8 8 127 113 91 200 98
#9 9 46 110 103 135 176
#10 10 23 130 2 35 153
...I get this output:
# A tibble: 10 × 3
geoid name value
<int> <chr> <int>
1 1 inc3 40
2 2 inc3 88
3 3 inc2 88
4 4 inc3 163
5 5 inc3 108
6 6 inc4 142
7 7 inc3 161
8 8 inc3 91
9 9 inc4 135
10 10 inc4 35
...which corresponds with the output using your code:
geoid inc1 inc2 inc3 inc4 inc5 total inc1_cumsum inc2_cumsum inc3_cumsum inc4_cumsum inc5_cumsum median_inc_bin
1 1 48 70 40 57 67 282 48 118 158 215 282 inc3
2 2 64 99 88 41 145 437 64 163 251 292 437 inc3
3 3 152 88 26 23 108 397 152 240 266 289 397 inc2
4 4 73 164 163 157 91 648 73 237 400 557 648 inc3
5 5 145 109 108 42 196 600 145 254 362 404 600 inc3
6 6 121 19 4 142 3 289 121 140 144 286 289 inc4
7 7 200 153 161 149 177 840 200 353 514 663 840 inc3
8 8 127 113 91 200 98 629 127 240 331 531 629 inc3
9 9 46 110 103 135 176 570 46 156 259 394 570 inc4
10 10 23 130 2 35 153 343 23 153 155 190 343 inc4