Problem
I have changed the presentation of my question as there seems to be a lack of clarity.
So, we have thousands of hospitals. Their patients are between the ages of 0 and 100 years old. For each age, they have certain numbers of patients e.g. Hospital1 has 10 patients who are 1 year old, 12 patients who are two-year olds, and 0 patients that are a 100 years old etc.
The above dataset is a small and simplified example, my actual dataset contains data for thousands of hospitals and millions of patients.
Outcome sought
I want to know the median patient age of each hospital.
Solution so far
Expand the table so that there is a separate row for each patient's age and then take the median. This will result in my table having hundreds of millions of rows, so is undesirable.
library(dplyr)
## table
hospital <- c(rep(1:3, each = 10))
patient_age <- c(rep(seq(0, 90, by = 10), 3))
number_patients <- round(runif(30, 0, 100),0)
df <- bind_cols(hospital, patient_age, number_patients)
colnames(df) <- c("hospital", "patient_age", "number_patients")
## my impractical solution
df1 <- filter(df, hospital == 1)
df1a <- rep(df1$patient_age, df1$number_patients)
median(df1a)
## there's no way I can repeat this for each hospital (there are 1000s)
EDIT:
Here's how you calculate the mean patient age by hospital:
df %>%
group_by(hospital) %>%
summarise(
mean_age = sum(patient_age*number_patients)/sum(number_patients)
)
or simply:
df %>%
group_by(hospital) %>%
summarise(
mean_age = mean(rep(patient_age,number_patients))
)
Here's the medians:
df %>%
group_by(hospital) %>%
summarise(
median_age = sort(rep(patient_age,number_patients))[length(rep(patient_age,number_patients))/2]
)
Here, we subset sort(rep(patient_age,number_patients))
on its middle value, which is length(rep(patient_age,number_patients))/2
EDIT 2:
or simply:
df %>%
group_by(hospital) %>%
summarise(
median_age = median(rep(patient_age,number_patients))
)