rmedianfrequency-table

How to get median with frequency table in R?


Problem

I have changed the presentation of my question as there seems to be a lack of clarity.

So, we have thousands of hospitals. Their patients are between the ages of 0 and 100 years old. For each age, they have certain numbers of patients e.g. Hospital1 has 10 patients who are 1 year old, 12 patients who are two-year olds, and 0 patients that are a 100 years old etc.

enter image description here

The above dataset is a small and simplified example, my actual dataset contains data for thousands of hospitals and millions of patients.

Outcome sought

I want to know the median patient age of each hospital.

Solution so far

Expand the table so that there is a separate row for each patient's age and then take the median. This will result in my table having hundreds of millions of rows, so is undesirable.

library(dplyr)

## table
hospital <- c(rep(1:3, each = 10))
patient_age <- c(rep(seq(0, 90, by = 10), 3))
number_patients <- round(runif(30, 0, 100),0)
df <- bind_cols(hospital, patient_age, number_patients)
colnames(df) <- c("hospital", "patient_age", "number_patients")

## my impractical solution
df1 <- filter(df, hospital == 1)
df1a <- rep(df1$patient_age, df1$number_patients)
median(df1a)

## there's no way I can repeat this for each hospital (there are 1000s) 

Solution

  • EDIT:

    Here's how you calculate the mean patient age by hospital:

    df %>%
      group_by(hospital) %>%
      summarise(
        mean_age = sum(patient_age*number_patients)/sum(number_patients)
        )
    

    or simply:

    df %>%
      group_by(hospital) %>%
      summarise(
        mean_age = mean(rep(patient_age,number_patients))
      )
    

    Here's the medians:

    df %>%
      group_by(hospital) %>%
      summarise(
        median_age = sort(rep(patient_age,number_patients))[length(rep(patient_age,number_patients))/2]
      )
    

    Here, we subset sort(rep(patient_age,number_patients)) on its middle value, which is length(rep(patient_age,number_patients))/2

    EDIT 2:

    or simply:

    df %>%
      group_by(hospital) %>%
      summarise(
        median_age = median(rep(patient_age,number_patients))
      )