dplyr categories frequency boxplot sample-size

Exclude categories in boxplot according to sample size / frequency


I have a large database from which I would like to create a boxplot:

data: test.hospital

y: test results (%): 1500 samples in total

x1: different years (2011-2017)

x2: different hospitals (30 different hospital names)

The sample size differs a lot across hospitals, so for some of them there is too little data to draw any conclusions. I would therefore like to exclude from my boxplot all hospitals with a sample size < 15.

So what I would like to do is add an extra column with the number of times each hospital was sampled, and use that column to exclude the hospitals with a low sample size from my boxplot.

As you can probably tell, I am very new to R, so for most people this is probably a very easy question. Still, I would really like to know the answer!

Thank you so much :)


Solution

  • Try the dplyr package. group_by() groups the rows by hospital, mutate() adds a column with the number of rows in each group, and filter() keeps only the hospitals with at least 15 observations. %>% is the pipe operator, which passes the result of one function on to the next.

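    install.packages("dplyr")  # only needed once; the package name must be quoted
    library(dplyr)
    test.hospital.filtered <- group_by(test.hospital, x2) %>%
      mutate(sampled_count = n()) %>%  # number of rows for each hospital
      filter(sampled_count >= 15)      # drop hospitals with fewer than 15 rows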
    

    Now use ggplot2 to create the boxplots. Years are on the x axis, test results are on the y axis, and only the filtered hospitals are shown.

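    install.packages("ggplot2")  # the package is called ggplot2; the name must be quoted
    library(ggplot2)
    # factor(x1) treats each year as a category, giving one box per year and hospital
    ggplot(test.hospital.filtered, aes(x = factor(x1), y = y, fill = x2)) +
      geom_boxplot()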
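
    If you want to check the filtering before running it on your real data, here is a minimal sketch on made-up data. The three hospitals "A", "B" and "C" and their row counts are assumptions for illustration only; just the column names x1, x2 and y come from the question. It also uses a slightly shorter variant that calls n() directly inside filter(), so no helper column is needed.

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # make the simulated values reproducible

# Hypothetical stand-in for test.hospital:
# hospital A has 40 rows, B has 25 rows, C has only 5 rows
test.hospital <- data.frame(
  x2 = rep(c("A", "B", "C"), times = c(40, 25, 5)),
  x1 = sample(2011:2017, 70, replace = TRUE),
  y  = runif(70, min = 0, max = 100)
)

# Keep only hospitals with at least 15 rows; n() is evaluated per group
test.hospital.filtered <- test.hospital %>%
  group_by(x2) %>%
  filter(n() >= 15) %>%
  ungroup()

# Show which hospitals are left and how many rows each one has
print(count(test.hospital.filtered, x2))

# One box per year and hospital; hospital C no longer appears
p <- ggplot(test.hospital.filtered, aes(x = factor(x1), y = y, fill = x2)) +
  geom_boxplot() +
  labs(x = "Year", y = "Test result (%)", fill = "Hospital")
print(p)
```

    The mutate() version above is still the better choice if you want to keep the count as a column, for example to print the sample size next to each hospital name.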