
How to obtain conditioned results from an R dataframe


This is my first message here. I'm trying to solve an R exercise from an edX R course, and I'm stuck on it. It would be great if somebody could help me solve it. Here are the dataframe and the question:

> students
   height shoesize gender population
1     181       44   male     kuopio
2     160       38 female     kuopio
3     174       42 female     kuopio
4     170       43   male     kuopio
5     172       43   male     kuopio
6     165       39 female     kuopio
7     161       38 female     kuopio
8     167       38 female    tampere
9     164       39 female    tampere
10    166       38 female    tampere
11    162       37 female    tampere
12    158       36 female    tampere
13    175       42   male    tampere
14    181       44   male    tampere
15    180       43   male    tampere
16    177       43   male    tampere
17    173       41   male    tampere

Given the dataframe above, create two subsets with students whose height is equal to or below the median height (call it students.short) and students whose height is strictly above the median height (call it students.tall). What is the mean shoesize for each of the above 2 subsets by population?

I've been able to create the two subsets students.tall and students.short (both just display TRUE/FALSE values), but I don't know how to obtain the mean by population. The data should be displayed like this:

                    kuopio     tampere
students.short      xxxx       xxxx
students.tall       xxxx       xxxx

Many thanks if you can give me a hand!


Solution

  • We can split by a logical vector based on the median height

    # median height
    medHeight <- median(students$height, na.rm = TRUE)
    
    # split the data into a list of two data.frames using 'medHeight':
    # element "FALSE" holds height <= median (short), "TRUE" holds height > median (tall)
    lst1 <- with(students, split(students, height > medHeight))
    

    Then loop over the list and use aggregate from base R to get the mean shoesize per population

    lapply(lst1, function(dat) aggregate(shoesize ~ population, 
            data = dat, FUN = mean, na.rm = TRUE))
    

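    However, we don't need to create two separate datasets or a list. It can be done in one step by grouping on both 'population' and a 'grp' column created from the logical vector (TRUE means taller than the median)

    library(dplyr)
    students %>%
        # grp is TRUE for height > medHeight, FALSE otherwise
        group_by(grp = height > medHeight, population) %>%
        # .groups = "drop" returns an ungrouped result (dplyr >= 1.0.0)
        summarise(shoesize = mean(shoesize), .groups = "drop")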
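
If the goal is the exact wide layout shown in the question (subsets as rows, populations as columns), one possible base R sketch is below. It is not part of the original answer: it swaps in `tapply` instead of `aggregate`, and the `grp` and `res` names are only illustrative. The data frame is rebuilt from the values printed in the question so the snippet runs on its own.

```r
# Data copied from the question
students <- data.frame(
  height = c(181, 160, 174, 170, 172, 165, 161, 167, 164, 166,
             162, 158, 175, 181, 180, 177, 173),
  shoesize = c(44, 38, 42, 43, 43, 39, 38, 38, 39, 38,
               37, 36, 42, 44, 43, 43, 41),
  gender = c("male", "female", "female", "male", "male", "female", "female",
             "female", "female", "female", "female", "female",
             "male", "male", "male", "male", "male"),
  population = rep(c("kuopio", "tampere"), times = c(7, 10))
)

medHeight <- median(students$height)  # 170 for this data

# Label every row instead of keeping a bare TRUE/FALSE vector
# (illustrative name 'grp'; height equal to the median counts as short)
grp <- ifelse(students$height > medHeight, "students.tall", "students.short")

# Mean shoesize for each label/population combination, returned as a matrix
res <- with(students, tapply(shoesize, list(grp, population), mean))
res
#                kuopio tampere
# students.short   39.5    37.6
# students.tall    43.0    42.6
```

The same labels can also be used to build the two subsets the exercise asks for, e.g. `students.short <- students[students$height <= medHeight, ]`, which gives a data frame rather than a TRUE/FALSE vector.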