rdataframesummeanreorganize

Data Frame: mean over certain variables, ignore but keep others


I am analysing my data with R for the first time which is a bit challenging. I have a data frame with my data that looks like this:

head(data)
      subject group age trial cond acc  rt
    1   S1     2     1     1   1    1   5045
    2   S1     2     1     2   2    1   8034
    3   S1     2     1     3   1    1   6236
    4   S1     2     1     4   2    1   8087
    5   S1     2     1     5   3    0   8756
    6   S1     2     1     6   1    1   6619

I would like to compute a mean and standard deviation for each subject in each condition for rt and a sum for each subject in each condition for acc. All the other variables are should remain the same (group and age are subject-specific, and trial can be disregarded).

I have tried using aggregate but that seemed kind of complicated because I had to do it in several steps and re-add information...

I'd be thankful for any help =)

Edit: I realise that I wasn't being clear. I want trial to be disregarded and end up with one row per subject per condition:

head(data_new)
      subject group age cond rt_mean  rt_sd    acc_sum
    1   S1     2     1  1    7581     100      5
    2   S2     2     1  2    8034     150      4

Sorry about the confusion!


Solution

  • The package dplyr is made for this:

    library(dplyr)
    d %>% 
      group_by(subject, cond) %>% # we group by the two values
      summarise(
        mean_rt = mean(rt, na.rm=T),
        sd_rt = sd(rt, na.rm=T),
        sum_acc = sum(acc, na.rm=T) # here we apply each function to summarise values
      )
    
    
    # A tibble: 3 x 5
    # Groups:   subject [?]
      subject  cond mean_rt sd_rt sum_acc
      <fct>   <int>   <dbl> <dbl>   <int>
    1 S1          1   5967. 821.        3
    2 S1          2   8060.  37.5       2
    3 S1          3   8756   NA         0
    # NA for the last sd_rt is because you can't have 
    # sd for a single obs.
    

    Basically you need to group_by the columns (one or more) that you need to use as grouping, then inside summarise, you apply each function you need (mean, sd, sum, ecc) to each variable (rt, acc, ecc).

    Change summarise with mutate if you want to keep all variables:

    d %>% 
      select(-trial) %>% # use select with -var_name to eliminate columns 
      group_by(subject, cond) %>% 
      mutate(
        mean_rt = mean(rt, na.rm=T),
        sd_rt = sd(rt, na.rm=T),
        sum_acc = sum(acc, na.rm=T)
      ) %>% 
      ungroup()
    # A tibble: 6 x 9
    subject group   age  cond   acc    rt mean_rt sd_rt sum_acc
    <fct>   <int> <int> <int> <int> <int>   <dbl> <dbl>   <int>
    1 S1          2     1     1     1  5045   5967. 821.        3
    2 S1          2     1     2     1  8034   8060.  37.5       2
    3 S1          2     1     1     1  6236   5967. 821.        3
    4 S1          2     1     2     1  8087   8060.  37.5       2
    5 S1          2     1     3     0  8756   8756   NA         0
    6 S1          2     1     1     1  6619   5967. 821.        3
    

    Update based on op request, maybe this is what you need:

    d %>% 
      group_by(subject, cond, group, age) %>% 
      summarise(
        mean_rt = mean(rt, na.rm=T),
        sd_rt = sd(rt, na.rm=T),
        sum_acc = sum(acc, na.rm=T)
      ) 
    # A tibble: 3 x 7
    # Groups:   subject, cond, group [?]
    subject  cond group   age mean_rt sd_rt sum_acc
    <fct>   <int> <int> <int>   <dbl> <dbl>   <int>
    1 S1          1     2     1   5967. 821.        3
    2 S1          2     2     1   8060.  37.5       2
    3 S1          3     2     1   8756   NA         0
    

    Data used:

    tt <- "subject group age trial cond acc  rt
    S1     2     1     1   1    1   5045
    S1     2     1     2   2    1   8034
    S1     2     1     3   1    1   6236
    S1     2     1     4   2    1   8087
    S1     2     1     5   3    0   8756
    S1     2     1     6   1    1   6619"
    
    d <- read.table(text=tt, header=T)