
How to summarize several independent variables at once in R?


For example, suppose the data look like this:

Cultivar=rep(c("CV1","CV2"),each=12)
Nitrogen=rep(rep(c("N0","N1","N2","N3"), each=3),2)
Block=rep(c("I","II","III"),8)
Yield=c(99,109,89,115,142,133,121,157,142,125,150,139,82,104,99,117,
        125,127,145,154,154,151,166,175)
Protein=c(25,35,45,55,44,33,21,57,42,25,50,39,72,14,79,71,25,27,45,54,47,51,66,75)
dataA=data.frame(Cultivar,Nitrogen,Block,Yield,Protein)

I'd like to summarize the Yield and Protein data, so I used the code below.

library(plyr)
dataB=ddply(dataA, c("Cultivar","Nitrogen"), summarise, mean=mean(Yield), 
            sd=sd(Yield), n=length(Yield), se=sd/sqrt(n))
dataC=ddply(dataA, c("Cultivar","Nitrogen"), summarise, mean=mean(Protein), 
            sd=sd(Protein), n=length(Protein), se=sd/sqrt(n))
dataB$Protein=dataC$mean
dataB$Protein_se=dataC$se
dataB

  Cultivar Nitrogen mean        sd n        se  Protein Protein_se
1      CV1       N0   99 10.000000 3  5.773503 35.00000   5.773503
2      CV1       N1  130 13.747727 3  7.937254 44.00000   6.350853
3      CV1       N2  140 18.083141 3 10.440307 40.00000  10.440307
4      CV1       N3  138 12.529964 3  7.234178 38.00000   7.234178
5      CV2       N0   95 11.532563 3  6.658328 55.00000  20.599353
6      CV2       N1  123  5.291503 3  3.055050 41.00000  15.011107
7      CV2       N2  151  5.196152 3  3.000000 48.66667   2.728451
8      CV2       N3  164 12.124356 3  7.000000 64.00000   7.000000

But I believe there must be a much simpler way to summarize several variables at once.

Could you let me know how to do that?

Many thanks,


Solution

  • You could use dplyr::summarize() with across() over the desired columns, specify the groups with .by (available in dplyr 1.1.0 and later), and supply all the summary statistics you want as a named list:

    library(dplyr)
    
    dataA %>%
      summarize(across(Yield:Protein, 
                       .fns = list(Mean = mean, 
                                   SD = sd, 
                                   n = length,
                                   se = ~ sd(.x)/sqrt(length(.x)))), 
                .by = c("Cultivar", "Nitrogen"))
    

    Output:

      Cultivar Nitrogen Yield_Mean  Yield_SD Yield_n  Yield_se Protein_Mean Protein_SD Protein_n Protein_se
    1      CV1       N0         99 10.000000       3  5.773503     35.00000  10.000000         3   5.773503
    2      CV1       N1        130 13.747727       3  7.937254     44.00000  11.000000         3   6.350853
    3      CV1       N2        140 18.083141       3 10.440307     40.00000  18.083141         3  10.440307
    4      CV1       N3        138 12.529964       3  7.234178     38.00000  12.529964         3   7.234178
    5      CV2       N0         95 11.532563       3  6.658328     55.00000  35.679126         3  20.599353
    6      CV2       N1        123  5.291503       3  3.055050     41.00000  26.000000         3  15.011107
    7      CV2       N2        151  5.196152       3  3.000000     48.66667   4.725816         3   2.728451
    8      CV2       N3        164 12.124356       3  7.000000     64.00000  12.124356         3   7.000000
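
If your dplyr is older than 1.1.0, the `.by` argument is not available. A minimal sketch of the same idea using `group_by()` might look like the following. It assumes dplyr 1.0.0 or later (for `across()`), rebuilds the data from the question, and picks up every numeric column with `where(is.numeric)` instead of naming them. The helper `std_err` and the lower-case suffixes set through `.names` are illustrative choices, not part of the answer above.

```r
library(dplyr)

# Same data as in the question
dataA <- data.frame(
  Cultivar = rep(c("CV1", "CV2"), each = 12),
  Nitrogen = rep(rep(c("N0", "N1", "N2", "N3"), each = 3), 2),
  Block    = rep(c("I", "II", "III"), 8),
  Yield    = c(99, 109, 89, 115, 142, 133, 121, 157, 142, 125, 150, 139,
               82, 104, 99, 117, 125, 127, 145, 154, 154, 151, 166, 175),
  Protein  = c(25, 35, 45, 55, 44, 33, 21, 57, 42, 25, 50, 39,
               72, 14, 79, 71, 25, 27, 45, 54, 47, 51, 66, 75)
)

# Hypothetical helper (not from the answer): standard error of the mean
std_err <- function(x) sd(x) / sqrt(length(x))

dataB <- dataA %>%
  group_by(Cultivar, Nitrogen) %>%
  # where(is.numeric) selects Yield and Protein; Block is character,
  # and the grouping columns are excluded automatically
  summarize(across(where(is.numeric),
                   list(mean = mean, sd = sd, n = length, se = std_err),
                   .names = "{.col}_{.fn}"),
            .groups = "drop")

dataB
```

The result has one row per Cultivar and Nitrogen combination, with columns named like `Yield_mean` and `Protein_se`. Selecting by type is convenient when more response variables are added later, but it will also summarize any other numeric column in the data, so name the columns explicitly if that is not what you want.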