rdata.tabletidyverselubridateplotrix

Calculate daily parameters from a dataframe with hourly-values in rows and with several columns of interest


The dataframe df1 summarizes water temperature at different depths (T5m,T15m,T25m,T35m) for every hour (Datetime). As an example of dataframe:

df1<- data.frame(Datetime=c("2016-08-12 12:00:00","2016-08-12 13:00:00","2016-08-12 14:00:00","2016-08-12 15:00:00","2016-08-13 12:00:00","2016-08-13 13:00:00","2016-08-13 14:00:00","2016-08-13 15:00:00"),
                    T5m= c(10,20,20,10,10,20,20,10),
                    T15m=c(10,20,10,20,10,20,10,20),
                    T25m=c(20,20,20,30,20,20,20,30),
                    T35m=c(20,20,10,10,20,20,10,10))
df1$Datetime<- as.POSIXct(df1$Datetime, format="%Y-%m-%d %H")
df1

             Datetime T5m T15m T25m T35m
1 2016-08-12 12:00:00  10   10   20   20
2 2016-08-12 13:00:00  20   20   20   20
3 2016-08-12 14:00:00  20   10   20   10
4 2016-08-12 15:00:00  10   20   30   10
5 2016-08-13 12:00:00  10   10   20   20
6 2016-08-13 13:00:00  20   20   20   20
7 2016-08-13 14:00:00  20   10   20   10
8 2016-08-13 15:00:00  10   20   30   10

I would like to create a new dataframe df2 in which I have the average water temperature per day for either each depth interval and for the whole water column and the standard error estimation. I would expect something like this (I did the calculations by hand so there might be some mistakes):

> df2
        Date meanT5m meanT15m meanT25m meanT35m meanTotal seT5m seT15m seT25m seT35m seTotal
1 2016-08-12      15       15     22.5       15    16.875  2.88   2.88    2.5   2.88    1.29
2 2016-08-13      15       15     22.5       15    16.875  2.88   2.88    2.5   2.88    1.29

I am especially interested in knowing how to do it with data.table since I will work with huge data.frames and I think data.table is quite efficient.

For calculating the standard error I know the function std.error() from the package plotrix.


Solution

  • Update based on @chinsoon's comment


    1. First transform your data frame into a data table:

      library(data.table)
      setDT(df1)
      
    2. Create a total column:

      df1[, total := rowSums(.SD), .SDcols = grep("T[0-9]+m", names(df1))][]
      #               Datetime T5m T15m T25m T35m total
      # 1: 2016-08-12 12:00:00  10   10   20   20    60
      # 2: 2016-08-12 13:00:00  20   20   20   20    80
      # 3: 2016-08-12 14:00:00  20   10   20   10    60
      # 4: 2016-08-12 15:00:00  10   20   30   10    70
      # 5: 2016-08-13 12:00:00  10   10   20   20    60
      # 6: 2016-08-13 13:00:00  20   20   20   20    80
      # 7: 2016-08-13 14:00:00  20   10   20   10    60
      # 8: 2016-08-13 15:00:00  10   20   30   10    70
      
    3. Apply the functions per day:

      library(lubridate)
      (df3 <- df1[, as.list(unlist(lapply(.SD, function (x)
                   c(mean = mean(x), sem = sd(x) / sqrt(length(x)))))), 
             day(Datetime)])
      #    day T5m.mean  T5m.sem T15m.mean T15m.sem T25m.mean T25m.sem T35m.mean
      # 1:  12       15 2.886751        15 2.886751      22.5      2.5        15
      # 2:  13       15 2.886751        15 2.886751      22.5      2.5        15
      #    T35m.sem total.mean total.sem
      # 1: 2.886751       67.5  4.787136
      # 2: 2.886751       67.5  4.787136