raveragerolling-average

How to produce a year over year calculated column in R


First, is the data and then the manipulations. Finally, is the current method that I am using and as of yet is producing no data. The manipulations are to create a date and then create a rolling 12-Month average.

   Monthavg<- 
   c(20185,20186,20187,20188,20189,201810,201811,201812,20191,20192,20193,20194,20195,20196,
      20197,20198,20199,201910,201911,201912,20201
      ,20202,20203,20204,20205,20206,20207
      ,20208,20209,202010,202011)

  empavg<-c(2,4,6,7,8,10,12,14,16,18,20,22,24,26,28,30,32,36,36,38,40,42,44,46,48,48,50,52,52,54,56)

  ces12f <- data.frame(Monthavg,empavg)

Manipulations

 ces12f<- ces12f %>% mutate(year = substr(as.character(Monthavg),1,4),
              month = substr(as.character(Monthavg),5,7),
              date = as.Date(paste(year,month,"1",sep ="-")))
 Month_ord <- order(Monthavg)
 span_month=12
 ces12f<-ces12f %>% mutate(ravg = zoo::rollmeanr(empavg, 12, fill = NA))

Annual difference attempt

 ces12f<- ces12f%>%
 group_by(Monthavg)%>%
 mutate(PreviousYear=lag(ravg,12), 
     PreviousMonth=lag(ravg),
     AnnualDifference=ravg-PreviousYear)%>%
 ungroup()

The end goal would be that 202011 minus 201911 or 47.5 minus 25.17 or 22.3. The method that I use above produces nothing but NA's. Any insights as to how I can modify my existing code or simply use an entirely different method would be greatly appreciated.


Solution

  • I tend to be a little more paranoid. That is, if there is even a slight chance that we are missing one month of however many years we have, than doing a lag(..., 12) is a bad idea, even worse because you will get no warnings or errors, and your data will be wrong.

    As such, I'm going to recommend a self-join.

    transmute(ces12f, year = as.character(as.integer(year) + 1L), month, lastravg = ravg) %>%
      left_join(ces12f, ., by = c("year", "month"))
    #    Monthavg empavg year month       date     ravg lastravg
    # 1     20185      2 2018     5 2018-05-01       NA       NA
    # 2     20186      4 2018     6 2018-06-01       NA       NA
    # 3     20187      6 2018     7 2018-07-01       NA       NA
    # 4     20188      7 2018     8 2018-08-01       NA       NA
    # 5     20189      8 2018     9 2018-09-01       NA       NA
    # 6    201810     10 2018    10 2018-10-01       NA       NA
    # 7    201811     12 2018    11 2018-11-01       NA       NA
    # 8    201812     14 2018    12 2018-12-01       NA       NA
    # 9     20191     16 2019     1 2019-01-01       NA       NA
    # 10    20192     18 2019     2 2019-02-01       NA       NA
    # 11    20193     20 2019     3 2019-03-01       NA       NA
    # 12    20194     22 2019     4 2019-04-01 11.58333       NA
    # 13    20195     24 2019     5 2019-05-01 13.41667       NA
    # 14    20196     26 2019     6 2019-06-01 15.25000       NA
    # 15    20197     28 2019     7 2019-07-01 17.08333       NA
    # 16    20198     30 2019     8 2019-08-01 19.00000       NA
    # 17    20199     32 2019     9 2019-09-01 21.00000       NA
    # 18   201910     36 2019    10 2019-10-01 23.16667       NA
    # 19   201911     36 2019    11 2019-11-01 25.16667       NA
    # 20   201912     38 2019    12 2019-12-01 27.16667       NA
    # 21    20201     40 2020     1 2020-01-01 29.16667       NA
    # 22    20202     42 2020     2 2020-02-01 31.16667       NA
    # 23    20203     44 2020     3 2020-03-01 33.16667       NA
    # 24    20204     46 2020     4 2020-04-01 35.16667 11.58333
    # 25    20205     48 2020     5 2020-05-01 37.16667 13.41667
    # 26    20206     48 2020     6 2020-06-01 39.00000 15.25000
    # 27    20207     50 2020     7 2020-07-01 40.83333 17.08333
    # 28    20208     52 2020     8 2020-08-01 42.66667 19.00000
    # 29    20209     52 2020     9 2020-09-01 44.33333 21.00000
    # 30   202010     54 2020    10 2020-10-01 45.83333 23.16667
    # 31   202011     56 2020    11 2020-11-01 47.50000 25.16667
    

    You can verify that each lastempavg is the previous year's value, and you can mutate the difference normally, perhaps

    transmute(ces12f, year = as.character(as.integer(year) + 1L), month, lastravg = ravg) %>%
      left_join(ces12f, ., by = c("year", "month")) %>%
      mutate(AnnualDifference = ravg - lastravg)
    #    Monthavg empavg year month       date     ravg lastravg AnnualDifference
    # 1     20185      2 2018     5 2018-05-01       NA       NA               NA
    # 2     20186      4 2018     6 2018-06-01       NA       NA               NA
    # 3     20187      6 2018     7 2018-07-01       NA       NA               NA
    # 4     20188      7 2018     8 2018-08-01       NA       NA               NA
    # 5     20189      8 2018     9 2018-09-01       NA       NA               NA
    # 6    201810     10 2018    10 2018-10-01       NA       NA               NA
    # 7    201811     12 2018    11 2018-11-01       NA       NA               NA
    # 8    201812     14 2018    12 2018-12-01       NA       NA               NA
    # 9     20191     16 2019     1 2019-01-01       NA       NA               NA
    # 10    20192     18 2019     2 2019-02-01       NA       NA               NA
    # 11    20193     20 2019     3 2019-03-01       NA       NA               NA
    # 12    20194     22 2019     4 2019-04-01 11.58333       NA               NA
    # 13    20195     24 2019     5 2019-05-01 13.41667       NA               NA
    # 14    20196     26 2019     6 2019-06-01 15.25000       NA               NA
    # 15    20197     28 2019     7 2019-07-01 17.08333       NA               NA
    # 16    20198     30 2019     8 2019-08-01 19.00000       NA               NA
    # 17    20199     32 2019     9 2019-09-01 21.00000       NA               NA
    # 18   201910     36 2019    10 2019-10-01 23.16667       NA               NA
    # 19   201911     36 2019    11 2019-11-01 25.16667       NA               NA
    # 20   201912     38 2019    12 2019-12-01 27.16667       NA               NA
    # 21    20201     40 2020     1 2020-01-01 29.16667       NA               NA
    # 22    20202     42 2020     2 2020-02-01 31.16667       NA               NA
    # 23    20203     44 2020     3 2020-03-01 33.16667       NA               NA
    # 24    20204     46 2020     4 2020-04-01 35.16667 11.58333         23.58333
    # 25    20205     48 2020     5 2020-05-01 37.16667 13.41667         23.75000
    # 26    20206     48 2020     6 2020-06-01 39.00000 15.25000         23.75000
    # 27    20207     50 2020     7 2020-07-01 40.83333 17.08333         23.75000
    # 28    20208     52 2020     8 2020-08-01 42.66667 19.00000         23.66667
    # 29    20209     52 2020     9 2020-09-01 44.33333 21.00000         23.33333
    # 30   202010     54 2020    10 2020-10-01 45.83333 23.16667         22.66667
    # 31   202011     56 2020    11 2020-11-01 47.50000 25.16667         22.33333
    

    Side note on this: it might be better to keep the year and month stored as integer, for a few reasons: (1) it makes this kind of thing quite easy; (2) it preserves ordinality, whereas arrange(ces12f, month) will happily order the months as 1, 10, 11, 12, 2, etc; (3) (subjective) they really are integers, after all.