Making a long-term forecast with multiple linear regression and lagged variables in R

Hello I am interested in making an hourly electric load forecast for more than 7 months ahead. My data set includes about 5 and a half years of hourly load and temperature data. The model I am trying to implement is a multiple linear regression that includes temperature as an independent variable and month, weekday and hour as classification variables, as well as 24 variables of load lags; lag1 is the value of the electric load in the previous hour, lag2 is the value of energy load 2 hours before the current value and so on.

my_df <- read.csv(file = "https://raw.githubusercontent.com/Argiro1983/Load/main/my_df.csv", sep=";")

    
library(Hmisc)
mod_lm <- lm(LOAD ~ TEMPERATURE + MONTH + WEEKDAY + HOUR + Lag(LOAD,1) + Lag(LOAD, 2) + Lag(LOAD, 3) + Lag(LOAD, 4) + Lag(LOAD,5)+
               Lag(LOAD, 6) + Lag(LOAD, 7) + Lag(LOAD, 8) + Lag(LOAD,9) + Lag(LOAD, 10) + Lag(LOAD,11)+ Lag(LOAD, 12)+
               Lag(LOAD, 13)+ Lag(LOAD, 14) + Lag(LOAD, 15) + Lag(LOAD, 16) + Lag(LOAD, 17) + Lag(LOAD, 18)+
               Lag(LOAD, 19) + Lag(LOAD,20) + Lag(LOAD, 21) + Lag(LOAD, 22) +Lag(LOAD, 23)+ 
               Lag(LOAD,24), data=my_df)

summary(mod_lm)

The model looks like this:

Call:
lm(formula = dyn(LOAD ~ TEMPERATURE + MONTH + WEEKDAY + HOUR + 
    lag(LOAD, 1) + lag(LOAD, 2) + lag(LOAD, 3) + lag(LOAD, 4) + 
    lag(LOAD, 5) + lag(LOAD, 6) + lag(LOAD, 7) + lag(LOAD, 8) + 
    lag(LOAD, 9) + lag(LOAD, 10) + lag(LOAD, 11) + lag(LOAD, 
    12) + lag(LOAD, 13) + lag(LOAD, 14) + lag(LOAD, 15) + lag(LOAD, 
    16) + lag(LOAD, 17) + lag(LOAD, 18) + lag(LOAD, 19) + lag(LOAD, 
    20) + lag(LOAD, 21) + lag(LOAD, 22) + lag(LOAD, 23) + lag(LOAD, 
    24)), data = my_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1155.48   -76.38    -3.80    72.12  1540.34 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    96.297801   5.399303  17.835  < 2e-16 ***
TEMPERATURE     0.147311   0.087598   1.682 0.092638 .  
MONTH          -0.013815   0.186592  -0.074 0.940980    
WEEKDAY       -20.396726   0.361845 -56.369  < 2e-16 ***
HOUR            1.290159   0.171243   7.534 5.01e-14 ***
lag(LOAD, 1)    1.375390   0.004447 309.307  < 2e-16 ***
lag(LOAD, 2)   -0.666860   0.007378 -90.379  < 2e-16 ***
lag(LOAD, 3)    0.205219   0.007890  26.010  < 2e-16 ***
lag(LOAD, 4)   -0.176901   0.007905 -22.377  < 2e-16 ***
lag(LOAD, 5)    0.128568   0.007932  16.208  < 2e-16 ***
lag(LOAD, 6)   -0.028096   0.007960  -3.530 0.000417 ***
lag(LOAD, 7)   -0.058609   0.007950  -7.372 1.71e-13 ***
lag(LOAD, 8)    0.164145   0.007905  20.765  < 2e-16 ***
lag(LOAD, 9)   -0.225412   0.007868 -28.650  < 2e-16 ***
lag(LOAD, 10)   0.133046   0.007940  16.757  < 2e-16 ***
lag(LOAD, 11)   0.014815   0.007948   1.864 0.062318 .  
lag(LOAD, 12)  -0.035893   0.007951  -4.515 6.36e-06 ***
lag(LOAD, 13)   0.025532   0.007956   3.209 0.001332 ** 
lag(LOAD, 14)  -0.028748   0.007962  -3.611 0.000306 ***
lag(LOAD, 15)  -0.095531   0.007928 -12.050  < 2e-16 ***
lag(LOAD, 16)   0.227563   0.007876  28.894  < 2e-16 ***
lag(LOAD, 17)  -0.189406   0.007912 -23.939  < 2e-16 ***
lag(LOAD, 18)   0.070704   0.007947   8.897  < 2e-16 ***
lag(LOAD, 19)   0.020112   0.007954   2.528 0.011462 *  
lag(LOAD, 20)  -0.103368   0.007936 -13.025  < 2e-16 ***
lag(LOAD, 21)   0.181176   0.007901  22.931  < 2e-16 ***
lag(LOAD, 22)  -0.204949   0.007907 -25.919  < 2e-16 ***
lag(LOAD, 23)   0.533351   0.007334  72.723  < 2e-16 ***
lag(LOAD, 24)  -0.271700   0.004480 -60.654  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 131.8 on 46795 degrees of freedom
  (24 observations deleted due to missingness)
Multiple R-squared:  0.9871,    Adjusted R-squared:  0.9871 
F-statistic: 1.28e+05 on 28 and 46795 DF,  p-value: < 2.2e-16

How do I form my predict function so that it produces a forecast that has the length of my forecasted temperature table (5736 values) and takes into account "forecasted" lagged load variables? I have been having difficulties using the dyn package , for some reason the lagged variables produce zero estimates.

forecast_df <- read.csv(file = "https://raw.githubusercontent.com/Argiro1983/Load/main/forecast_df.csv", sep=";")

This clearly won't work:

pred <-predict(mod_lm, newdata = forecast_df)

Thank you in advance, any idea on the matter is appreciated.

Solution

Using my_df from the question convert it to zoo and then run dyn$lm. The original problem was that there was a datetime field that was character so zoo(my_df) converted it to a character object. If we use read.zoo and use it to convert the datetime field to POSIXct and tell it that that column is the index -- read.zoo assumes that the first column is the index unless specified otherwise -- then it works. Also note that there are date times in the data that conflict with daylight savings time so the time zone must be specified as UTC.

Next follow the code at the end of ?dyn to get predictions. Since this can be very slow with large amounts of data we have just shown 3 predictions below.

As per my comment be sure that dplyr is not loaded since it clobbers lag.

library(dyn)
    
z <- read.zoo(my_df, format = "%d/%m/%Y %H:%M", tz = "UTC")
fo <- LOAD ~ TEMPERATURE + MONTH + WEEKDAY + HOUR + lag(LOAD, -(1:24))
fm <- dyn$lm(fo, z)

fc.orig <- read.zoo(forecast_df, format = "%d/%m/%Y %H:%M", tz = "UTC")

# fc <- fc.orig
fc <- head(fc.orig, 3)

LOAD0 <- zoo(cbind(LOAD = 0), time(fc))
both <- rbind(z, cbind(LOAD0, fc))

for(i in seq(nrow(z) + 1, nrow(both))) {
   fit <- dyn$lm(fo, both, subset = seq_len(i-1))
   both[i, "LOAD"] <- tail(predict(fit, both[1:i, ]), 1)
}

# extract the forecast rows    
fc_new <- both[seq(nrow(z) + 1, nrow(both)), ]

Note

my_df <- read.csv(file = 
"https://raw.githubusercontent.com/Argiro1983/Load/main/my_df.csv", sep=";")
forecast_df <- read.csv(file = "https://raw.githubusercontent.com/Argiro1983/Load/main/forecast_df.csv", sep=";")