Hello I am interested in making an hourly electric load forecast for more than 7 months ahead. My data set includes about 5 and a half years of hourly load and temperature data. The model I am trying to implement is a multiple linear regression that includes temperature as an independent variable and month, weekday and hour as classification variables, as well as 24 variables of load lags; lag1 is the value of the electric load in the previous hour, lag2 is the value of energy load 2 hours before the current value and so on.
my_df <- read.csv(file = "https://raw.githubusercontent.com/Argiro1983/Load/main/my_df.csv", sep=";")
library(Hmisc)
mod_lm <- lm(LOAD ~ TEMPERATURE + MONTH + WEEKDAY + HOUR + Lag(LOAD,1) + Lag(LOAD, 2) + Lag(LOAD, 3) + Lag(LOAD, 4) + Lag(LOAD,5)+
Lag(LOAD, 6) + Lag(LOAD, 7) + Lag(LOAD, 8) + Lag(LOAD,9) + Lag(LOAD, 10) + Lag(LOAD,11)+ Lag(LOAD, 12)+
Lag(LOAD, 13)+ Lag(LOAD, 14) + Lag(LOAD, 15) + Lag(LOAD, 16) + Lag(LOAD, 17) + Lag(LOAD, 18)+
Lag(LOAD, 19) + Lag(LOAD,20) + Lag(LOAD, 21) + Lag(LOAD, 22) +Lag(LOAD, 23)+
Lag(LOAD,24), data=my_df)
summary(mod_lm)
The model looks like this:
Call:
lm(formula = dyn(LOAD ~ TEMPERATURE + MONTH + WEEKDAY + HOUR +
lag(LOAD, 1) + lag(LOAD, 2) + lag(LOAD, 3) + lag(LOAD, 4) +
lag(LOAD, 5) + lag(LOAD, 6) + lag(LOAD, 7) + lag(LOAD, 8) +
lag(LOAD, 9) + lag(LOAD, 10) + lag(LOAD, 11) + lag(LOAD,
12) + lag(LOAD, 13) + lag(LOAD, 14) + lag(LOAD, 15) + lag(LOAD,
16) + lag(LOAD, 17) + lag(LOAD, 18) + lag(LOAD, 19) + lag(LOAD,
20) + lag(LOAD, 21) + lag(LOAD, 22) + lag(LOAD, 23) + lag(LOAD,
24)), data = my_df)
Residuals:
Min 1Q Median 3Q Max
-1155.48 -76.38 -3.80 72.12 1540.34
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 96.297801 5.399303 17.835 < 2e-16 ***
TEMPERATURE 0.147311 0.087598 1.682 0.092638 .
MONTH -0.013815 0.186592 -0.074 0.940980
WEEKDAY -20.396726 0.361845 -56.369 < 2e-16 ***
HOUR 1.290159 0.171243 7.534 5.01e-14 ***
lag(LOAD, 1) 1.375390 0.004447 309.307 < 2e-16 ***
lag(LOAD, 2) -0.666860 0.007378 -90.379 < 2e-16 ***
lag(LOAD, 3) 0.205219 0.007890 26.010 < 2e-16 ***
lag(LOAD, 4) -0.176901 0.007905 -22.377 < 2e-16 ***
lag(LOAD, 5) 0.128568 0.007932 16.208 < 2e-16 ***
lag(LOAD, 6) -0.028096 0.007960 -3.530 0.000417 ***
lag(LOAD, 7) -0.058609 0.007950 -7.372 1.71e-13 ***
lag(LOAD, 8) 0.164145 0.007905 20.765 < 2e-16 ***
lag(LOAD, 9) -0.225412 0.007868 -28.650 < 2e-16 ***
lag(LOAD, 10) 0.133046 0.007940 16.757 < 2e-16 ***
lag(LOAD, 11) 0.014815 0.007948 1.864 0.062318 .
lag(LOAD, 12) -0.035893 0.007951 -4.515 6.36e-06 ***
lag(LOAD, 13) 0.025532 0.007956 3.209 0.001332 **
lag(LOAD, 14) -0.028748 0.007962 -3.611 0.000306 ***
lag(LOAD, 15) -0.095531 0.007928 -12.050 < 2e-16 ***
lag(LOAD, 16) 0.227563 0.007876 28.894 < 2e-16 ***
lag(LOAD, 17) -0.189406 0.007912 -23.939 < 2e-16 ***
lag(LOAD, 18) 0.070704 0.007947 8.897 < 2e-16 ***
lag(LOAD, 19) 0.020112 0.007954 2.528 0.011462 *
lag(LOAD, 20) -0.103368 0.007936 -13.025 < 2e-16 ***
lag(LOAD, 21) 0.181176 0.007901 22.931 < 2e-16 ***
lag(LOAD, 22) -0.204949 0.007907 -25.919 < 2e-16 ***
lag(LOAD, 23) 0.533351 0.007334 72.723 < 2e-16 ***
lag(LOAD, 24) -0.271700 0.004480 -60.654 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 131.8 on 46795 degrees of freedom
(24 observations deleted due to missingness)
Multiple R-squared: 0.9871, Adjusted R-squared: 0.9871
F-statistic: 1.28e+05 on 28 and 46795 DF, p-value: < 2.2e-16
How do I form my predict
function so that it produces a forecast that has the length of my forecasted temperature table (5736 values) and takes into account "forecasted" lagged load variables? I have been having difficulties using the dyn package , for some reason the lagged variables produce zero estimates.
forecast_df <- read.csv(file = "https://raw.githubusercontent.com/Argiro1983/Load/main/forecast_df.csv", sep=";")
This clearly won't work:
pred <-predict(mod_lm, newdata = forecast_df)
Thank you in advance, any idea on the matter is appreciated.
Using my_df from the question convert it to zoo and then run dyn$lm. The original problem was that there was a datetime field that was character so zoo(my_df) converted it to a character object. If we use read.zoo and use it to convert the datetime field to POSIXct and tell it that that column is the index -- read.zoo assumes that the first column is the index unless specified otherwise -- then it works. Also note that there are date times in the data that conflict with daylight savings time so the time zone must be specified as UTC.
Next follow the code at the end of ?dyn
to get predictions. Since this can be very slow with large amounts of data we have just shown 3 predictions below.
As per my comment be sure that dplyr is not loaded since it clobbers lag.
library(dyn)
z <- read.zoo(my_df, format = "%d/%m/%Y %H:%M", tz = "UTC")
fo <- LOAD ~ TEMPERATURE + MONTH + WEEKDAY + HOUR + lag(LOAD, -(1:24))
fm <- dyn$lm(fo, z)
fc.orig <- read.zoo(forecast_df, format = "%d/%m/%Y %H:%M", tz = "UTC")
# fc <- fc.orig
fc <- head(fc.orig, 3)
LOAD0 <- zoo(cbind(LOAD = 0), time(fc))
both <- rbind(z, cbind(LOAD0, fc))
for(i in seq(nrow(z) + 1, nrow(both))) {
fit <- dyn$lm(fo, both, subset = seq_len(i-1))
both[i, "LOAD"] <- tail(predict(fit, both[1:i, ]), 1)
}
# extract the forecast rows
fc_new <- both[seq(nrow(z) + 1, nrow(both)), ]
my_df <- read.csv(file =
"https://raw.githubusercontent.com/Argiro1983/Load/main/my_df.csv", sep=";")
forecast_df <- read.csv(file = "https://raw.githubusercontent.com/Argiro1983/Load/main/forecast_df.csv", sep=";")