Suppose I create a sequential-input LSTM in TensorFlow along the lines of:
import numpy as np

def Sequential_Input_LSTM(df, input_sequence):
    df_np = df.to_numpy()
    X = []
    y = []
    for i in range(len(df_np) - input_sequence):
        row = [a for a in df_np[i:i + input_sequence]]
        X.append(row)
        label = df_np[i + input_sequence]
        y.append(label)
    return np.array(X), np.array(y)

X, y = Sequential_Input_LSTM(df_data, 10)  # pandas DataFrame df_data contains our data
In this example, I slice my data into X (input windows) and y (labels) such that, e.g., the first 10 values (the sequence length) form the first X row and the 11th value is the first y. Then the window of 10 values moves one step to the right (one timestep further), those 10 values become the next X row, and the value immediately after that second window becomes the next y, and so on.
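To make the windowing concrete, here is a minimal NumPy sketch of the same sliding-window idea, using a toy series and a window of 3 instead of 10 purely for illustration:

```python
import numpy as np

# Toy series 0..9 with a window of 3, mirroring the function above.
data = np.arange(10)
window = 3

X = np.array([data[i:i + window] for i in range(len(data) - window)])
y = np.array([data[i + window] for i in range(len(data) - window)])

print(X[0], y[0])         # [0 1 2] 3
print(X[1], y[1])         # [1 2 3] 4
print(X.shape, y.shape)   # (7, 3) (7,)
```

Note that X[i] is always the window that ends right before y[i], so the two arrays share the same index from the start.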
Then suppose I take a part of X as my X_test and use an LSTM model to make a time-series prediction, like predictions = model.predict(X_test).
When I actually tried this and plotted the results from predict(X_test), the y array and the prediction results appear to be synchronized without further adjustment. I expected that I would have to shift the prediction array manually 10 timesteps to the right when plotting it together with the labels, since I cannot explain where the first 10 timesteps of the prediction come from.
Where do the predictions for the first 10 timesteps of X_test come from, seeing as the model has not received 10 input sequence values yet? Does TensorFlow use the last timesteps in X_test to create the predictions of the first 10 values, or are the predictions at the beginning just pure guesses?
If I understand it correctly, the problem is that the first 10 windows of X_test use the last 10 values of X (or more precisely, of X_train) for the prediction. With a big enough X_test this does not make much difference, but it is, strictly speaking, data leakage from the training set into the test set.
I demonstrate it with a small example (correct me if I'm wrong):
df_data = [0, 1, 2, .., 15] # len 16
window_size = 3
X = [[0,1,2], [1,2,3], [2,3,4], ..., [12,13,14]] # len 13
y = [3, 4, 5, .., 15] # len 13
# split the data 10-3 for train-test
X_train = [[0,1,2], [1,2,3], [2,3,4], ..., [9,10,11]]
y_train = [3, 4, 5, .., 12]
X_test = [[10,11,12], [11,12,13], [12,13,14]]
y_test = [13, 14, 15]
The problem in this example is that the values 10 and 11 are used in sequences for both X_train and X_test. So you have to split df_data into train/test first (without shuffling) and then do the windowing separately for each part. With this, you lose the first window_size values for y in both train and test.
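The overlap in the leaky variant is easy to check in code. The following sketch windows the full series first and then splits 10–3, as in the example above:

```python
import numpy as np

data = np.arange(16)   # [0, 1, ..., 15]
window = 3

# Window the FULL series first (the leaky way), then split 10-3.
X = np.array([data[i:i + window] for i in range(len(data) - window)])
X_train, X_test = X[:10], X[10:]

# Raw values 10 and 11 from the training windows reappear in test windows:
overlap = sorted(int(v) for v in set(X_train.ravel()) & set(X_test.ravel()))
print(overlap)  # [10, 11]
```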
Edit: To clarify how the actual split would look for a somewhat smaller X_train (the X_train from above is too big to leave leak-free predictions for X_test):
df_data = [0, 1, 2, .., 15] # len 16
window_size = 3
train = [0, 1, 2, .., 10]
test = [11, 12, .., 15]
# make windows
X_train = [[0,1,2], [1,2,3], [2,3,4], ..., [7,8,9]]
y_train = [3, 4, 5, .., 10]
X_test = [[11,12,13], [12,13,14]]
y_test = [14,15]
Here, no value that is present in X_train or y_train appears in either X_test or y_test. The first 3 values in both train and test cannot be predicted and are only used for windowing.
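The leak-free procedure above can be sketched in code; make_windows is a hypothetical helper name for the sliding-window step, not part of the original question:

```python
import numpy as np

def make_windows(series, window_size):
    """Slide a window over a 1-D array and return (X, y) pairs."""
    X = np.array([series[i:i + window_size]
                  for i in range(len(series) - window_size)])
    y = np.array([series[i + window_size]
                  for i in range(len(series) - window_size)])
    return X, y

data = np.arange(16)   # [0, 1, ..., 15]
window_size = 3
split = 11             # first 11 values for training, rest for test

# Split FIRST (no shuffling, order matters), then window each part.
train, test = data[:split], data[split:]
X_train, y_train = make_windows(train, window_size)
X_test, y_test = make_windows(test, window_size)

print(X_train[-1], y_train[-1])  # [7 8 9] 10
print(X_test, y_test)
# No raw value appears in both the train and the test part:
assert not set(train) & set(test)
```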