Suppose I create a sequential-input LSTM in TensorFlow along the lines of:
import numpy as np

def Sequential_Input_LSTM(df, input_sequence):
    df_np = df.to_numpy()
    X = []
    y = []
    for i in range(len(df_np) - input_sequence):
        row = [a for a in df_np[i:i + input_sequence]]
        X.append(row)
        label = df_np[i + input_sequence]
        y.append(label)
    return np.array(X), np.array(y)

X, y = Sequential_Input_LSTM(df_data, 10)  # pandas DataFrame df_data contains our data
In this example, I slice my data into X (input windows) and y (labels) such that, e.g., the first 10 values (the sequence length) form the first X row and the 11th value is the first y. Then the window of 10 values moves one step to the right (one timestep further), those 10 values become the next X row, and the value immediately after that second window becomes the next y, and so on.
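To make the windowing concrete, here is a minimal NumPy sketch of the same sliding-window idea, using a toy series and a window of 3 instead of 10 purely for illustration:

```python
import numpy as np

# Toy series 0..9 with a window of 3, mirroring the function above.
data = np.arange(10)
window = 3

X = np.array([data[i:i + window] for i in range(len(data) - window)])
y = np.array([data[i + window] for i in range(len(data) - window)])

print(X[0], y[0])         # [0 1 2] 3
print(X[1], y[1])         # [1 2 3] 4
print(X.shape, y.shape)   # (7, 3) (7,)
```

Note that X[i] is always the window that ends right before y[i], so the two arrays share the same index from the start.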
Then suppose I take a part of X as my X_test and use an LSTM model to make a time-series prediction, like predictions = model.predict(X_test).
When I actually tried this and plotted the results from predict(X_test), the y array and the prediction results appear to be synchronized without further adjustment. I expected that I would have to shift the prediction array manually 10 timesteps to the right when plotting it together with the labels, since I cannot explain where the first 10 timesteps of the prediction come from.
Where do the predictions for the first 10 timesteps of X_test come from, seeing as the model has not received 10 input sequence values yet? Does TensorFlow use the last timesteps in X_test to create the predictions of the first 10 values, or are the predictions at the beginning just pure guesses?
If I understand it correctly, the problem is that the first 10 windows of X_test use the last 10 values of X (or more precisely, of X_train) for the prediction. With a big enough X_test this does not make much difference, but it is, strictly speaking, data leakage from the training set into the test set.
I demonstrate it with a small example (correct me if I'm wrong):
df_data = [0, 1, 2, .., 15] # len 16
window_size = 3
X = [[0,1,2], [1,2,3], [2,3,4], ..., [12,13,14]] # len 13
y = [3, 4, 5, .., 15] # len 13
# split the data 10-3 for train-test
X_train = [[0,1,2], [1,2,3], [2,3,4], ..., [9,10,11]]
y_train = [3, 4, 5, .., 12]
X_test = [[10,11,12], [11,12,13], [12,13,14]]
y_test = [13, 14, 15]
The problem in this example is that the values 10 and 11 are used in sequences for both X_train and X_test. So you have to split df_data into train/test first (without shuffling) and then do the windowing separately for each part. With this, you lose the first window_size values for y in both train and test.
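The overlap in the leaky variant is easy to check in code. The following sketch windows the full series first and then splits 10–3, as in the example above:

```python
import numpy as np

data = np.arange(16)   # [0, 1, ..., 15]
window = 3

# Window the FULL series first (the leaky way), then split 10-3.
X = np.array([data[i:i + window] for i in range(len(data) - window)])
X_train, X_test = X[:10], X[10:]

# Raw values 10 and 11 from the training windows reappear in test windows:
overlap = sorted(int(v) for v in set(X_train.ravel()) & set(X_test.ravel()))
print(overlap)  # [10, 11]
```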
Edit: To clarify how the actual split would look for a somewhat smaller X_train (the X_train from above is too big to leave leak-free predictions for X_test):
df_data = [0, 1, 2, .., 15] # len 16
window_size = 3
train = [0, 1, 2, .., 10]
test = [11, 12, .., 15]
# make windows
X_train = [[0,1,2], [1,2,3], [2,3,4], ..., [7,8,9]]
y_train = [3, 4, 5, .., 10]
X_test = [[11,12,13], [12,13,14]]
y_test = [14,15]
Here, no value that is present in X_train or y_train appears in either X_test or y_test. The first 3 values in both train and test cannot be predicted and are only used for windowing.
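The leak-free procedure above can be sketched in code; make_windows is a hypothetical helper name for the sliding-window step, not part of the original question:

```python
import numpy as np

def make_windows(series, window_size):
    """Slide a window over a 1-D array and return (X, y) pairs."""
    X = np.array([series[i:i + window_size]
                  for i in range(len(series) - window_size)])
    y = np.array([series[i + window_size]
                  for i in range(len(series) - window_size)])
    return X, y

data = np.arange(16)   # [0, 1, ..., 15]
window_size = 3
split = 11             # first 11 values for training, rest for test

# Split FIRST (no shuffling, order matters), then window each part.
train, test = data[:split], data[split:]
X_train, y_train = make_windows(train, window_size)
X_test, y_test = make_windows(test, window_size)

print(X_train[-1], y_train[-1])  # [7 8 9] 10
print(X_test, y_test)
# No raw value appears in both the train and the test part:
assert not set(train) & set(test)
```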