python, machine-learning, keras, lstm, overfitting-underfitting

Can the validation data be used in model.fit for prediction?


I am trying to build an LSTM model to forecast stock prices. I have split the data into training and test sets, and I am passing the test set to model.fit() as validation_data. After training, I pass the same test set to model.predict() to generate the forecasts.

I am wondering: since I use the test data in model.fit(), would overfitting occur, given that the same data is then used to generate the forecasts?

Should I instead split the raw data into 3 sets: training, validation and test? The validation data would be used in model.fit(), whilst the test data would be used in model.predict().

Sample code:

from keras.models import Sequential
from keras.layers import LSTM, Dense

model_lstm = Sequential()
model_lstm.add(LSTM(units=50, return_sequences=True, input_shape=(X_train.shape[1], X_train.shape[2])))
model_lstm.add(LSTM(units=50, return_sequences=True))
model_lstm.add(LSTM(units=50, return_sequences=True))
model_lstm.add(LSTM(units=50))
model_lstm.add(Dense(units=1, activation='relu'))
model_lstm.compile(loss='mse', optimizer='adam')
model_lstm.summary()

history_lstm = model_lstm.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=32, shuffle=False)

Solution

  • Usually, you would split the data into 3 sets:

    1. train set: used to train the model
    2. validation set: used for frequent evaluation of the model and for fine-tuning hyper-parameters. It must NOT be used for training, so that the evaluation stays as unbiased as possible.
    3. test set: final held-out set, used only for the final evaluation of the model.

    As indicated by the name of the argument (validation_data), you are supposed to put the validation set there.
    As you suspected, letting the model "validate" its hyper-parameters against the test set could lead to overfitting the test set, which biases the final forecasts.

    As for the ratio: the more hyper-parameters your model has, the bigger the validation set should be. Also, look into "cross-validation": it helps when the train set is too small to carve out a large validation set without hurting performance.
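To make the 3-way split concrete, here is a minimal sketch using made-up array shapes and an arbitrary 70/15/15 ratio (both are assumptions, not part of the question). Because this is time-series data, the split is chronological rather than shuffled; the fit/predict calls are shown as comments so the example stays self-contained:

```python
import numpy as np

# Hypothetical toy data: 1000 chronological samples, 20 timesteps, 1 feature
X = np.random.rand(1000, 20, 1)
y = np.random.rand(1000)

# Chronological split (no shuffling): first 70% train, next 15% val, last 15% test
n = len(X)
i_val, i_test = int(n * 0.70), int(n * 0.85)
X_train, y_train = X[:i_val], y[:i_val]
X_val, y_val = X[i_val:i_test], y[i_val:i_test]
X_test, y_test = X[i_test:], y[i_test:]

# The validation set is what goes into model.fit():
#   model_lstm.fit(X_train, y_train, validation_data=(X_val, y_val),
#                  epochs=10, batch_size=32, shuffle=False)
# The test set is held back until the very end:
#   forecasts = model_lstm.predict(X_test)
```

This way the test set never influences training or hyper-parameter choices, so the forecasts generated from it give an unbiased estimate of performance.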