Tags: python, numpy, tensorflow, keras, regression

Why do model.evaluate() and a manual loss computation with model.predict() give different results in tf.keras?


I use Keras and TensorFlow to train a 'simple' multilayer perceptron (MLP) for a regression task, with the mean squared error (MSE) as the loss function. I denote my training data as x_train, y_train and my test data as x_test, y_test. I noticed the following: For A and B defined as follows:

  1. A = model.evaluate(x_test, y_test) and
  2. B = loss(pred_test, y_test), where pred_test = model.predict(x_test) are the out-of-sample predictions obtained from my model,

the values of A and B are (slightly) different. My question is where the difference comes from and what I can do so that the values coincide. Below I give a minimal reproducible example in which I tried to find the answer myself (without success). My first suspicion was that this is caused by the batch-wise computation, but after some experimentation with the batch sizes this does not seem to be the case. There are related questions on this website, but the answer to this question about the same(?) problem seems to be specific to CNNs. The discussion in this post asserts that the difference is caused by the batch-wise evaluation in model.evaluate, but 1.) I really do not see how the choice of the batch size should affect the result, since in the end the average is built anyway, and 2.) even when setting the batch size to the number of samples, the results are still different. This is even the case in the answer to the aforementioned post. Last, there is this thread, where the problem seems to be caused by the metric actually depending on the batch size. However, this is not the case for the MSE!
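Point 1.) can be checked in isolation with a small NumPy sketch. The per-sample squared errors below are simulated stand-ins (not taken from the model), and the batch splits are chosen only for illustration. With equal-size batches the mean of the batch means equals the overall mean; with a smaller last batch this only holds if the batch means are weighted by batch size.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-sample squared errors, standing in for (pred - y)**2
sq_err = rng.normal(size=100) ** 2

overall = sq_err.mean()

# Equal-size batches (4 x 25): mean of batch means equals the overall mean
equal = np.mean([b.mean() for b in np.split(sq_err, 4)])

# Unequal batches (32, 32, 32, 4): weight each batch mean by its size
batches = np.split(sq_err, [32, 64, 96])
batch_means = [b.mean() for b in batches]
unweighted = np.mean(batch_means)
weighted = np.average(batch_means, weights=[len(b) for b in batches])

print("overall:   ", overall)
print("equal:     ", equal)
print("weighted:  ", weighted)
print("unweighted:", unweighted)
```

So batching by itself should not change the MSE as long as the aggregation accounts for the batch sizes.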

Here is the minimal example where I train a regression function on simulations:

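import tensorflow as tf
import numpy as np

# Seed NumPy (used for the simulated data) and TensorFlow (weight initialization);
# seeding Python's built-in random module does not affect np.random
np.random.seed(10)
tf.random.set_seed(10)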

x = np.random.normal([0, 1, 2], [2,1,4], (200, 3))
y = x[:,0] + 0.01 * np.power(x[:,1], 2) + np.sqrt(np.abs(x[:,2] - 3)) + np.random.normal(0, 1, (200))
y = y[:,np.newaxis]

x_train = x[0:100,:]
y_train = y[0:100,:]
x_test = x[100:200,:]  # start at 100 so that no sample is skipped
y_test = y[100:200,:]

# MSE
def MSE(a,b):
    return tf.reduce_mean(tf.pow(a - b, 2))

# layers
Inputs_MLP = tf.keras.Input(batch_shape = (100,3), dtype = tf.float32)
Layer1_MLP = tf.keras.layers.Dense(16)(Inputs_MLP)
Outputs_MLP = tf.keras.layers.Dense(1)(Layer1_MLP)

# keras model
model_MLP = tf.keras.Model(Inputs_MLP, Outputs_MLP)
model_MLP.compile(loss = MSE)
history = model_MLP.fit(x = x_train, y = y_train, epochs=5, batch_size = 25)

# evaluation

# out-of-sample
model_MLP.evaluate(x_test, y_test, 100)
# 5.561294078826904
pred_MLP_test = model_MLP.predict(x_test, batch_size = 100)
MSE(pred_MLP_test, y_test)
# <tf.Tensor: shape=(), dtype=float64, numpy=5.561294010797092>

# in-sample
model_MLP.evaluate(x_train, y_train, 100)
# 5.460160732269287
pred_MLP_train = model_MLP.predict(x_train, batch_size = 100)
MSE(pred_MLP_train, y_train)
# <tf.Tensor: shape=(), dtype=float64, numpy=5.46016054713104>

The out-of-sample evaluation yields 5.561294078826904 with model.evaluate, but 5.561294010797092 with the manual computation. In this example the difference is small, but it still bugs me. Also, for another (longer and more complicated) example the difference is bigger. I would appreciate any help!


Solution

  • Keras operates on float32 by default, and that is the precision you see in the value returned by model.evaluate(). When you compute the MSE with your custom function, however, the computation is done in float64: your y is float64, so subtracting it from the float32 predictions promotes the result to float64.

    You'll see the same values if you cast y to float32, something like this:

    # out-of-sample
    eval_loss = model_MLP.evaluate(x_test, y_test, batch_size=100)
    print(f"model.evaluate (test): {eval_loss}")
    
    pred_MLP_test = model_MLP.predict(x_test, batch_size=100)
    
    manual_mse_f64 = MSE(pred_MLP_test, y_test)
    print(f"Manual MSE (preds:f32, y:f64): {manual_mse_f64}")
    
    manual_mse_f32 = MSE(pred_MLP_test, tf.cast(y_test, tf.float32))
    print(f"Manual MSE (preds:f32, y:f32): {manual_mse_f32}")
    

    This gives:

    1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 110ms/step - loss: 23.0835
    model.evaluate (test): 23.0834903717041
    1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 62ms/step
    Manual MSE (preds:f32, y:f64): 23.08349212393938
    Manual MSE (preds:f32, y:f32): 23.0834903717041
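The dtype promotion behind this can be reproduced without a model. The following is a minimal NumPy-only sketch; the arrays are random stand-ins for the predictions and targets (an assumption for illustration, not the data from the question), and mse is a hypothetical helper that mirrors the custom MSE function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: predictions come back as float32, while targets
# built with NumPy default to float64
preds_f32 = rng.normal(size=(100, 1)).astype(np.float32)
y_f64 = rng.normal(size=(100, 1))

def mse(a, b):
    # Mixed float32/float64 inputs are promoted to float64 by the subtraction
    return np.mean((a - b) ** 2)

mse_mixed = mse(preds_f32, y_f64)                    # computed in float64
mse_f32 = mse(preds_f32, y_f64.astype(np.float32))   # computed in float32

print(mse_mixed.dtype, mse_f32.dtype)  # float64 float32
print("absolute difference:", abs(float(mse_mixed) - float(mse_f32)))
print("float32 machine epsilon:", np.finfo(np.float32).eps)
```

The two results differ only at the level of float32 rounding, which is the same kind of discrepancy as in the question. Casting the data to float32 once, before training and evaluation, avoids the mixed-precision computation altogether.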