I have implemented a CNN-based regression model that uses a data generator to handle the huge amount of data I have. Training and evaluation work well, but there is an issue with prediction. If, for example, I want to predict values for a test dataset of 50 samples, I call model.predict with a batch size of 5. The problem is that model.predict returns 5 values repeated 10 times, instead of 50 different values. The same thing happens if I change the batch size to 1: it returns one value repeated 50 times.
To work around this, I used the full test set as a single batch (50 in my example), and it worked. But I can't use this method on my whole test data because it is too large.
Do you have another solution, or can you tell me what is wrong with my approach?
My data generator code:
import numpy as np
import keras

class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'

    def __init__(self, list_IDs, data_X, data_Z, target_y, batch_size=32, dim1=(120,120),
                 dim2=80, n_channels=1, shuffle=True):
        'Initialization'
        self.dim1 = dim1
        self.dim2 = dim2
        self.batch_size = batch_size
        self.data_X = data_X
        self.data_Z = data_Z
        self.target_y = target_y
        self.list_IDs = list_IDs
        self.n_channels = n_channels
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.list_IDs) / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in range(len(indexes))]
        # Generate data
        ([X, Z], y) = self.__data_generation(list_IDs_temp)
        return ([X, Z], y)

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle == True:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temp):
        'Generates data containing batch_size samples'  # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, *self.dim1, self.n_channels))
        Z = np.empty((self.batch_size, self.dim2))
        y = np.empty((self.batch_size,))
        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            X[i,] = np.load('data/' + self.data_X + ID + '.npy')
            Z[i,] = np.load('data/' + self.data_Z + ID + '.npy')
            # Store target
            y[i] = np.load('data/' + self.target_y + ID + '.npy')
        return ([X, Z], y)
How I call model.predict():
predict_params = {'data_X': 'images',
                  'data_Z': 'factors',
                  'target_y': 'True_values',
                  'batch_size': 5,
                  'dim1': (120,120),
                  'dim2': 80,
                  'n_channels': 1,
                  'shuffle': False}
# Prediction generator (test_index is passed as list_IDs)
prediction_generator = DataGenerator(test_index, **predict_params)
prediction_results = model.predict(prediction_generator, steps=1, verbose=1)
If we look at your __getitem__ function, we can see this code:
list_IDs_temp = [self.list_IDs[k] for k in range(len(indexes))]
This line always returns the same IDs, because len(indexes) is the same for every batch (at least as long as all batches contain an equal number of samples), so the comprehension loops over the first few entries of self.list_IDs every time, no matter which batch is requested.
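To make this concrete, here is a minimal standalone sketch (with made-up IDs, and shuffling disabled) of what that line produces for two different batch indexes:

import numpy as np

list_IDs = ['id_a', 'id_b', 'id_c', 'id_d', 'id_e', 'id_f']
indexes = np.arange(len(list_IDs))  # [0 1 2 3 4 5], as with shuffle=False
batch_size = 3

for index in range(2):  # simulate __getitem__ for batch 0 and batch 1
    batch_indexes = indexes[index*batch_size:(index+1)*batch_size]
    list_IDs_temp = [list_IDs[k] for k in range(len(batch_indexes))]
    print(index, list_IDs_temp)
# Both iterations print ['id_a', 'id_b', 'id_c']: the batch index never matters.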
You already extract the indexes of the current batch one line earlier; the comprehension just needs to iterate over those indexes instead of over range(len(indexes)). The following code should work:
    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        # Look up the IDs belonging to exactly those indexes
        list_IDs_temp = [self.list_IDs[k] for k in indexes]
        # Generate data
        ([X, Z], y) = self.__data_generation(list_IDs_temp)
        return ([X, Z], y)
See if this code works and gives you different results. Be prepared for the predictions to look bad at first: the same bug was active during training, so your model has effectively only ever seen the same few data points, and you will most likely need to retrain it.
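One more thing worth checking, independent of the indexing bug: your predict call passes steps=1, which tells Keras to evaluate only a single batch. When you pass a keras.utils.Sequence, you can simply omit steps and Keras will iterate over every batch the generator yields, so you never need to load the whole test set as one batch. A minimal sketch, reusing your prediction_generator:

import numpy as np

# With a Sequence, predict defaults to len(prediction_generator) steps,
# i.e. the whole test set. Note that because __len__ uses np.floor, any
# leftover samples beyond a multiple of batch_size are silently dropped,
# so pick a batch_size that divides the test set size.
prediction_results = model.predict(prediction_generator, verbose=1)

# Quick sanity check: the predictions should no longer repeat.
print(np.unique(prediction_results).size)  # expect roughly the number of test samples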