python, tensorflow, machine-learning, keras, data-generation

Keras custom generator when batch_size doesn't match with amount of data


I'm using Keras with Python 2.7 and writing my own data generator to produce batches for training. I have some questions about the generator, based on this template:

class DataGenerator(keras.utils.Sequence):

    def __init__(self, list_IDs, ...):
        # init

    def __len__(self):
        return int(np.floor(len(self.list_IDs) / self.batch_size))

    def __getitem__(self, index):
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]
        # Generate data
        X, y = self.__data_generation(list_IDs_temp)
        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temp):
        # generate data
        return X, y

Okay, so here are my questions:

Can you confirm my understanding of the order in which these methods are called? Here it is:

- __init__
- loop over each epoch:
    - loop over each batch:
        - __len__
        - __getitem__ (+ data generation)
    - on_epoch_end

If you know a way to debug the generator, I would like to hear it; breakpoints and prints aren't working for this.
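
(A sketch of one way to do this, with an illustrative NoisySequence class that is not part of the original code: drive the Sequence by hand outside of fit, where breakpoints and prints behave normally. Prints inside a generator are often lost when Keras runs it in worker processes.)

import numpy as np
import keras

class NoisySequence(keras.utils.Sequence):
    'A tiny Sequence that announces every call, to observe the call order'
    def __init__(self, n=10, batch_size=4):
        self.n = n
        self.batch_size = batch_size

    def __len__(self):
        print('__len__ called')
        return int(np.ceil(self.n / float(self.batch_size)))

    def __getitem__(self, index):
        print('__getitem__(%d) called' % index)
        X = np.random.rand(self.batch_size, 1)
        y = np.random.rand(self.batch_size, 1)
        return X, y

    def on_epoch_end(self):
        print('on_epoch_end called')

seq = NoisySequence()
print(len(seq))     # triggers __len__
X, y = seq[0]       # triggers __getitem__(0); a breakpoint inside works here
seq.on_epoch_end()  # simulate the end of an epoch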

Also, I've hit an awkward situation, though I think everybody has this problem:

For example, I have 200 samples (and 200 matching labels) and I want a batch size of 64. If I'm thinking about this correctly, __len__ will give 200/64 = 3 (instead of 3.125). So one epoch will be done in 3 batches? What about the rest of the data? I get an error because my amount of data is not a multiple of the batch size...
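
(A quick check of the arithmetic, as an illustrative snippet:)

import numpy as np

n, batch_size = 200, 64
print(int(np.floor(n / float(batch_size))))  # 3 -> only 3*64 = 192 samples seen per epoch
print(int(np.ceil(n / float(batch_size))))   # 4 -> three full batches plus one batch of 8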

Second example: I have 200 samples and I want a batch size of 256. What do I have to do to adapt my generator in this case? I thought about checking whether batch_size is greater than my amount of data and, if so, feeding the CNN a single batch, but that batch would not have the expected size, so I think it would raise an error?
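
(An illustrative aside on why the oversized-batch case need not crash: NumPy clamps a slice that runs past the end of an array rather than raising, so the single batch would simply contain all 200 samples.)

import numpy as np

X = np.random.rand(200, 10)
batch = X[0*256:1*256]  # slice past the end is clamped, not an error
print(batch.shape)      # (200, 10)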

Thank you for reading. I prefer to give pseudo-code because my questions are more about theory than about coding errors!


Solution

  • Normally you never mention the batch size in the model architecture, because it is a training parameter, not a model parameter. So it is OK to have different batch sizes while training.

    Example

    import numpy as np
    import keras
    from keras.models import Sequential
    from keras.layers import Dense, Conv2D, Flatten
    from keras.utils import to_categorical

    # create model
    model = Sequential()
    # add model layers
    model.add(Conv2D(64, kernel_size=3, activation='relu', input_shape=(10,10,1)))
    model.add(Flatten())
    model.add(Dense(2, activation='softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

    class DataGenerator(keras.utils.Sequence):
        def __init__(self, X, y, batch_size):
            self.X = X
            self.y = y
            self.batch_size = batch_size

        def __len__(self):
            # Round up so the final, smaller batch is not dropped
            l = int(len(self.X) / self.batch_size)
            if l*self.batch_size < len(self.X):
                l += 1
            return l

        def __getitem__(self, index):
            # Slicing past the end of the array is clamped, so the last
            # batch simply contains the remaining samples
            X = self.X[index*self.batch_size:(index+1)*self.batch_size]
            y = self.y[index*self.batch_size:(index+1)*self.batch_size]
            return X, y

    X = np.random.rand(200,10,10,1)
    y = to_categorical(np.random.randint(0,2,200))
    model.fit_generator(DataGenerator(X,y,13), epochs=10)
    

    Output:

    Epoch 1/10
    16/16 [==============================] - 0s 2ms/step - loss: 0.6774 - acc: 0.6097

    As you can see, it ran 16 batches in one epoch: 15 full batches of 13 samples plus a final batch of 5, i.e. 13*15 + 5 = 200.
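
    To double-check where those 16 batches come from, here is an illustrative snippet (not part of the original answer) reusing the X, y and DataGenerator defined above:

    gen = DataGenerator(X, y, 13)
    # Batch sizes actually produced: fifteen batches of 13, then a final batch of 5
    print([gen[i][0].shape[0] for i in range(len(gen))])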