I need to compute a weighted F1-score that penalizes errors on my least popular label more heavily (a typical binary classification problem with an imbalanced dataset). Unfortunately, I don't get a valid F1-score. The following are my metric functions:
from keras import backend as K
from keras.optimizers import RMSprop

def sensitivity(y_true, y_pred):
    # true positive rate: TP / (TP + FN)
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (possible_positives + K.epsilon())

def specificity(y_true, y_pred):
    # true negative rate: TN / (TN + FP)
    true_negatives = K.sum(K.round(K.clip((1 - y_true) * (1 - y_pred), 0, 1)))
    possible_negatives = K.sum(K.round(K.clip(1 - y_true, 0, 1)))
    return true_negatives / (possible_negatives + K.epsilon())
def f1(y_true, y_pred):
    def recall(y_true, y_pred):
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
        recall = true_positives / (possible_positives + K.epsilon())
        return recall

    def precision(y_true, y_pred):
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
        precision = true_positives / (predicted_positives + K.epsilon())
        return precision

    precision = precision(y_true, y_pred)
    recall = recall(y_true, y_pred)
    return 2 * ((precision * recall) / (precision + recall))
model.compile(loss='binary_crossentropy',
              optimizer=RMSprop(0.001),
              metrics=[sensitivity, specificity, 'accuracy', f1])
Here is how I train the model and run the evaluation:
model.fit(x_train, y_train, epochs=12, batch_size=32, verbose=1, class_weight=class_weights_dict, validation_split=0.3)
classes = model.predict(x_test)
loss_and_metrics = model.evaluate(x_test, y_test, batch_size=128, verbose=1)
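Here class_weights_dict maps each label to its weight; it was built with something like the following (a minimal sketch using scikit-learn's compute_class_weight, assuming y_train is a 1-D array of 0/1 labels):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 'balanced' weights each class inversely proportional to its frequency
label_values = np.unique(y_train)
weights = compute_class_weight('balanced', classes=label_values, y=y_train)
class_weights_dict = dict(zip(label_values, weights))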
I always get nan as the F1-score. Is something wrong conceptually or programmatically? The data are the same I used with a scikit-learn classifier (an SVM), and that worked.
These are the results:
Epoch 1/12
5133/5133 [==============================] - 5s 976us/step - loss: 0.6955 - sensitivity: 0.0561 - specificity: 0.9377 - acc: 0.8712 - f1: nan - val_loss: 0.6884 - val_sensitivity: 0.8836 - val_specificity: 0.0000e+00 - val_acc: 0.0723 - val_f1: nan
Epoch 2/12
5133/5133 [==============================] - 5s 894us/step - loss: 0.6954 - sensitivity: 0.3865 - specificity: 0.5548 - acc: 0.5398 - f1: nan - val_loss: 0.6884 - val_sensitivity: 0.0000e+00 - val_specificity: 1.0000 - val_acc: 0.9277 - val_f1: nan
Epoch 3/12
5133/5133 [==============================] - 5s 925us/step - loss: 0.6953 - sensitivity: 0.3928 - specificity: 0.5823 - acc: 0.5696 - f1: nan - val_loss: 0.6884 - val_sensitivity: 0.0000e+00 - val_specificity: 1.0000 - val_acc: 0.9277 - val_f1: nan
Epoch 4/12
5133/5133 [==============================] - 5s 935us/step - loss: 0.6954 - sensitivity: 0.1309 - specificity: 0.8504 - acc: 0.7976 - f1: nan - val_loss: 0.6884 - val_sensitivity: 0.0000e+00 - val_specificity: 1.0000 - val_acc: 0.9277 - val_f1: nan
etc.
Final result:
[0.6859536773606656, 0.0, 1.0, 0.9321705426356589, nan]
Regarding the nan in your f1 metric:
If you look at the log, your validation sensitivity is 0, which means your precision and recall are both zero as well. So in the f1 calculation you are dividing by zero and getting a nan.
Add K.epsilon() to the denominator, as you have done in the other functions.
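Concretely, the return line of f1 becomes:

return 2 * ((precision * recall) / (precision + recall + K.epsilon()))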
On a side note, judging by your loss, which showed only a negligible improvement on the training set, your network has learnt nothing. I'd advise you to start by increasing the number of epochs, making the network deeper, and not passing anything to the class_weight argument for now (you mention not using weighted computation yet, but your code does set class weights).
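As a starting point, a deeper model could look something like this (a sketch only; the layer widths are placeholders to tune, and input_dim assumes x_train is a 2-D feature matrix):

from keras.models import Sequential
from keras.layers import Dense, Dropout

# Hypothetical deeper baseline for binary classification
model = Sequential([
    Dense(128, activation='relu', input_dim=x_train.shape[1]),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(1, activation='sigmoid'),  # single sigmoid unit for binary_crossentropy
])

Then compile and fit it as before, with more epochs and without class_weight.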