Tags: python, tensorflow, keras, deep-learning, image-classification

Different results when only test dataset is loaded using `keras.utils.image_dataset_from_directory`


I used the following lines to fetch my test dataset:

test_ds = keras.utils.image_dataset_from_directory(img_path, image_size=image_size, batch_size = batch_size)

When I run my model on this, I get the following stats: Accuracy = 0.5214362272240086, Precision = 0.5950113378684807, F1-score = 0.5434962717481359

However, when I load my dataset in this way:

_, new_images = keras.utils.image_dataset_from_directory(img_path, shuffle=True, subset="both", seed=1, validation_split=0.9999, image_size=image_size, batch_size = batch_size)

The performance stats are: Accuracy = 0.9635388739946381, Precision = 0.9658291457286432, F1-score = 0.96875

Why would this be happening? Any similar experience?

Edit: code used to get the above metrics:

import numpy as np
import sklearn.metrics
import tensorflow as tf

predict = model.predict(new_images)
actual = tf.concat([y for x, y in new_images], axis=0).numpy().tolist()

# Get optimal threshold
fpr, tpr, thresholds = sklearn.metrics.roc_curve(actual, predict)

# Youden's index
J = tpr - fpr

# Optimal threshold
threshold = thresholds[np.argmax(J)]

# Use threshold
predicted = [1 if res > threshold else 0 for res in predict]

# Metrics
print(sklearn.metrics.accuracy_score(actual, predicted), sklearn.metrics.f1_score(actual, predicted), sklearn.metrics.precision_score(actual, predicted))

Solution

  • The reason you get such different results is that test_ds is treated as a training set and therefore gets reshuffled every time the dataset is iterated. In new_images, the dataset is explicitly retrieved as a validation subset, so it is shuffled once and then never again.

    So when you call

    predict = model.predict(new_images)
    actual = tf.concat([y for x, y in new_images], axis=0).numpy().tolist()
    

    the dataset gets iterated twice: once for the predict call and once for the actual call. For the new_images dataset, you get the same order both times. But for the test_ds dataset, you get a different order of samples for predict and actual, and with it a label mismatch. This explains the poor accuracy.
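
    One way to avoid the mismatch (not from the original post, just a minimal sketch assuming the model and test_ds from the question) is to iterate the dataset only once and collect labels and predictions from the same pass:

    import numpy as np

    y_true, y_pred = [], []
    for images, labels in test_ds:   # a single pass over the dataset
        y_true.append(labels.numpy())                             # labels of this batch
        y_pred.append(model.predict(images, verbose=0).ravel())   # predictions for the same batch

    y_true = np.concatenate(y_true)
    y_pred = np.concatenate(y_pred)
    # y_true[i] and y_pred[i] now refer to the same image, even if the
    # dataset reshuffles itself on the next iteration.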


    I tested it with the following piece of code. You need an image folder for this; I don't know how to mock one (I had 13 images in mine):

    df = tf.keras.preprocessing.image_dataset_from_directory(
        r'path//to//images',
        labels=list(range(13)), # mock labels: every image gets a different label
        shuffle=True, seed=1
    )
    _, df2 = tf.keras.preprocessing.image_dataset_from_directory(
        r'path//to//images',
        labels=list(range(13)), # mock labels: every image gets a different label
        shuffle=True, seed=1,
        subset='both', validation_split=.999
    )
    print([x[1] for x in df][0].numpy())
    # prints: [ 7  5  8  0 11  1  4  6  3  2 10  9 12]
    print([x[1] for x in df][0].numpy())
    # prints: [ 8  9  6  0  7 12 10  5  4  3  2  1 11]
    print([x[1] for x in df2][0].numpy())
    # prints: [ 3  4 10  1  6  0  7 12  9  8 11  5]
    print([x[1] for x in df2][0].numpy())
    # prints: [ 3  4 10  1  6  0  7 12  9  8 11  5]
    

    Note how the label order changes between the two calls for df, but not for df2. That means the predictions and labels in your example would belong to different images in test_ds. The seed only fixes the sequence of (different) shufflings for test_ds, so you'll always get [ 7 5 8...] first and [ 8 9 6...] second in this example.
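
    If the dataset is only ever used for evaluation, another option (not shown above, just a sketch using the same img_path, image_size and batch_size as in the question) is to disable shuffling when loading it, so the file order stays the same on every iteration:

    from tensorflow import keras

    test_ds = keras.utils.image_dataset_from_directory(
        img_path,
        shuffle=False,           # keep the (sorted) file order on every iteration
        image_size=image_size,
        batch_size=batch_size,
    )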

    (PS: This is a good example of why more than one line of code in a question is usually a good idea; otherwise I would not have come up with the solution ;) )