I used the following lines to fetch my test dataset:
test_ds = keras.utils.image_dataset_from_directory(img_path, image_size=image_size, batch_size=batch_size)
When I run my model on this, I get the following stats: Accuracy = 0.5214362272240086, Precision = 0.5950113378684807, F1-score = 0.5434962717481359
However, when I load my dataset in this way:
_, new_images = keras.utils.image_dataset_from_directory(img_path, shuffle=True, subset="both", seed=1, validation_split=0.9999, image_size=image_size, batch_size=batch_size)
The performance stats are: Accuracy = 0.9635388739946381, Precision = 0.9658291457286432, F1-score = 0.96875
Why would this be happening? Has anyone had a similar experience?
Edit: code used to get the above metrics:
import numpy as np
import sklearn.metrics
import tensorflow as tf

predict = model.predict(new_images)
actual = tf.concat([y for x, y in new_images], axis=0).numpy().tolist()
# Get optimal threshold from the ROC curve
fpr, tpr, thresholds = sklearn.metrics.roc_curve(actual, predict)
# Youden's index
J = tpr - fpr
# Optimal threshold
threshold = thresholds[np.argmax(J)]
# Use threshold to binarize the predictions
predicted = [1 if res > threshold else 0 for res in predict]
# Metrics
print(sklearn.metrics.accuracy_score(actual, predicted), sklearn.metrics.f1_score(actual, predicted), sklearn.metrics.precision_score(actual, predicted))
The reason you get such different results is that test_ds is treated as a training set and therefore gets reshuffled every time the dataset is iterated. new_images, by contrast, is explicitly retrieved as a validation set, so it is shuffled once and then never again.

So when you call

predict = model.predict(new_images)
actual = tf.concat([y for x, y in new_images], axis=0).numpy().tolist()

the dataset is iterated twice: once by the predict call and once by the actual call. For the new_images dataset you get the same sample order for both calls. But for the test_ds dataset you get a different order of samples for predict and actual, and with it a label mismatch. This explains the bad accuracy.
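If you want to avoid the mismatch without the validation-split workaround, here is a minimal sketch (assuming the same img_path, image_size, batch_size, and model as in your question): either load the test set with shuffle=False so the iteration order is deterministic, or pair predictions and labels in a single pass so two separate iterations can never drift apart.

import numpy as np
from tensorflow import keras

# Option 1: shuffle=False gives the same order on every iteration
test_ds = keras.utils.image_dataset_from_directory(
    img_path, shuffle=False, image_size=image_size, batch_size=batch_size
)

# Option 2: collect predictions and labels batch by batch in one pass,
# so they always line up regardless of shuffling
predict, actual = [], []
for x, y in test_ds:
    predict.append(model.predict(x, verbose=0))
    actual.append(y.numpy())
predict = np.concatenate(predict).ravel()
actual = np.concatenate(actual)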
As for the reshuffling behavior itself, I tested it with the following piece of code. You need an image folder for it; I don't know how to mock that (I had 13 images in mine):
import tensorflow as tf

df = tf.keras.preprocessing.image_dataset_from_directory(
    r'path//to//images',
    labels=list(range(13)),  # mock labels: every image gets a distinct label
    shuffle=True, seed=1
)
_, df2 = tf.keras.preprocessing.image_dataset_from_directory(
    r'path//to//images',
    labels=list(range(13)),  # mock labels: every image gets a distinct label
    shuffle=True, seed=1,
    subset='both', validation_split=.999
)
print([x[1] for x in df][0].numpy())
# prints: [ 7 5 8 0 11 1 4 6 3 2 10 9 12]
print([x[1] for x in df][0].numpy())
# prints: [ 8 9 6 0 7 12 10 5 4 3 2 1 11]
print([x[1] for x in df2][0].numpy())
# prints: [ 3 4 10 1 6 0 7 12 9 8 11 5]
print([x[1] for x in df2][0].numpy())
# prints: [ 3 4 10 1 6 0 7 12 9 8 11 5]
Note how the label order changes between the two calls for df, but not for df2. That means the predictions and labels in your example belong to different images in test_ds. The seed only fixes the sequence of (different) shuffles for test_ds, so in this example you'll always get [ 7 5 8...] first and [ 8 9 6..] second.
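As a quick sanity check (same assumptions as above, i.e. the same folder of 13 mock-labeled images), rebuilding the dataset with the same seed should reproduce the same sequence of shuffles:

# Rebuilding the dataset with seed=1 restarts the same shuffle sequence,
# so its first iteration order should match df's first order again.
df3 = tf.keras.preprocessing.image_dataset_from_directory(
    r'path//to//images',
    labels=list(range(13)),  # same mock labels as above
    shuffle=True, seed=1
)
print([x[1] for x in df3][0].numpy())
# expected: [ 7 5 8 0 11 1 4 6 3 2 10 9 12] (same as df's first iteration)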
(PS: This is a good example of why including more than one line of code in a question is usually a good idea; otherwise I would not have come up with the solution ;) )