I am building a machine learning model to classify images, and I am preparing my datasets. I have a folder that contains my train, test, and validation datasets (train_ds, test_ds, and val_ds respectively). I then define data augmentation as follows:
import tensorflow as tf
from tensorflow.keras import layers

tf.random.set_seed(42)
data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.2),
    layers.RandomZoom(
        height_factor=(-0.3, -0.03),
        width_factor=None),
])
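To sanity-check the augmentation, I plot a few augmented versions of one image (a rough sketch; it assumes matplotlib and that train_ds yields (images, labels) batches):
import matplotlib.pyplot as plt

images, _ = next(iter(train_ds))  # take one batch
plt.figure(figsize=(10, 10))
for i in range(9):
    # Each call samples new random flips/rotations/zooms.
    augmented = data_augmentation(images, training=True)
    plt.subplot(3, 3, i + 1)
    plt.imshow(augmented[0].numpy().astype("uint8"))
    plt.axis("off")
plt.show()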
I then transform my train_ds to apply the data augmentation as follows. This code also shuffles train_ds and uses buffered prefetching with AUTOTUNE for all datasets:
AUTOTUNE = tf.data.AUTOTUNE
resize_and_rescale = tf.keras.Sequential([
    layers.Resizing(224, 224),
    layers.Rescaling(1./255)
])
def prepare(ds, shuffle=False, augment=False):
    # Resize and rescale all datasets.
    ds = ds.map(lambda x, y: (resize_and_rescale(x), y),
                num_parallel_calls=AUTOTUNE)
    if shuffle:
        ds = ds.shuffle(1000)
    # Use data augmentation only on the training set.
    if augment:
        ds = ds.map(lambda x, y: (data_augmentation(x, training=True), y),
                    num_parallel_calls=AUTOTUNE)
    # Use buffered prefetching on all datasets.
    return ds.prefetch(buffer_size=AUTOTUNE)
train_ds = prepare(train_ds, shuffle=True, augment=True)
val_ds = prepare(val_ds)
test_ds = prepare(test_ds)
When I check my train_ds size after augmentation, it remains the original size it was before augmentation. Shouldn't it be at least 4 times the original size, since three data augmentation methods were applied?
Note 1: I define my datasets as follows:
import tensorflow as tf

# Define data directories
train_dir = 'MyDatabase/train/'
val_dir = 'MyDatabase/val/'
test_dir = 'MyDatabase/test/'

# Generate datasets
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    val_dir,
    batch_size=batch_size,
)
test_ds = tf.keras.preprocessing.image_dataset_from_directory(
    test_dir,
    batch_size=batch_size,
)
Note 2: I do not want to add my data_augmentation inside my model's layers, as I want to save the datasets in folders separately rather than apply augmentation every time I run my model.
To see my dataset size, I first use a back-of-the-envelope calculation, where I multiply the cardinality (the number of batches) by the batch size:
train_size = tf.data.experimental.cardinality(train_ds).numpy()
print(train_size*batch_size)
val_size = tf.data.experimental.cardinality(val_ds).numpy()
print(val_size*batch_size)
test_size = tf.data.experimental.cardinality(test_ds).numpy()
print(test_size*batch_size)
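Since cardinality counts batches and the last batch may be partial, this is only an upper bound; for an exact image count I also unbatch and count (a slow but unambiguous sketch):
exact_train_size = sum(1 for _ in train_ds.unbatch())
print(exact_train_size)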
The number of images in train_ds is the same before and after augmentation.
When I merge train_ds with an augmented_ds as follows, I only get double the size (N original batches + N augmented batches = 2N) rather than 4 times the original dataset size.
augmented_ds = prepare(train_ds, shuffle=True, augment=True)  # augment the train_ds
train_ds = prepare(train_ds, shuffle=True, augment=False)     # keep the original train_ds
val_ds = prepare(val_ds)
test_ds = prepare(test_ds)
combined_ds = augmented_ds.concatenate(train_ds)
And I feel I should not be combining the datasets this way.
This data_augmentation pipeline will not increase the size of your dataset:
data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.2),
    layers.RandomZoom(
        height_factor=(-0.3, -0.03),
        width_factor=None),
])
It feeds the output of each layer into the next layer. So the final output is a single dataset of the same size: each input image produces exactly one output image, which has been flipped, then rotated, then zoomed. Note: Sequential groups a linear stack of layers into a single model.
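You can verify this with a dummy batch (a minimal sketch; the 224x224 shape just matches the resize step above):
dummy = tf.random.uniform((8, 224, 224, 3))    # a fake batch of 8 images
out = data_augmentation(dummy, training=True)  # flip -> rotate -> zoom, applied in sequence
print(out.shape)                               # (8, 224, 224, 3): still 8 images, just transformed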
You could instead create three different data augmentations like this:
data_augmentation_1 = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical")
])
data_augmentation_2 = tf.keras.Sequential([
    layers.RandomRotation(0.2),
])
data_augmentation_3 = tf.keras.Sequential([
    layers.RandomZoom(
        height_factor=(-0.3, -0.03),
        width_factor=None),
])
Then you should modify the prepare function accordingly to concatenate these outputs into a single dataset.
Something like this:
if augment:
    ds1 = ds.map(lambda x, y: (data_augmentation_1(x, training=True), y),
                 num_parallel_calls=AUTOTUNE)
    ds2 = ds.map(lambda x, y: (data_augmentation_2(x, training=True), y),
                 num_parallel_calls=AUTOTUNE)
    ds3 = ds.map(lambda x, y: (data_augmentation_3(x, training=True), y),
                 num_parallel_calls=AUTOTUNE)
    # Concatenate the three augmented datasets into one (~3x the batches).
    ds = ds1.concatenate(ds2).concatenate(ds3)
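If you also want to keep the un-augmented images (giving roughly 4 times the original size, as you expected), concatenate the resized dataset itself as well. A sketch, assuming ds is still the resized-but-unaugmented dataset at that point in prepare:
ds = ds.concatenate(ds1).concatenate(ds2).concatenate(ds3)  # original + 3 augmented copies ~= 4x
Alternatively, since these layers sample a new random transform on every call, ds.repeat(3) before your single augmentation map would also give three differently augmented passes over the data without building three pipelines.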