Tags: python, tensorflow, machine-learning, image-classification, data-augmentation

Data augmentation not increasing dataset size


I am creating a machine learning model to classify images, and I am creating my datasets. I have a folder that contains my train, test, and validation datasets (train_ds, test_ds, val_ds respectively). I then define data augmentation as follows:

import tensorflow as tf
from tensorflow.keras import layers

tf.random.set_seed(42)

data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.2),
    layers.RandomZoom(
        height_factor=(-0.3, -0.03),
        width_factor=None),
])

I then update train_ds to apply the augmentation as follows. This code also shuffles train_ds and enables prefetching with AUTOTUNE for all three datasets:

AUTOTUNE = tf.data.AUTOTUNE

resize_and_rescale = tf.keras.Sequential([
  layers.Resizing(224, 224),
  layers.Rescaling(1./255)
])

def prepare(ds, shuffle=False, augment=False):
  # Resize and rescale all datasets.
  ds = ds.map(lambda x, y: (resize_and_rescale(x), y),
              num_parallel_calls=AUTOTUNE)

  if shuffle:
    ds = ds.shuffle(1000)

  # Use data augmentation only on the training set.
  if augment:
    ds = ds.map(lambda x, y: (data_augmentation(x, training=True), y),
                num_parallel_calls=AUTOTUNE)

  # Use buffered prefetching on all datasets.
  return ds.prefetch(buffer_size=AUTOTUNE)


train_ds = prepare(train_ds, shuffle=True, augment=True)
val_ds = prepare(val_ds)
test_ds = prepare(test_ds)

When I check to see my train_ds size after augmentation, it remains the original size it was before augmentation. Shouldn't it be at least 4 times the original size, as 3 data augmentation methods were applied?

Note 1: I define my datasets as follows:

import tensorflow as tf

# Define data directories
train_dir = 'MyDatabase/train/' 
val_dir = 'MyDatabase/val/'
test_dir = 'MyDatabase/test/'

# Generate datasets
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    )

val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    val_dir,
    batch_size=batch_size,
    )

test_ds = tf.keras.preprocessing.image_dataset_from_directory(
    test_dir,
    batch_size=batch_size,
    )

Note 2: I do not want to add my data_augmentation inside my model's layers, as I want to save the datasets in folders separately rather than apply augmentation every time I run my model.

To see my dataset size, I use a back-of-the-envelope calculation, where I multiply cardinality (the number of batches) by batch size:

train_size = tf.data.experimental.cardinality(train_ds).numpy()
print(train_size*batch_size)

val_size = tf.data.experimental.cardinality(val_ds).numpy()
print(val_size*batch_size)

test_size = tf.data.experimental.cardinality(test_ds).numpy()
print(test_size*batch_size)

The number of images in train_ds is the same before and after augmentation.
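For reference, here is a minimal sketch that reproduces the observation on toy data. Multiplying cardinality by batch size overcounts whenever the last batch is partial, so the sketch also counts images exactly by unbatching. The toy tensors, the helper name count_images, and the use of tf.image.flip_left_right as a stand-in for the augmentation step are assumptions for illustration, not my actual pipeline.

```python
import tensorflow as tf

batch_size = 4  # toy value, assumed for this example

# 10 fake 8x8 RGB images with integer labels stand in for train_ds.
images = tf.zeros([10, 8, 8, 3])
labels = tf.zeros([10], dtype=tf.int32)
toy_ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(batch_size)


def count_images(ds):
    """Count individual images in a batched (image, label) dataset."""
    # unbatch() splits each batch back into single (image, label) pairs.
    return sum(1 for _ in ds.unbatch())


# A one-to-one map, standing in for the augmentation step.
flipped_ds = toy_ds.map(lambda x, y: (tf.image.flip_left_right(x), y))

# Estimate: number of batches times batch size (counts a partial batch as full).
estimate = int(toy_ds.cardinality().numpy()) * batch_size
exact_before = count_images(toy_ds)
exact_after = count_images(flipped_ds)
print(estimate, exact_before, exact_after)
```

On this toy data the two exact counts match, which is the same behaviour I see with my real train_ds.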

When I merge train_ds with an augmented_ds as follows, I only get double the size rather than 4 times the original dataset size.

augmented_ds = prepare(train_ds, shuffle=True, augment=True)  # augment train_ds
train_ds = prepare(train_ds, shuffle=True, augment=False)  # keep the original train_ds
val_ds = prepare(val_ds)
test_ds = prepare(test_ds)

combined_ds = tf.data.Dataset.concatenate(augmented_ds, train_ds)

And I feel I should not be combining the datasets.


Solution

  • This will not increase the size:

    data_augmentation = tf.keras.Sequential([
        layers.RandomFlip("horizontal_and_vertical"),
        layers.RandomRotation(0.2),
        layers.RandomZoom(
            height_factor=(-0.3, -0.03),
            width_factor=None),
    ])
    

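    Sequential chains the layers: each layer takes the output of the previous one as its input, so every image passes through all three transforms and comes out as one augmented image. Mapping it over a dataset is therefore one-to-one, and the result is a single dataset of the same size. Note: Sequential groups a linear stack of layers into a Model.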

    You could instead create 3 separate data augmentations like this:

    data_augmentation_1 = tf.keras.Sequential([
        layers.RandomFlip("horizontal_and_vertical"),
    ])

    data_augmentation_2 = tf.keras.Sequential([
        layers.RandomRotation(0.2),
    ])

    data_augmentation_3 = tf.keras.Sequential([
        layers.RandomZoom(
            height_factor=(-0.3, -0.03),
            width_factor=None),
    ])
    

    Then modify the prepare function to concatenate these outputs into a single dataset.

    Something like this:

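    if augment:
        ds1 = ds.map(lambda x, y: (data_augmentation_1(x, training=True), y),
                     num_parallel_calls=AUTOTUNE)
        ds2 = ds.map(lambda x, y: (data_augmentation_2(x, training=True), y),
                     num_parallel_calls=AUTOTUNE)
        ds3 = ds.map(lambda x, y: (data_augmentation_3(x, training=True), y),
                     num_parallel_calls=AUTOTUNE)

        # tf.data datasets are joined with Dataset.concatenate (pd.concat is
        # for pandas DataFrames). This gives 3x the original size; start the
        # chain with ds to keep the unaugmented images too (4x).
        ds = ds1.concatenate(ds2).concatenate(ds3)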
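
A self-contained sketch of the idea on toy data, in case it helps to check the sizes before running it on the real images. It keeps the original images and appends one augmented copy per layer, so the result should be 4 times the original size. The toy tensors and the helper names make_map_fn, augment_and_concat and count_images are hypothetical, introduced only for this example; it also assumes a TensorFlow version where RandomFlip, RandomRotation and RandomZoom are available directly under tf.keras.layers.

```python
import tensorflow as tf

# Toy stand-in for train_ds: 6 fake 8x8 RGB images in batches of 2.
images = tf.random.uniform([6, 8, 8, 3], seed=0)
labels = tf.range(6)
ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(2)

# One separate augmenter per transform, as in the answer above.
augmenters = [
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),
    tf.keras.layers.RandomRotation(0.2),
    tf.keras.layers.RandomZoom(height_factor=(-0.3, -0.03), width_factor=None),
]


def make_map_fn(aug):
    """Build a map function that applies one augmenter to the images only."""
    def map_fn(x, y):
        return aug(x, training=True), y
    return map_fn


def augment_and_concat(ds, augmenters):
    """Return ds followed by one augmented copy of ds per augmenter."""
    combined = ds  # keep the original, unaugmented images
    for aug in augmenters:
        augmented = ds.map(make_map_fn(aug),
                           num_parallel_calls=tf.data.AUTOTUNE)
        combined = combined.concatenate(augmented)
    return combined


def count_images(ds):
    """Count individual images by unbatching and iterating once."""
    return sum(1 for _ in ds.unbatch())


combined = augment_and_concat(ds, augmenters)
original_count = count_images(ds)
combined_count = count_images(combined)
print(original_count, combined_count)
```

Because the layers are random, the augmented copies are redrawn each time the dataset is iterated. If the goal is a fixed set of images to store on disk, the combined dataset would need to be written out once and reloaded, rather than re-run through this pipeline.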