
Why does the combination of model subclassing and TFRecord not work?


The short version of the question

Why does training fail when I train a model implemented through subclassing (in Keras) on a dataset that is saved and loaded via TFRecord?

The complete version of the question

I have the following model (first, let's define it using the functional API):

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
from tensorflow.keras.layers import Input

def get_model():
    input_layer = Input(shape=(6,), name="input")

    x = input_layer

    x = layers.Dense(128, activation='relu', name="dense_1")(x)
    x = layers.Dense(1024, activation='relu', name="dense_2")(x)
    x = layers.Dense(5120, activation='relu', name="dense_3")(x)

    a_out = layers.Dense(17, activation='softmax', name='a_out')(x)
    b_out = layers.Dense(27, activation='softmax', name='b_out')(x)
    c_out = layers.Dense(71, activation='softmax', name='c_out')(x)
    d_out = layers.Dense(29, activation='softmax', name='d_out')(x)

    model = models.Model(input_layer, [a_out, b_out, c_out, d_out])

    model.compile(optimizer='rmsprop',
                  loss=('sparse_categorical_crossentropy',
                        'sparse_categorical_crossentropy',
                        'sparse_categorical_crossentropy',
                        'sparse_categorical_crossentropy'))
    
    return model

It takes a tensor of shape (6,) and produces four outputs, a_out, b_out, c_out, and d_out, each of which is an integer class label (a categorical output). Next, I'm going to define a dummy/random dataset to train this model with:

sample_count = 1000
inputs = np.random.rand(sample_count, 6).astype(np.float32)
targets = (
    np.random.randint(low=0, high=16, size=(sample_count,), dtype=np.int64),
    np.random.randint(low=0, high=26, size=(sample_count,), dtype=np.int64),
    np.random.randint(low=0, high=70, size=(sample_count,), dtype=np.int64),
    np.random.randint(low=0, high=28, size=(sample_count,), dtype=np.int64)
)
random_dataset = tf.data.Dataset.from_tensor_slices((inputs, targets))

for rec in random_dataset:
    print(rec)
    break

If you call the functional API model's fit method with this dataset (batched), it trains just fine (a sketch follows below). Also, the print statement from the previous code block outputs something like this:

(<tf.Tensor: shape=(6,), dtype=float32, numpy=
array([0.326234  , 0.9935627 , 0.65569717, 0.05908937, 0.7490394 ,
       0.7929646 ], dtype=float32)>, (<tf.Tensor: shape=(1,), dtype=int64, numpy=array([5])>, <tf.Tensor: shape=(1,), dtype=int64, numpy=array([5])>, <tf.Tensor: shape=(1,), dtype=int64, numpy=array([60])>, <tf.Tensor: shape=(1,), dtype=int64, numpy=array([9])>))
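
For reference, the training call can look like this (a minimal sketch; the batch size and epoch count are arbitrary choices):

model = get_model()
model.fit(random_dataset.batch(32), epochs=1)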

Now, let's save and load the same dataset using TFRecord:

# Saving the random dataset into a TFRecord file
def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    # If the value is an eager tensor BytesList won't unpack a string from an EagerTensor.
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy() 
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

file_path = 'random.tfrec'
with tf.io.TFRecordWriter(file_path) as writer:
    for rec in random_dataset:
        feature = {
            'input': _bytes_feature(tf.io.serialize_tensor(rec[0])),
            'a_out': _int64_feature(rec[1][0]),
            'b_out': _int64_feature(rec[1][1]),
            'c_out': _int64_feature(rec[1][2]),
            'd_out': _int64_feature(rec[1][3]),
        }

        example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(example_proto.SerializeToString())
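
As a quick sanity check (a sketch; not part of the original pipeline), one can count the records just written:

print(sum(1 for _ in tf.data.TFRecordDataset(file_path)))  # expected: 1000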

# Load the dataset from the file just created
def read_tfrecord(serialized_example):
    feature_description = {
        'input': tf.io.FixedLenFeature((), tf.string),
        'a_out': tf.io.FixedLenFeature((), tf.int64),
        'b_out': tf.io.FixedLenFeature((), tf.int64),
        'c_out': tf.io.FixedLenFeature((), tf.int64),
        'd_out': tf.io.FixedLenFeature((), tf.int64)
    }

    example = tf.io.parse_single_example(serialized_example, feature_description)

    return tf.io.parse_tensor(example['input'], out_type=tf.float32), (
        example["a_out"],
        example["b_out"],
        example["c_out"],
        example["d_out"])

tfrecord_dataset = tf.data.TFRecordDataset(file_path).map(read_tfrecord)

for rec in tfrecord_dataset:
    print(rec)
    break

The last print statement is just a sanity check to make sure that the dataset was not distorted during the serialization process. It outputs something like:

(<tf.Tensor: shape=(6,), dtype=float32, numpy=
array([0.326234  , 0.9935627 , 0.65569717, 0.05908937, 0.7490394 ,
       0.7929646 ], dtype=float32)>, (<tf.Tensor: shape=(1,), dtype=int64, numpy=array([5])>, <tf.Tensor: shape=(1,), dtype=int64, numpy=array([5])>, <tf.Tensor: shape=(1,), dtype=int64, numpy=array([60])>, <tf.Tensor: shape=(1,), dtype=int64, numpy=array([9])>))

which is identical to the original dataset's first record. And if I feed this tfrecord_dataset to the functional API model, it still trains just fine. Next, I'm going to define the same model using inheritance (a.k.a. subclassing):

class SubclassModel(keras.Model):
    def __init__(self):
        super(SubclassModel, self).__init__()

        self.d1 = layers.Dense(128, activation='relu', name="dense_1")
        self.d2 = layers.Dense(1024, activation='relu', name="dense_2")
        self.d3 = layers.Dense(5120, activation='relu', name="dense_3")

        self.a_out = layers.Dense(17, activation='softmax', name='a_out')
        self.b_out = layers.Dense(27, activation='softmax', name='b_out')
        self.c_out = layers.Dense(71, activation='softmax', name='c_out')
        self.d_out = layers.Dense(29, activation='softmax', name='d_out')

        self.build((None, 6,))
        self.compile(optimizer='rmsprop',
                     loss=('sparse_categorical_crossentropy',
                           'sparse_categorical_crossentropy',
                           'sparse_categorical_crossentropy',
                           'sparse_categorical_crossentropy'))

    def call(self, inputs, training=True):
        x = inputs
        
        x = self.d1(x)
        x = self.d2(x)
        x = self.d3(x)
        
        a = self.a_out(x)
        b = self.b_out(x)
        c = self.c_out(x)
        d = self.d_out(x)

        return a, b, c, d

Here's the punch line. Now I have two different ways of creating the model (functional API and inheritance) and two different datasets (random_dataset and tfrecord_dataset). That makes four different combinations (sketched in code after the list):

  1. Training functional API model using random_dataset: works fine
  2. Training functional API model using tfrecord_dataset: works fine
  3. Training SubclassModel using random_dataset: works fine
  4. Training SubclassModel using tfrecord_dataset: fails!
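
A minimal sketch of all four combinations (batch size and epoch count are arbitrary illustration choices):

get_model().fit(random_dataset.batch(32), epochs=1)         # 1. works
get_model().fit(tfrecord_dataset.batch(32), epochs=1)       # 2. works
SubclassModel().fit(random_dataset.batch(32), epochs=1)     # 3. works
SubclassModel().fit(tfrecord_dataset.batch(32), epochs=1)   # 4. raises TypeError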

Here's the error I face (truncated):

TypeError: in user code:

    File "/home/mehran/.pyenv/versions/3.8.12/envs/jupyter/lib/python3.8/site-packages/keras/engine/training.py", line 878, in train_function  *
        return step_function(self, iterator)
    File "/home/mehran/.pyenv/versions/3.8.12/envs/jupyter/lib/python3.8/site-packages/keras/engine/training.py", line 867, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/mehran/.pyenv/versions/3.8.12/envs/jupyter/lib/python3.8/site-packages/keras/engine/training.py", line 860, in run_step  **
        outputs = model.train_step(data)
    File "/home/mehran/.pyenv/versions/3.8.12/envs/jupyter/lib/python3.8/site-packages/keras/engine/training.py", line 808, in train_step
        y_pred = self(x, training=True)
    File "/home/mehran/.pyenv/versions/3.8.12/envs/jupyter/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
        raise e.with_traceback(filtered_tb) from None

    TypeError: Exception encountered when calling layer "subclass_model_1" (type SubclassModel).
    
    in user code:
    
        File "/tmp/ipykernel_22298/1542980101.py", line 28, in call  *
            a = self.a_out(x)
        File "/home/mehran/.pyenv/versions/3.8.12/envs/jupyter/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler  **
            raise e.with_traceback(filtered_tb) from None
        File "/home/mehran/.pyenv/versions/3.8.12/envs/jupyter/lib/python3.8/site-packages/keras/activations.py", line 78, in softmax
            if x.shape.rank > 1:
    
        TypeError: Exception encountered when calling layer "a_out" (type Dense).
        
        '>' not supported between instances of 'NoneType' and 'int'
        
        Call arguments received:
          • inputs=tf.Tensor(shape=<unknown>, dtype=float32)
    
    
    Call arguments received:
      • inputs=tf.Tensor(shape=<unknown>, dtype=float32)
      • training=True

Does anyone have any idea what I am doing wrong?


Solution

  • For anyone else who might be facing the same problem, the solution is to reshape the tensors to match their intended shape when you are reading the TFRecords:

    def read_tfrecord(serialized_example):
        feature_description = {
            'input': tf.io.FixedLenFeature((), tf.string),
            'a_out': tf.io.FixedLenFeature((), tf.int64),
            'b_out': tf.io.FixedLenFeature((), tf.int64),
            'c_out': tf.io.FixedLenFeature((), tf.int64),
            'd_out': tf.io.FixedLenFeature((), tf.int64)
        }
    
        example = tf.io.parse_single_example(serialized_example, feature_description)
    
        return tf.reshape(tf.io.parse_tensor(example['input'], out_type=tf.float32), (6,)), (
            example["a_out"],
            example["b_out"],
            example["c_out"],
            example["d_out"])
    

    Why the functional API does not complain about this while the subclassed model does is beyond me. A likely explanation: tf.io.parse_tensor cannot infer the static shape of the decoded tensor, so the input component of each element has shape <unknown> (visible in the "Call arguments received" part of the traceback). A functional model carries an explicit input spec from Input(shape=(6,)), which lets Keras restore the static shape when the model is called, while a subclassed model has no input spec, so the unknown rank reaches the softmax activation and its x.shape.rank > 1 check fails.
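
    A quick way to see what the reshape changes (a sketch, using the names defined above) is to inspect the dataset's element_spec:

    # With the original read_tfrecord, the input component prints as
    # TensorSpec(shape=<unknown>, dtype=tf.float32, name=None); with the
    # reshape added, it becomes TensorSpec(shape=(6,), dtype=tf.float32, name=None).
    print(tf.data.TFRecordDataset(file_path).map(read_tfrecord).element_spec)

    tf.ensure_shape should also work in place of tf.reshape here, since the decoded tensor already holds the right values and only its static shape is missing.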