How to use TF CTC loss with variable length features and labels

I want to implement a speech recognizer with CTC loss in TensorFlow. The input features have variable lengths because each speech utterance can have a different length. The labels also have variable length because each transcription is different. I manually pad the features to create the batches, and in my model I have a tf.keras.layers.Masking() layer to create and propagate the mask through the network. I also create the labels batch with padding.

Here is a dummy example. Let's imagine that I have two utterances of length 3 and 5 frames respectively. Each frame is represented by one single feature (normally this would be 13 MFCCs but I reduce it to one to keep it simple). So to create the batch I pad the short utterance with 0 at the end:

features = np.array([[1.5, 2.3, 4.6, 0.0, 0.0],
                     [1.7, 2.6, 3.4, 2.3, 1.0]])

The labels are the transcriptions of these utterances. Let's say that their lengths are 2 and 3 respectively. The labels batch shape will be [2, 3, 26], where 2 is the batch size, 3 is the maximum length and 26 is the number of characters in English (one-hot encoding).
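The padding described above can be sketched with plain NumPy. This is only an illustration: the helper name pad_batch is hypothetical, the label values are made up, and it assumes integer-coded labels where 0 is reserved for padding (rather than the one-hot encoding mentioned above). Keeping the true lengths alongside the padded arrays means they do not have to be recovered from the padding later.

```python
import numpy as np


def pad_batch(sequences, pad_value=0, dtype=np.float32):
    """Right-pad variable-length 1-D sequences to a common length.

    Returns the padded (batch, max_len) array and the original lengths.
    (Hypothetical helper, not part of TensorFlow or NumPy.)
    """
    lengths = np.array([len(s) for s in sequences], dtype=np.int32)
    padded = np.full((len(sequences), lengths.max()), pad_value, dtype=dtype)
    for i, seq in enumerate(sequences):
        padded[i, :len(seq)] = seq
    return padded, lengths


# Two utterances of 3 and 5 frames, one feature per frame.
features, feature_lengths = pad_batch([[1.5, 2.3, 4.6],
                                       [1.7, 2.6, 3.4, 2.3, 1.0]])
# Integer-coded transcriptions of length 2 and 3 (made-up values).
labels, label_lengths = pad_batch([[3, 1], [8, 5, 12]], dtype=np.int32)

print(features.shape, feature_lengths)
print(labels.shape, label_lengths)
```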

The model is:

input_ = tf.keras.Input(shape=(None,1))
x = tf.keras.layers.Masking()(input_)
x = tf.keras.layers.GRU(26, return_sequences=True)(input_)
output_ = tf.keras.layers.Softmax(axis=-1)(x)
model = tf.keras.Model(input_,output_)

The loss function is something like:

def ctc_loss(y_true, y_pred):
    # Do something here to get logit_length and label_length?
    # ...
    loss = tf.keras.backend.ctc_batch_cost(y_true, y_pred, logit_length, label_length)
    return loss

My question is how to get logit_length and label_length. I would suppose that logit_length is encoded in the mask, but if I do y_pred._keras_mask, the result is None. For label_length, the information is in the tensor itself, but I'm not sure of the most efficient way of getting it.

Thanks.

UPDATE:

Following Tou You's answer, I use tf.math.count_nonzero to get the label_length, and I set logit_length to the length of the logit layer.

So the shapes inside the loss function are (batch size = 10):

y_true.shape = (10, None)
y_pred.shape = (10, None, 27)
label_length.shape = (10,1)
logit_length.shape = (10,1)

Of course, the 'None' dimensions of y_true and y_pred are not the same, since one is the maximum string length of the batch and the other is the maximum number of time frames of the batch. However, when I call model.fit() with a loss that calls tf.keras.backend.ctc_batch_cost() with those parameters, I get the error:

Traceback (most recent call last):
  File "train.py", line 164, in <module>
    model.fit(dataset, batch_size=batch_size, epochs=10)
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 66, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 848, in fit
    tmp_logs = train_function(iterator)
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
    result = self._call(*args, **kwds)
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 644, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2420, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1661, in _filtered_call
    return self._call_flat(
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1745, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 593, in call
    outputs = execute.execute(
  File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument:  Incompatible shapes: [10,92] vs. [10,876]
         [[node Equal (defined at train.py:164) ]]
  (1) Invalid argument:  Incompatible shapes: [10,92] vs. [10,876]
         [[node Equal (defined at train.py:164) ]]
         [[ctc_loss/Log/_62]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_3156]

Function call stack:
train_function -> train_function

It looks like it is complaining that the length of y_true (92) is not the same as the length of y_pred (876), which I did not expect to be a requirement. What am I missing?

Solution

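At least in recent versions of TensorFlow (2.2 and above), the Softmax layer supports masking. The outputs at the masked time steps are not zeros; TF just repeats the previous value. Note that the GRU has to be called on the output of the Masking layer (x), not on input_, otherwise the mask never reaches the GRU.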

import numpy as np
import tensorflow as tf

features = np.array([[1.5, 2.3, 4.6, 0.0, 0.0],
                     [1.7, 2.6, 3.4, 2.3, 1.0]])

input_ = tf.keras.Input(shape=(None,1))
x = tf.keras.layers.Masking()(input_)

x = tf.keras.layers.GRU(2, return_sequences=True)(x)

output_ = tf.keras.layers.Softmax(axis=-1)(x)

model = tf.keras.Model(input_,output_)

r = model(features)
print(r)

The output of the first sample has repeated values at the positions that correspond to the mask:

<tf.Tensor: shape=(2, 5, 2), dtype=float32, numpy=array([[[0.53308547, 0.46691453],
    [0.5477166 , 0.45228338],
    [0.55216545, 0.44783455],
    [0.55216545, 0.44783455],
    [0.55216545, 0.44783455]],

   [[0.532052  , 0.46794805],
    [0.54557794, 0.454422  ],
    [0.55263203, 0.44736794],
    [0.56076777, 0.4392322 ],
    [0.5722393 , 0.42776066]]], dtype=float32)>

To get the mask from the output (I'm using TF 2.2 and this works for me):

get_mask = r._keras_mask

The mask marks the non-masked time steps of each sequence, so counting its True values gives the number of valid frames (logit_length):

<tf.Tensor: shape=(2, 5), dtype=bool, numpy=
array([[ True,  True,  True, False, False],
       [ True,  True,  True,  True,  True]])>

You can get the label_length by counting the values in the tensor y_true that differ from zero (this assumes 0 is used only for padding in the labels):

label_length = tf.math.count_nonzero(y_true, axis=-1, keepdims=True)
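
As a quick check of the line above, here is a minimal sketch on a hand-made y_true. The label values are made up, and it assumes 0 never appears as a real label, since every zero is treated as padding.

```python
import tensorflow as tf

# Padded integer labels (made-up values); 0 marks padding.
y_true = tf.constant([[1., 2., 3., 0., 0.],
                      [1., 2., 3., 2., 1.]])

# Count the non-zero entries along the label axis, keeping shape (batch, 1).
label_length = tf.math.count_nonzero(y_true, axis=-1, keepdims=True)
print(label_length.numpy())
```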

For logit_length, all the implementations I have seen just return the number of time steps, so logit_length can be:

logit_length = tf.ones(shape=(your_batch_size, 1)) * time_step

Or you can use the mask tensor to count only the unmasked time steps:

logit_length = tf.reshape(
    tf.reduce_sum(tf.cast(y_pred._keras_mask, tf.float32), axis=1),
    (your_batch_size, -1))
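
Since _keras_mask is a private attribute and may not be available in every TensorFlow version, a hedged alternative is to recompute the same mask from the padded features. The sketch below assumes the padding value is 0.0 (the default mask_value of the Masking layer) and that no real frame consists only of zeros; the variable names are illustrative.

```python
import numpy as np
import tensorflow as tf

# Padded features, shape (batch, time, features).
features = np.array([[1.5, 2.3, 4.6, 0.0, 0.0],
                     [1.7, 2.6, 3.4, 2.3, 1.0]],
                    dtype=np.float32).reshape(2, 5, 1)

# A time step is kept if any of its features differs from the padding value.
mask = tf.reduce_any(tf.not_equal(features, 0.0), axis=-1)  # (batch, time)

# Number of valid time steps per sample, shape (batch, 1).
logit_length = tf.reduce_sum(tf.cast(mask, tf.int32), axis=1, keepdims=True)
print(logit_length.numpy())
```

Inside a Keras loss only y_true and y_pred are available, so lengths computed this way would have to be supplied to the loss separately.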

Here is a complete example:

import numpy as np
import tensorflow as tf

# Two utterances of 3 and 5 frames, zero-padded to 5 frames.
features = np.array([[1.5, 2.3, 4.6, 0.0, 0.0],
                     [1.5, 2.3, 4.6, 2.0, 1.0]]).reshape(2, 5, 1)
# Integer labels, zero-padded; 0 is used only for padding.
labels = np.array([[1., 2., 3., 0., 0.],
                   [1., 2., 3., 2., 1.]]).reshape(2, 5)

input_ = tf.keras.Input(shape=(5, 1))
x = tf.keras.layers.Masking()(input_)
# 5 is the number of classes + blank (in your case 26 + 1)
x = tf.keras.layers.GRU(5, return_sequences=True)(x)
output_ = tf.keras.layers.Softmax(axis=-1)(x)

model = tf.keras.Model(input_, output_)


def ctc_loss(y_true, y_pred):
    # Number of non-zero (non-padding) labels per sample, shape (batch, 1)
    label_length = tf.math.count_nonzero(y_true, axis=-1, keepdims=True)
    # Number of unmasked time steps per sample, shape (batch, 1).
    # _keras_mask is a private attribute; this worked for me with TF 2.2.
    logit_length = tf.reshape(
        tf.reduce_sum(tf.cast(y_pred._keras_mask, tf.float32), axis=1),
        (-1, 1))
    loss = tf.keras.backend.ctc_batch_cost(
        y_true, y_pred, logit_length, label_length)
    return tf.reduce_mean(loss)


model.compile(loss=ctc_loss, optimizer='adam')
model.fit(features, labels, epochs=10)
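
If relying on y_pred._keras_mask is not an option, the lengths can also be passed explicitly. The following is a minimal sketch using tf.nn.ctc_loss, which is a different API from the tf.keras.backend.ctc_batch_cost used above and expects unnormalized logits rather than softmax outputs. The random logits stand in for a network output, and all values are made up for illustration.

```python
import numpy as np
import tensorflow as tf

# Padded integer labels: 0 is used only for padding, real classes are 1..3.
labels = tf.constant([[1, 2, 0], [1, 2, 3]], dtype=tf.int32)
label_length = tf.constant([2, 3], dtype=tf.int32)

# Stand-in for a network output: (batch, time, classes) unnormalized logits.
num_classes = 5  # padding id 0, classes 1..3, blank as the last index
logits = tf.random.normal((2, 5, num_classes), seed=0)
logit_length = tf.constant([3, 5], dtype=tf.int32)  # valid frames per sample

# One loss value per sample; frames and labels beyond the given lengths
# are ignored.
per_example_loss = tf.nn.ctc_loss(
    labels=labels,
    logits=logits,
    label_length=label_length,
    logit_length=logit_length,
    logits_time_major=False,  # logits are (batch, time, classes)
    blank_index=-1,           # use the last class as the blank symbol
)
print(per_example_loss.shape)
```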