Tags: python, tensorflow

How does tf.nn.ctc_greedy_decoder generate output sequences in TensorFlow?


Given the logits (the output of an RNN/LSTM/GRU in time-major format, i.e. (max_time, batch_size, num_classes)), how does the CTC greedy decoder perform decoding to generate the output sequence?

I found the description "Performs greedy decoding on the logits given in input (best path)" on its documentation page: https://www.tensorflow.org/api_docs/python/tf/nn/ctc_greedy_decoder.

One possibility is to select the output class with the maximum value at each time step, collapse repetitions, and generate the corresponding output sequence. Is this what the CTC greedy decoder is doing, or something else? An explanation using an example would be very helpful.


Solution

  • The operation ctc_greedy_decoder implements best path decoding, which is also stated in the TF source code [1].

    Decoding is done in two steps:

    1. Concatenate the most probable character per time-step, which yields the best path.
    2. Then, undo the encoding by first removing duplicate characters and then removing all blanks. This gives us the recognized text.

    Let's look at an example. The neural network outputs a matrix with 5 time-steps and 3 characters ("a", "b" and the blank "-"). We take the most likely character per time-step, which gives us the best path: "aaa-b". Then, we remove repeated characters and get "a-b". Finally, we remove all blanks and get "ab" as the result.

    *(figure: best path decoding)*
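    The two decoding steps above can be sketched in plain Python/NumPy. The logit values below are hypothetical, chosen so that the per-time-step argmax reproduces the path "aaa-b" from the example; note that `tf.nn.ctc_greedy_decoder` itself operates on time-major batched logits and returns a SparseTensor, while this sketch handles a single sequence for clarity:

    ```python
    import numpy as np

    # Hypothetical scores for 5 time-steps over 3 classes: "a", "b", blank "-".
    chars = ['a', 'b', '-']
    blank = 2  # index of the blank label
    logits = np.array([
        [0.8, 0.1, 0.1],  # -> "a"
        [0.7, 0.2, 0.1],  # -> "a"
        [0.6, 0.1, 0.3],  # -> "a"
        [0.1, 0.2, 0.7],  # -> "-"
        [0.2, 0.7, 0.1],  # -> "b"
    ])

    # Step 1: take the most probable class per time-step (the best path).
    best_path = np.argmax(logits, axis=1)
    print(''.join(chars[i] for i in best_path))  # "aaa-b"

    # Step 2: collapse repeated labels, then drop all blanks.
    collapsed = [l for i, l in enumerate(best_path)
                 if i == 0 or l != best_path[i - 1]]
    decoded = [l for l in collapsed if l != blank]
    print(''.join(chars[i] for i in decoded))  # "ab"
    ```

    Collapsing repeats *before* removing blanks is what makes the blank useful: it lets the encoding distinguish a genuine double letter (e.g. "a-a" decodes to "aa") from a repeated prediction of the same character.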

    More information about CTC can be found in [2] and an example on how to use it in Python is shown in [3].


    [1] Implementation of ctc_greedy_decoder: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/util/ctc/ctc_decoder.h#L96

    [2] Further information about CTC, best path decoding and beam search decoding: https://harald-scheidl.medium.com/beam-search-decoding-in-ctc-trained-neural-networks-5a889a3d85a7

    [3] Sample code which shows how to use ctc_greedy_decoder: https://github.com/githubharald/SimpleHTR/blob/master/src/model.py#L129