speech-recognition speech-to-text speech labeling ctc

CTC: What is the difference between space and blank?


In the 2006 article about Connectionist Temporal Classification, Alex Graves & co. introduced a model of decoding speech with 27 labels: 26 for the alphabet letters and one for blank, meaning no label (which I understand to be silence).

However, I am seeing a lot of implementations of CTC that use 28 labels, one being the blank and another one being space. So far, I haven't been able to find an explanation for the need to use both these labels and, to me, they represent the same thing.

Could you please explain the difference between blank and space in the context of CTC and why there's a need for both these labels?


Solution

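  • In Connectionist Temporal Classification, space is an ordinary label: the whitespace character that separates words in the output text. Blank is a special pseudo-label, written '-' here, that does not stand for silence or for any output character. It is needed to separate repeated characters, because consecutive identical labels are merged during decoding. For example, "pizza" can be encoded as "piz-za"; without the blank, "pizza" would decode to "piza".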

    Longer explanation:

    ref: https://towardsdatascience.com/beam-search-decoding-in-ctc-trained-neural-networks-5a889a3d85a7

    In CTC there is an issue of how to encode duplicate characters. It is solved by introducing a pseudo-character (called blank, but don’t confuse it with a “real” blank, i.e. a white-space character). This special character will be denoted as “-” in the text.

    We use a clever coding schema to solve the duplicate-character problem: when encoding a text, we can insert arbitrary many blanks at any position, which will be removed when decoding it. However, we must insert a blank between duplicate characters like in “hello”. Further, we can repeat each character as often as we like.

    Let’s look at some examples:

        “to” → “---ttttttooo”, or “-t-o-”, or “to”
        “too” → “---ttttto-o”, or “-t-o-o-”, or “to-o”, but not “too”

    As you see, this schema also allows us to easily create different alignments of the same text, e.g. “t-o” and “too” and “-to” all represent the same text (“to”), but with different alignments to the image. The NN is trained to output an encoded text (encoded in the NN output matrix).
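The encoding rules above can be sketched in a few lines of Python. This is only an illustration, not code from the paper or from any particular library: the names `BLANK`, `LABELS`, `ctc_collapse` and `ctc_minimal_encoding` are made up for this example, and the blank is written as "-" as in the quoted text. It assumes a 28-label alphabet (blank, space, and 26 lowercase letters) like the implementations mentioned in the question.

```python
from itertools import groupby
import string

# Hypothetical label set: the blank pseudo-label, the space character,
# and the 26 lowercase letters, 28 labels in total.
BLANK = "-"
LABELS = [BLANK, " "] + list(string.ascii_lowercase)


def ctc_collapse(path: str) -> str:
    """Decode a path: merge runs of identical labels, then drop blanks."""
    merged = (label for label, _ in groupby(path))
    return "".join(label for label in merged if label != BLANK)


def ctc_minimal_encoding(text: str) -> str:
    """Shortest valid path: a blank is inserted only between adjacent duplicates."""
    out = []
    for ch in text:
        if out and out[-1] == ch:
            out.append(BLANK)  # keeps the duplicate from being merged away
        out.append(ch)
    return "".join(out)


print(ctc_collapse("---ttttttooo"))     # to
print(ctc_collapse("---ttttto-o"))      # too
print(ctc_collapse("too"))              # to (the repeated "o" is merged)
print(ctc_collapse("hh-i-  -mm-oo-m"))  # hi mom (the space survives decoding)
print(ctc_minimal_encoding("pizza"))    # piz-za
```

The "hi mom" path shows the difference between the two labels: the blanks disappear during decoding, while the space is treated like any other character and stays in the output.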