nlp, speech-recognition, state-machine, finite-state-automaton, automatic-speech-recognition

How does placing the output (word) labels on the initial transitions of the words in an FST lead to effective composition?


I am going through hbka.pdf (the WFST paper): https://cs.nyu.edu/~mohri/pub/hbka.pdf

A WFST figure for reference

Here the input label i, the output label o, and the weight w of a transition are marked on the corresponding directed arc as i:o/w.

It does not make sense to me how a transducer can output the entire word on the initial transition itself. If the entire word were output on the final transition, that would make sense to me.

Later I saw the following on page 19:

"In order for this transducer to efficiently compose with G, the output (word) labels must be placed on the initial transitions of the words; other locations would lead to delays in the composition matching, which could consume significant time and space."

ChatGPT answers that "placing the output labels on the initial transitions of the words in the word bigram transducer enables more efficient composition with another transducer by optimizing the matching and combination of transitions."

But how exactly does it happen?

"Placing the output labels on the initial transitions ensures that the word transitions in the Word bigram transducer align directly with the transitions in the other transducer."

But still, the entire word is something the finite-state transducer has to figure out from the phone input symbols (d, ey, dx, ax), so how can it be the output of the initial transition?
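To make the setup concrete, here is a toy sketch in plain Python (not OpenFst) of the single pronunciation path for "data" as drawn in the paper's figure. The phone symbols come from the figure; the state numbers are my own and the weights are omitted. It only shows where the word label sits.

```python
# Toy sketch of the pronunciation path for "data": the word label sits on
# the FIRST arc and the remaining arcs output only epsilon. Weights omitted.

EPS = "<eps>"

# Each arc: (source_state, input_phone, output_word, next_state)
data_path = [
    (0, "d",  "data", 1),   # the entire word is emitted here, on the initial transition
    (1, "ey", EPS,    2),
    (2, "dx", EPS,    3),
    (3, "ax", EPS,    4),   # state 4 is final
]

for src, phone, word, dst in data_path:
    print(f"{src} -> {dst}  {phone}:{word}")
```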


Solution

  • As far as I understand it, while the output is determined at the first transition, it is only actually produced once a final state is reached. So in a way the output is hypothesised, and the subsequent transitions are used to test the hypothesis. If no final state is reached, the output so far is discarded (see the sketch after this answer).

    One advantage is that multiple paths with identical output (the upper path in your example image) do not have to repeat the output on every final transition. Also, if you have inputs with similar endings you can merge the paths later, which can make the FST smaller. Imagine how many English words end in /ing/, /ed/ or /s/ -- these can all point to the same final states, but not if the output is generated at the end.

    I guess a further reason is that this makes it easier to manipulate the FST when it is combined with other FSTs: it is always easier to push the output generation further back when merging two FSTs than to deal with it when it is already at the end of the path.
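To illustrate the composition point with a toy example: the paths, the bigram grammar, and the pruning loop below are all made up for illustration and are not the paper's actual composition algorithm. The idea is that composition with the grammar G matches on the word (output) labels, so if the word label is on the first arc, a path that G disallows can be abandoned after a single arc; if the label only appears on the last arc, the entire phone path has to be expanded before G can accept or reject it, which is the "delay in the composition matching" the paper mentions.

```python
# Same pronunciation path for "data", with the word label either on the
# first arc or on the last arc. Each arc: (input_phone, output_word_or_None).
label_first = [("d", "data"), ("ey", None), ("dx", None), ("ax", None)]
label_last  = [("d", None),   ("ey", None), ("dx", None), ("ax", "data")]

# Toy bigram grammar: which words may follow the previous word.
# In this toy grammar, "data" never follows "the".
G = {"the": {"dog", "cat"}}

def arcs_expanded_before_pruning(path, prev_word):
    """Walk the path arc by arc; stop as soon as an output label lets G reject it."""
    expanded = 0
    for phone, word in path:
        expanded += 1
        if word is not None and word not in G.get(prev_word, set()):
            break          # composition can prune this path here
    return expanded

print(arcs_expanded_before_pruning(label_first, "the"))  # 1 -> pruned immediately
print(arcs_expanded_before_pruning(label_last,  "the"))  # 4 -> whole path expanded first
```

With many words and many pronunciations per word, that difference in how early paths can be pruned is what saves the "significant time and space" the paper refers to.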