nlp, speech-recognition, kaldi

Kaldi's objects explained in layman's terms


I am trying to understand the internal workings of Kaldi, but I am having trouble understanding the technical details in Kaldi's documentation.

I want a high-level understanding of the various objects first, to help me digest what is presented. Specifically, I would like to know what the .tree, final.mdl, and HCLG.fst files are, what is needed to generate them, and how they are used.

Vaguely I understand that (please correct me if I am wrong):

I understand there is a lot to cover but any help is appreciated!


Solution

  • It is better to ask one question at a time. It also helps to read the book to understand the theory first instead of trying to grasp everything at once.

    final.mdl is the acoustic model and contains the probability of transitioning from one phone to another

    The main component of the acoustic model final.mdl is the set of acoustic detectors, not the transition probabilities. It is either a set of GMMs for phones or a neural network. The acoustic model also contains the transition probabilities from one HMM state to another, which together form the HMM for a single phone. The transition probabilities between phones are encoded in the graph HCLG.fst.
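    The two ingredients described above can be sketched in a few lines of Python. This is only a toy illustration with made-up numbers and hypothetical names (`diag_gmm_loglike`, `transitions`); it does not read final.mdl or use any Kaldi API. It scores one feature frame with a diagonal-covariance GMM standing in for the acoustic detector of a single HMM state, and keeps that state's transition probabilities separately.

```python
import math


def diag_gmm_loglike(frame, weights, means, variances):
    """Log-likelihood of one feature frame under a diagonal-covariance GMM."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for x, m, v in zip(frame, mu, var):
            # Log density of a 1-D Gaussian, summed over feature dimensions.
            log_p += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        component_logs.append(log_p)
    # Log-sum-exp over the mixture components, done stably.
    top = max(component_logs)
    return top + math.log(sum(math.exp(l - top) for l in component_logs))


# Hypothetical toy model for ONE HMM state: 2-dim features, 2 Gaussians.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

# Transition probabilities out of that same state (assumed values):
# stay in the state, or move on to the next state of the phone.
transitions = {"self_loop": 0.7, "next_state": 0.3}

near = diag_gmm_loglike([0.1, -0.2], weights, means, variances)
far = diag_gmm_loglike([10.0, 10.0], weights, means, variances)
print("frame near the model:", near)
print("frame far from the model:", far)
```

    A frame that lies close to one of the Gaussian means gets a higher log-likelihood than a frame far from both, which is the sense in which the state "detects" its sound.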

    HCLG.fst is a graph that, given a sequence of phones, generates the most likely word sequence based on the lexicon, grammar, and language model.

    Not quite. HCLG.fst is a finite-state transducer that gives you the probability of a state sequence based on the lexicon and language model. Phone sequences are not really used in the graph; they are accounted for during graph construction.
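    To make the "state sequence in, words and score out" idea concrete, here is a minimal sketch of a toy weighted transducer. Everything in it is an assumption made for illustration: the labels (`hi_s1`, `yo_s2`, ...), the costs, and the `transduce` helper are invented, and the real HCLG.fst is an OpenFst object whose input labels are integer identifiers tied to HMM states rather than strings. Costs here play the role of negative log probabilities, so lower is better.

```python
EPS = "<eps>"  # "no output on this arc"

# (source state, input label) -> (destination state, output label, cost)
# Input labels stand for HMM states; output labels are words.
ARCS = {
    (0, "hi_s1"): (1, "hi", 0.5),    # entering the word emits it once
    (1, "hi_s1"): (1, EPS, 0.25),    # self-loop: stay in the same HMM state
    (1, "hi_s2"): (2, EPS, 0.5),
    (2, "hi_s2"): (2, EPS, 0.25),    # self-loop
    (0, "yo_s1"): (3, "yo", 1.0),
    (3, "yo_s1"): (3, EPS, 0.25),    # self-loop
    (3, "yo_s2"): (4, EPS, 0.5),
    (4, "yo_s2"): (4, EPS, 0.25),    # self-loop
}
FINAL_STATES = {2, 4}


def transduce(state_sequence, start=0):
    """Follow the arcs for a sequence of input labels.

    Returns (words, total_cost), or None if the graph rejects the sequence.
    """
    state, words, cost = start, [], 0.0
    for label in state_sequence:
        if (state, label) not in ARCS:
            return None
        state, output, arc_cost = ARCS[(state, label)]
        cost += arc_cost
        if output != EPS:
            words.append(output)
    if state not in FINAL_STATES:
        return None
    return words, cost


print(transduce(["hi_s1", "hi_s1", "hi_s2"]))
print(transduce(["hi_s1", "yo_s2"]))
```

    Note that no phone labels appear anywhere in the toy graph: the phone level has been compiled away into the arcs, which mirrors the point that phones are accounted for during graph construction.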

    I am not quite sure what adding a self-loop is; is it similar to the Kleene operator?

    A speech HMM has a self-loop on every state, which allows the state to last for several input frames. You can find the HMM topology in the book to see the loops.
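    The effect of self-loops can be seen with a small sketch. It assumes a plain left-to-right topology in which each state may either repeat (the self-loop) or advance to the next state; the function names and the 0.75 self-loop probability are made up for illustration and are not taken from any Kaldi topology file.

```python
def alignments(num_states, num_frames):
    """All frame-by-frame state sequences through a left-to-right HMM.

    Each sequence starts in state 0, ends in the last state, and at every
    frame either repeats the current state (self-loop) or moves one forward.
    """
    def extend(path):
        if len(path) == num_frames:
            if path[-1] == num_states - 1:
                yield tuple(path)
            return
        current = path[-1]
        yield from extend(path + [current])          # take the self-loop
        if current + 1 < num_states:
            yield from extend(path + [current + 1])  # advance

    if num_frames >= 1:
        yield from extend([0])


def expected_frames(self_loop_prob):
    """Mean number of frames spent in a state (geometric distribution)."""
    return 1.0 / (1.0 - self_loop_prob)


# With exactly 3 frames, a 3-state phone has a single possible alignment.
print(list(alignments(3, 3)))
# With 5 frames, the self-loops let states stretch in several ways.
for path in alignments(3, 5):
    print(path)
print(expected_frames(0.75))
```

    Without self-loops only the first case would exist, so a three-state phone could never cover more than three frames.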

    A lattice contains alternative word sequences for an utterance.

    This is correct, but it also contains timing information as well as acoustic and language model scores.
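    As a rough picture of what that means, the sketch below models a lattice as a small acyclic graph whose arcs carry a word, a frame range, and two separate costs. The `Arc` layout, the words, and all numbers are invented for illustration; this is not Kaldi's lattice format, only the same kind of information in a simplified form.

```python
from collections import namedtuple

Arc = namedtuple(
    "Arc", "src dst word start_frame end_frame acoustic_cost lm_cost"
)

# Toy lattice: state 0 is the start, state 3 is final. Costs are assumed.
LATTICE = [
    Arc(0, 1, "recognize", 0, 40, 20.0, 3.0),
    Arc(0, 2, "wreck", 0, 20, 11.0, 5.0),
    Arc(2, 1, "a", 20, 40, 10.0, 1.5),
    Arc(1, 3, "speech", 40, 70, 15.0, 2.0),
    Arc(1, 3, "beach", 40, 70, 16.0, 4.0),
]


def all_paths(arcs, start, final):
    """Every word sequence in the lattice with its total cost, best first."""
    outgoing = {}
    for arc in arcs:
        outgoing.setdefault(arc.src, []).append(arc)

    def walk(state):
        if state == final:
            yield [], 0.0
            return
        for arc in outgoing.get(state, []):
            for words, cost in walk(arc.dst):
                yield [arc.word] + words, arc.acoustic_cost + arc.lm_cost + cost

    return sorted(walk(start), key=lambda item: item[1])


for words, cost in all_paths(LATTICE, 0, 3):
    print(cost, " ".join(words))
```

    Because the acoustic and language model costs are stored separately on each arc, the alternatives can be re-ranked later with a different weighting, and the frame ranges say when each word was hypothesized.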