speech-recognition speech-to-text hidden-markov-models mfcc htk

HTK - What do MFCCs of an HMM model and Input WAV File represent?


While creating MFCCs following VoxForge's tutorial for a speech-to-text system using HTK (the Hidden Markov Model Toolkit), we are required to define a prototype model for our phones. I am trying to wrap my head around this file.

~o <VecSize> 25 <MFCC_0_D_N_Z>
~h "proto"
<BeginHMM>
  <NumStates> 5
  <State> 2
    <Mean> 25
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0  
    <Variance> 25
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 
 <State> 3
    <Mean> 25
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0  
    <Variance> 25
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
 <State> 4
    <Mean> 25
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
    <Variance> 25
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
 <TransP> 5
  0.0 1.0 0.0 0.0 0.0
  0.0 0.6 0.4 0.0 0.0
  0.0 0.0 0.6 0.4 0.0
  0.0 0.0 0.0 0.7 0.3
  0.0 0.0 0.0 0.0 0.0
<EndHMM>

In this case, we are using a feature vector of length 25 to represent every state of the HMM. However, I don't quite understand why we have 25 "Means" and "Variances" for every state. Do they represent the mean and variance of every feature vector?

Furthermore, why do we have only 3 states defined when <NumStates> is 5? Are <State> 1 and <State> 5 simply entry and exit points, so they do not require a Mean and Variance?

Also, I printed the MFCCs of some sample WAV files, which displayed as below:

    0:     -15.769  -2.168   8.605   4.979   5.283   1.012   9.631  -0.619   3.622  10.977
             5.733   3.260  44.447  -0.153  -0.281  -0.810  -1.176   0.363  -0.658   0.676
            -1.569   1.363  -1.221   0.815  -0.759   1.427
    1:     -18.345  -3.220   7.177   0.293   7.232   3.111  17.942  -6.957   8.197   6.579
             9.102  -0.569  49.537   0.378  -0.337  -1.277  -1.709   0.623  -0.450   0.162
             0.315   2.088  -1.175   0.624   0.762   1.018
    2:     -15.244  -3.046   5.269   1.441   6.121  -3.326   8.854  -5.297   8.151   7.072
             8.122   1.379  49.036   0.543  -0.119  -1.162  -1.263   1.261  -0.388  -0.234
             0.816   1.195  -1.237  -0.288   1.600   0.244
    3:     -14.143  -3.413   3.887  -1.796   7.981   0.930  10.826   3.294  11.797   7.055
             7.661   8.011  47.243   0.613  -0.020  -0.568  -0.364   1.034  -0.165  -0.812
             2.525   0.351  -1.670  -1.086   1.493  -0.716
    4:     -15.156  -2.669   4.440  -0.293  11.213   0.162  12.020  -1.667   7.794   4.553
             5.013   6.968  46.813  -0.050  -0.092  -0.050  -0.329   0.325   0.585   0.751
             1.253  -0.008  -1.852  -0.845   0.058  -0.430
    5:     -15.323  -3.510   4.750  -0.660   9.856   0.545  12.301   3.855  10.132  -0.511
             5.224   4.104  47.068   0.073   0.151   0.163  -0.180  -0.186  -0.242  -0.335
            -0.577  -0.479  -0.745  -0.167  -1.565   0.013

For every "window", why do we have 26 coefficients instead of 25? What do they all represent? I believe:

But I have no idea what the 13th number in each of these samples represents. They should be in the format <MFCC_0_D_N_Z> as defined in the prototype file shown at the beginning, which is not explained well in the HTK Manual. But I can gather from page 80 of the Manual that:

Any explanations would be appreciated.


Solution

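  • Furthermore, why do we have only 3 states defined when <NumStates> is 5? Are 1 and 5 simply entry and exit points, so they do not require a Mean and Variance?

    Yes. States 1 and 5 are non-emitting entry and exit states. They exist only to connect models to each other, so they have no output distribution and therefore no Mean or Variance.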
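
As a rough illustration in plain Python (not HTK code; the variable names are made up for this sketch), the transition matrix from the prototype can be inspected to see why only states 2 to 4 carry a Mean and Variance. State 1 only hands over to state 2, and state 5 has no outgoing transitions at all.

```python
# Transition matrix copied from the <TransP> 5 block of the prototype.
trans_p = [
    [0.0, 1.0, 0.0, 0.0, 0.0],  # state 1: entry, always moves to state 2
    [0.0, 0.6, 0.4, 0.0, 0.0],  # state 2: stay (0.6) or advance (0.4)
    [0.0, 0.0, 0.6, 0.4, 0.0],  # state 3: stay (0.6) or advance (0.4)
    [0.0, 0.0, 0.0, 0.7, 0.3],  # state 4: stay (0.7) or leave the model (0.3)
    [0.0, 0.0, 0.0, 0.0, 0.0],  # state 5: exit, no outgoing transitions
]

num_states = len(trans_p)

# Assumption of this sketch: the first and last states are the non-emitting
# ones, so only the states in between need an output distribution.
emitting_states = list(range(2, num_states))

# Every state except the exit state should have outgoing probabilities
# that sum to 1; the exit row is all zeros.
row_sums = [sum(row) for row in trans_p]

print("emitting states:", emitting_states)
for state, total in enumerate(row_sums, start=1):
    print(f"state {state}: outgoing probability mass = {total:.1f}")
```

The self-loop probabilities (0.6, 0.6, 0.7) are what let a single state absorb several consecutive frames.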

    For every "window", why do we have 26 coefficients instead of 25? What do they all represent? I believe:

    The feature type stored in the file is MFCC_0_D, as in step 5 of the tutorial, so those are 13 cepstral coefficients (12 MFCCs followed by C0, which is the 13th number) and their 13 deltas. You can also use HList -o -h to print the exact layout:

    ---------------------------------- Source: ar-03.mfc -----------------------------------
      Sample Bytes:  52       Sample Kind:   MFCC_D_C_K_0
      Num Comps:     26       Sample Period: 10000.0 us
      Num Samples:   648      File Format:   HTK
    -------------------------------- Observation Structure ---------------------------------
    x:      MFCC-1  MFCC-2  MFCC-3  MFCC-4  MFCC-5  MFCC-6  MFCC-7  MFCC-8  MFCC-9 MFCC-10
           MFCC-11 MFCC-12      C0   Del-1   Del-2   Del-3   Del-4   Del-5   Del-6   Del-7
             Del-8   Del-9  Del-10  Del-11  Del-12   DelC0
    

    The feature type stored in the mfc file can differ from the feature type used in HMM training: the HMM features are computed from the mfc data on the fly, according to the proto specification. On disk you have 26 MFCC_0_D coefficients. At training time they are converted to the 25 MFCC_0_D_N_Z coefficients by dropping the absolute C0 term (_N), which leaves 12 MFCCs and 13 deltas, and by subtracting the cepstral mean (_Z).
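
The bookkeeping behind the 26-to-25 conversion can be sketched in plain Python. This is only an illustration of which columns survive, not HTK's actual implementation: the function name is hypothetical, and the mean here is taken over just the first two frames printed in the question, whereas a real mean would be taken over many more frames.

```python
# Column layout of a 26-component MFCC_0_D frame, as listed by HList above.
LAYOUT_26 = (
    [f"MFCC-{i}" for i in range(1, 13)] + ["C0"]
    + [f"Del-{i}" for i in range(1, 13)] + ["DelC0"]
)

# Frames 0 and 1 from the question.
frames = [
    [-15.769, -2.168, 8.605, 4.979, 5.283, 1.012, 9.631, -0.619, 3.622, 10.977,
     5.733, 3.260, 44.447, -0.153, -0.281, -0.810, -1.176, 0.363, -0.658, 0.676,
     -1.569, 1.363, -1.221, 0.815, -0.759, 1.427],
    [-18.345, -3.220, 7.177, 0.293, 7.232, 3.111, 17.942, -6.957, 8.197, 6.579,
     9.102, -0.569, 49.537, 0.378, -0.337, -1.277, -1.709, 0.623, -0.450, 0.162,
     0.315, 2.088, -1.175, 0.624, 0.762, 1.018],
]

def to_mfcc_0_d_n_z(frames_26):
    """Hypothetical helper: reduce 26-value MFCC_0_D frames to 25 values."""
    num_static = 13  # 12 MFCCs + C0
    count = len(frames_26)
    # _Z: subtract the mean of each static coefficient (here over 2 frames).
    means = [sum(f[i] for f in frames_26) / count for i in range(num_static)]
    c0 = LAYOUT_26.index("C0")  # position 12, i.e. the 13th number
    converted = []
    for f in frames_26:
        normalized = [v - means[i] if i < num_static else v
                      for i, v in enumerate(f)]
        # _N: drop the absolute C0 term but keep its delta (DelC0).
        converted.append(normalized[:c0] + normalized[c0 + 1:])
    return converted

layout_25 = [name for name in LAYOUT_26 if name != "C0"]
frames_25 = to_mfcc_0_d_n_z(frames)
print(len(frames_25[0]), layout_25)
```

The resulting 25 columns are the 12 MFCCs, their 12 deltas and DelC0, which matches the <VecSize> 25 in the prototype.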

    I don't quite understand why we have 25 "Means" and "Variances" for every state. Do they represent the Mean and Variance of every Feature Vector?

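    The means and variances are the parameters of the Gaussian output (emission) distribution of each HMM state. With a 25-dimensional feature vector, each state has one mean and one variance per dimension, hence 25 of each. They are not the mean and variance of a single feature vector, and the 0.0 and 1.0 values in the prototype are only initial values that training re-estimates. It is worth reading up on how an HMM works.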
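
To make the role of the 25 means and 25 variances concrete, here is a minimal sketch of how one emitting state could score one feature vector, assuming a single Gaussian with diagonal covariance (one variance per dimension, as in the prototype). The function name is made up for the example and HTK's own code is organised differently.

```python
import math

def diag_gaussian_log_likelihood(x, mean, variance):
    """Log density of x under a Gaussian with diagonal covariance."""
    assert len(x) == len(mean) == len(variance)
    total = 0.0
    for xi, mi, vi in zip(x, mean, variance):
        # Each dimension contributes independently: this is why a state
        # needs exactly one mean and one variance per feature dimension.
        total += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return total

vec_size = 25                        # <VecSize> 25 from the prototype
proto_mean = [0.0] * vec_size        # <Mean> 25 row of a state
proto_variance = [1.0] * vec_size    # <Variance> 25 row of a state

# A made-up observation that sits exactly at the state's mean.
observation = [0.0] * vec_size
log_b = diag_gaussian_log_likelihood(observation, proto_mean, proto_variance)
print(log_b)
```

During recognition, each emitting state produces such a score for every frame, and the decoder combines these scores with the transition probabilities.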