audio, deep-learning, artificial-intelligence, lstm, mfcc

I want to understand the 'd-vector' for speaker diarization


When segmented speech audio is fed into a DNN model, I understand that the 'd-vector' is the average of the features extracted from the last hidden layer. If so, can a d-vector be extracted even for a speaker whose voice was not seen during training? Building on that, if I feed in segmented features (mel-filterbank or MFCC) from an audio file spoken by multiple people, can I distinguish the speakers by clustering the extracted d-vectors as described above?


Solution

  • To answer your questions:

    1. After you train the model, you can get the d-vector simply by forward-propagating the input features through the network. Normally you look at the output (final) layer of the ANN, but you can equally retrieve the activations of the penultimate (d-vector) layer. This works for any input, including speakers the network never saw during training, which is what makes the d-vector usable as a speaker embedding. See the first sketch after this list.

    2. Yes, you can distinguish speakers with the d-vector: it is effectively a high-level embedding of the audio signal that captures speaker-specific characteristics, so d-vectors from different speakers tend to lie far apart. Clustering the per-segment d-vectors then groups segments by speaker; see the second sketch below. See e.g. this paper.
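
A minimal sketch of point 1, assuming a small PyTorch feed-forward network trained for speaker classification on stacked MFCC frames. The class name `SpeakerDNN`, the layer sizes, and the feature dimensions are illustrative assumptions, not something from the original answer. The d-vector comes from forward-propagating a segment's frames through the hidden layers only, averaging over frames, and length-normalising:

```python
import torch
import torch.nn as nn

class SpeakerDNN(nn.Module):
    """Hypothetical speaker-classification DNN; sizes are illustrative."""

    def __init__(self, n_mfcc=40, context=21, hidden=256, n_speakers=100):
        super().__init__()
        self.hidden_layers = nn.Sequential(
            nn.Linear(n_mfcc * context, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),   # last hidden layer -> d-vector source
        )
        self.classifier = nn.Linear(hidden, n_speakers)  # only needed during training

    def forward(self, x):
        return self.classifier(self.hidden_layers(x))

    def d_vector(self, frames):
        # frames: (num_frames, n_mfcc * context) stacked features of one speech segment.
        # Forward-propagate, keep the last hidden layer, average over frames, normalise.
        with torch.no_grad():
            activations = self.hidden_layers(frames)
        d = activations.mean(dim=0)
        return d / d.norm()
```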
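And a sketch of point 2, assuming `d_vectors` is an array with one embedding per speech segment (for example, produced by the hypothetical `SpeakerDNN.d_vector` above) and that the number of speakers is known in advance. Scikit-learn's k-means is used here for simplicity; spectral or agglomerative clustering are common alternatives in diarization:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_speakers(d_vectors: np.ndarray, n_speakers: int) -> np.ndarray:
    """Assign a speaker label to every segment by clustering its d-vector."""
    # d_vectors: (n_segments, embedding_dim), ideally length-normalised.
    kmeans = KMeans(n_clusters=n_speakers, n_init=10, random_state=0)
    return kmeans.fit_predict(d_vectors)

# Example: two speakers in the recording, one label per segment.
# labels = cluster_speakers(d_vectors, n_speakers=2)
```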