I just created a language model from a short text file. I did this for both English and Dutch, primarily to reduce recognition times by decreasing the number of possibilities. I created both using the Sphinx toolkit and the sphinxbase LM-to-binary converter. The Dutch language model can be found here: http://pastebin.com/txkxiAc6 The English one can be found here: http://pastebin.com/fr3Epj5b They are both small, but the English one recognizes everything it needs to recognize.
The Dutch one uses the Dutch Voxforge pack and dictionary. The English one uses cmusphinx-en-us-8khz-5.2.tar.gz and the default dictionary from pocketsphinx.
The code looks like this:
public static void main(String[] args) throws Exception {
    // configuration, context, recognizer and language are assumed to be static fields
    configuration = new Configuration();
    configuration.setAcousticModelPath("src/main/resources/" + language + "/model");
    configuration.setDictionaryPath("src/main/resources/" + language + "/dict.dict");
    configuration.setLanguageModelPath("src/main/resources/" + language + "/model.lm.bin");
    context = new Context(configuration);
    recognizer = context.getInstance(Recognizer.class);
    recognizer.allocate();
    // ---------- get the input stream and send it to the method ----------
    RecognizeText(inputstream, outputstream);
}

private static String RecognizeText(InputStream stream, OutputStream os) throws Exception {
    context.setSpeechSource(stream, TimeFrame.INFINITE);
    Result result;
    // Returns the hypothesis of the first result, or "" if there is none
    while ((result = recognizer.recognize()) != null) {
        SpeechResult speechResult = new SpeechResult(result);
        return speechResult.getHypothesis();
    }
    return "";
}
The 'language' variable can be set to Dutch or English for the correct language. English works, but Dutch doesn't. Where is my error? I can't seem to find it.
The Dutch Acoustic Model folder contains the following:
feat.params
mdef
means
mixture_weights
noisedict
transition_matrices
variances
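For reference, the folder listing above can be checked programmatically. The following is only a minimal sketch: the class name ModelFolderCheck, the method missingFiles, and the default path are hypothetical names chosen for illustration, and the list of expected files is simply the listing above, not an authoritative statement of what Sphinx requires. It only confirms that the files are present, not that their contents are valid.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ModelFolderCheck {
    // File names copied from the folder listing in the post (an assumption,
    // not an official list of required acoustic model files).
    static final List<String> EXPECTED_FILES = Arrays.asList(
            "feat.params", "mdef", "means", "mixture_weights",
            "noisedict", "transition_matrices", "variances");

    /** Returns the expected file names that are not regular files in modelDir. */
    static List<String> missingFiles(Path modelDir) {
        List<String> missing = new ArrayList<>();
        for (String name : EXPECTED_FILES) {
            if (!Files.isRegularFile(modelDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Hypothetical default path, mirroring the layout used in the code above.
        Path modelDir = Paths.get(
                args.length > 0 ? args[0] : "src/main/resources/Dutch/model");
        List<String> missing = missingFiles(modelDir);
        if (missing.isEmpty()) {
            System.out.println("All expected model files found in " + modelDir);
        } else {
            System.out.println("Missing from " + modelDir + ": " + missing);
        }
    }
}
```

Running it with the model folder as the only argument prints either a confirmation or the names that are missing, which at least rules out an incomplete copy of the acoustic model.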
The Dutch model was very old; it had not been updated for 5 years. I've just uploaded a new model to the CMUSphinx website:
https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/Dutch/
It should be more accurate, but it is still trained on only 13 hours of data. English models are trained on 1000+ hours. We need more transcribed Dutch data.