java, cmusphinx, sphinx4

Sphinx4 Dutch Language Model not working


I just created a language model from a short text file. I did this for both English and Dutch, primarily to reduce recognition time by limiting the possibilities. I created both using the Sphinx toolkit and the sphinxbase LM-to-binary converter. The Dutch language model can be found here: http://pastebin.com/txkxiAc6 The English one can be found here: http://pastebin.com/fr3Epj5b They are both small, but the English one recognizes everything it needs to recognize.

The Dutch one uses the Dutch Voxforge pack and dictionary. The English one uses cmusphinx-en-us-8khz-5.2.tar.gz and the default dictionary from pocketsphinx.
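Because the Dutch setup pairs a custom language model with the VoxForge dictionary, one thing worth ruling out is a vocabulary mismatch, where words in the language model text have no entry in the dictionary. The following is a minimal sketch of such a check, not part of Sphinx4. It assumes a CMU-style dictionary (one entry per line, the word first, then its phones, with alternates written as word(2)), a whitespace-tokenized training text, and UTF-8 encoding for both files. The class name and the command-line arguments are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Sphinx4.
class DictionaryCoverage {

    /** Collects head words from CMU-style dictionary lines ("word PH1 PH2 ..."). */
    public static Set<String> dictionaryWords(List<String> dictLines) {
        Set<String> words = new HashSet<>();
        for (String line : dictLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String head = trimmed.split("\\s+")[0];
            // Alternate pronunciations are written as word(2), word(3), ...
            words.add(head.replaceAll("\\(\\d+\\)$", ""));
        }
        return words;
    }

    /** Returns, in sorted order, the tokens of the text that have no dictionary entry. */
    public static Set<String> missingWords(List<String> textLines, Set<String> dict) {
        Set<String> missing = new TreeSet<>();
        for (String line : textLines) {
            for (String token : line.trim().split("\\s+")) {
                // Exact, case-sensitive comparison
                if (!token.isEmpty() && !dict.contains(token)) {
                    missing.add(token);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: dictionary file, args[1]: text the language model was built from
        // Assumption: both files are UTF-8 encoded.
        Set<String> dict = dictionaryWords(
                Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8));
        Set<String> missing = missingWords(
                Files.readAllLines(Paths.get(args[1]), StandardCharsets.UTF_8), dict);
        System.out.println("Words without a dictionary entry: " + missing);
    }
}
```

The comparison is case-sensitive on purpose, so a text containing "Hallo" will be reported if the dictionary only lists "hallo". Sentence markers such as <s> in the training text would also be reported and can be ignored. An empty result only rules out this one cause; it says nothing about the quality of the acoustic model.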

The code looks like this:

// configuration, context, recognizer and language are static fields (declarations omitted)
public static void main(String[] args) throws Exception {
    configuration = new Configuration();
    configuration.setAcousticModelPath("src/main/resources/" + language + "/model");
    configuration.setDictionaryPath("src/main/resources/" + language + "/dict.dict");
    configuration.setLanguageModelPath("src/main/resources/" + language + "/model.lm.bin");
    context = new Context(configuration);
    recognizer = context.getInstance(Recognizer.class);
    recognizer.allocate();

    // ---------- GET INPUT STREAM AND SEND TO METHOD ----------

    recognizeText(inputstream, outputstream);
}

private static String recognizeText(InputStream stream, OutputStream os) throws Exception {
    context.setSpeechSource(stream, TimeFrame.INFINITE);
    Result result;
    // Returns the hypothesis of the first recognized utterance only
    while ((result = recognizer.recognize()) != null) {
        SpeechResult speechResult = new SpeechResult(result);
        return speechResult.getHypothesis();
    }
    return "";
}

The 'language' variable can be set to Dutch or English for the correct language. English works, but Dutch doesn't. Where is my error? I can't seem to find it.

The Dutch Acoustic Model folder contains the following:

feat.params
mdef
means
mixture_weights
noisedict
transition_matrices
variances
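As a quick sanity check before calling recognizer.allocate(), the directory can be compared against the listing above. This is a minimal sketch using only the Java standard library. The class name is hypothetical, the default path merely mirrors the layout used in the question, and the expected file names are copied from the listing rather than from any official Sphinx4 requirement.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Sphinx4.
class AcousticModelCheck {

    // File names copied from the directory listing in the question.
    public static final List<String> EXPECTED_FILES = Arrays.asList(
            "feat.params", "mdef", "means", "mixture_weights",
            "noisedict", "transition_matrices", "variances");

    /** Returns the expected files that are not regular files inside modelDir. */
    public static List<String> missingFiles(Path modelDir) {
        List<String> missing = new ArrayList<>();
        for (String name : EXPECTED_FILES) {
            if (!Files.isRegularFile(modelDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Assumed default path, following the question's resource layout.
        Path dir = Paths.get(args.length > 0 ? args[0] : "src/main/resources/Dutch/model");
        List<String> missing = missingFiles(dir);
        if (missing.isEmpty()) {
            System.out.println("All expected files are present in " + dir);
        } else {
            System.out.println("Missing from " + dir + ": " + missing);
        }
    }
}
```

A complete directory only shows that the model can be loaded. It does not show that the model is accurate enough for recognition.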

Solution

  • The Dutch model was very old; it had not been updated for 5 years. I've just uploaded a new model to the CMUSphinx website.

    https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/Dutch/

    It should be more accurate, but it is still trained on only 13 hours of data. English models are trained on 1000+ hours. We need more transcribed Dutch data.