android, voice-recognition, cmusphinx, pocketsphinx, pocketsphinx-android

Pocketsphinx - perfecting hot-word detection


I've revisited CMU Sphinx recently and attempted to set up a basic hot-word detector for Android, starting from the tutorial and adapting the sample application.

I'm having various issues that I've been unable to resolve, despite delving into the documentation until I could read no more...

To make them reproducible, I made a basic project designed to detect the key phrases wakeup you and wakeup me.

My dictionary:

me M IY
wakeup W EY K AH P
you Y UW

My language model:

\data\
ngram 1=5
ngram 2=5
ngram 3=4

\1-grams:
-0.9031 </s> -0.3010
-0.9031 <s> -0.2430
-1.2041 me -0.2430
-0.9031 wakeup -0.2430
-1.2041 you -0.2430

\2-grams:
-0.3010 <s> wakeup 0.0000
-0.3010 me </s> -0.3010
-0.6021 wakeup me 0.0000
-0.6021 wakeup you 0.0000
-0.3010 you </s> -0.3010

\3-grams:
-0.6021 <s> wakeup me
-0.6021 <s> wakeup you
-0.3010 wakeup me </s>
-0.3010 wakeup you </s>

\end\

Both of the above were created using the suggested tool.

And my key-phrases file:

wakeup you /1e-20/
wakeup me /1e-20/

Adapting the example application linked above, here is my code:

import android.app.Activity;
import android.os.AsyncTask;
import android.os.Bundle;
import android.util.Log;

import java.io.File;
import java.io.IOException;

import edu.cmu.pocketsphinx.Assets;
import edu.cmu.pocketsphinx.Hypothesis;
import edu.cmu.pocketsphinx.RecognitionListener;
import edu.cmu.pocketsphinx.SpeechRecognizer;

import static edu.cmu.pocketsphinx.SpeechRecognizerSetup.defaultSetup;

public class PocketSphinxActivity extends Activity implements RecognitionListener {

    private static final String CLS_NAME = PocketSphinxActivity.class.getSimpleName();

    private static final String HOTWORD_SEARCH = "hot_words";

    private volatile SpeechRecognizer recognizer;

    @Override
    public void onCreate(Bundle state) {
        super.onCreate(state);
        setContentView(R.layout.main);

        new AsyncTask<Void, Void, Exception>() {
            @Override
            protected Exception doInBackground(Void... params) {
                Log.i(CLS_NAME, "doInBackground");

                try {

                    final File assetsDir = new Assets(PocketSphinxActivity.this).syncAssets();

                    recognizer = defaultSetup()
                            .setAcousticModel(new File(assetsDir, "en-us-ptm"))
                            .setDictionary(new File(assetsDir, "basic.dic"))
                            .setKeywordThreshold(1e-20f)
                            .setBoolean("-allphone_ci", true)
                            .setFloat("-vad_threshold", 3.0)
                            .getRecognizer();

                    recognizer.addNgramSearch(HOTWORD_SEARCH, new File(assetsDir, "basic.lm"));
                    recognizer.addKeywordSearch(HOTWORD_SEARCH, new File(assetsDir, "hotwords.txt"));
                    recognizer.addListener(PocketSphinxActivity.this);

                } catch (final IOException e) {
                    Log.e(CLS_NAME, "doInBackground IOException");
                    return e;
                }

                return null;
            }

            @Override
            protected void onPostExecute(final Exception e) {
                Log.i(CLS_NAME, "onPostExecute");

                if (e != null) {
                    e.printStackTrace();
                } else {
                    recognizer.startListening(HOTWORD_SEARCH);
                }
            }
        }.execute();
    }

    @Override
    public void onBeginningOfSpeech() {
        Log.i(CLS_NAME, "onBeginningOfSpeech");
    }

    @Override
    public void onPartialResult(final Hypothesis hypothesis) {
        Log.i(CLS_NAME, "onPartialResult");

        if (hypothesis == null)
            return;

        final String text = hypothesis.getHypstr();
        Log.i(CLS_NAME, "onPartialResult: text: " + text);

    }

    @Override
    public void onResult(final Hypothesis hypothesis) {
        // unused
        Log.i(CLS_NAME, "onResult");
    }

    @Override
    public void onEndOfSpeech() {
        // unused
        Log.i(CLS_NAME, "onEndOfSpeech");
    }


    @Override
    public void onError(final Exception e) {
        Log.e(CLS_NAME, "onError");
        e.printStackTrace();
    }

    @Override
    public void onTimeout() {
        Log.i(CLS_NAME, "onTimeout");
    }

    @Override
    public void onDestroy() {
        super.onDestroy();
        Log.i(CLS_NAME, "onDestroy");

        // The recognizer may still be null if the background setup failed
        if (recognizer != null) {
            recognizer.cancel();
            recognizer.shutdown();
        }
    }
}

Note: if I choose key phrases that are more dissimilar (altering the related files accordingly) and test the implementation in a quiet environment, the setup and thresholds above work very successfully.

Problems

  1. When I say either wakeup you or wakeup me, both will be detected.

I can't establish how to apply an increased weighting to the differing end syllables.

  2. When I say just wakeup, often (but not always) both will be detected.

I can't establish how to avoid this occurring.

  3. When testing against background noise, the false positives are too frequent.

I can't lower the base thresholds I'm using, or the key phrases are no longer detected consistently under normal conditions.

  4. When I test against background noise for a long period (5 minutes is sufficient to replicate this), then return immediately to a quiet environment and utter the key phrases, nothing is detected.

It then takes an indeterminate period before the key phrases are detected successfully and repeatedly, as though the test had begun in a quiet environment.

I found a potentially related question, but the links no longer work. I wonder if I should be resetting the recogniser more frequently, so as to stop the background noise being averaged into the detection thresholds? I've sketched what I mean just after this list.

  5. Finally, I wonder whether my requirement for a limited set of key phrases would allow me to reduce the size of the acoustic model?

Any reduction in the packaging overhead of my application would of course be beneficial.
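
Regarding problem 4, here is a minimal sketch of the periodic restart I have in mind, reusing the recognizer and HOTWORD_SEARCH fields from my activity (with android.os.Handler), and assuming, unconfirmed, that restarting causes the audio parameters to be re-estimated:

private final Handler handler = new Handler();

private final Runnable restartSearch = new Runnable() {
    @Override
    public void run() {
        // Cancel the in-progress keyword search and start a fresh one,
        // hoping to discard the accumulated noise statistics
        // (whether this actually resets them is the open question)
        recognizer.cancel();
        recognizer.startListening(HOTWORD_SEARCH);

        // Repeat once a minute
        handler.postDelayed(this, 60000);
    }
};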

Very finally (honest!), and specifically in the hope that @NikolayShmyrev will spot this question: are there any plans to wrap a base Android implementation/SDK entirely via Gradle?

My thanks to those of you who made it this far...


Solution

  • My language model:

    You do not need the language model at all, since the keyword search does not use it; you can drop the addNgramSearch call.
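
    A minimal sketch of the setup without the ngram search, reusing the names from your code:

    recognizer = defaultSetup()
            .setAcousticModel(new File(assetsDir, "en-us-ptm"))
            .setDictionary(new File(assetsDir, "basic.dic"))
            .setKeywordThreshold(1e-20f)
            .setBoolean("-allphone_ci", true)
            .setFloat("-vad_threshold", 3.0)
            .getRecognizer();

    // Only the keyword search is registered; no language model is loaded
    recognizer.addKeywordSearch(HOTWORD_SEARCH, new File(assetsDir, "hotwords.txt"));
    recognizer.addListener(PocketSphinxActivity.this);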

  • I can't lower the base thresholds I'm using, or the key phrases are no longer detected consistently under normal conditions.

    1e-20 is a reasonable threshold. You could provide a sample recording of the false detections, to give me a better idea of what is going on.
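
    If the false alarms persist, you can also raise the per-phrase thresholds in the key-phrase file to make detection stricter; the values below are illustrative only and must be tuned on test recordings:

    wakeup you /1e-10/
    wakeup me /1e-10/

    A larger threshold (closer to 1) gives fewer false alarms but more missed detections; a smaller one gives the opposite.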

  • When I test against background noise for a long period (5 minutes is sufficient to replicate this), then return immediately to a quiet environment and utter the key phrases, nothing is detected.

    This is expected behavior. Long stretches of background noise make it harder for the recognizer to quickly adapt its audio parameters. If your task is to spot words in a noisy place, it is better to use some kind of hardware noise cancellation, for example a Bluetooth headset with noise cancellation.

  • Finally, I wonder whether my requirement for a limited set of key phrases would allow me to reduce the size of the acoustic model?

    That is not possible at the moment. If you are only looking for keyword spotting, you can try https://snowboy.kitt.ai