python windows azure-cognitive-services voice-recognition phonetics

Phoneme-level Pronunciation Correctness Microsoft Speech


I'm playing around with the Pronunciation Assessment service from Microsoft Cognitive Services (using the Python API). Currently, I can display the phoneme breakdown (along with the confidence scores) based on the reference text I pass in the request. My question is: is there any way to get the phoneme breakdown of what was actually said? In other words, is it possible to get as output the phonemes that were detected, instead of the phonemes the system expects to recognize according to the reference text?

This is the output I currently get. Instead of the phonemes that make up the reference word "can't", I would like to get the phonemes of the word that was actually spoken.

            {
                "Word": "can't", 
                "AccuracyScore": 85.0, 
                "ErrorType": "None", 
                "Offset": 39900000, 
                "Duration": 6500000, 
                "Phonemes": [
                    {
                        "Duration": 1300000, 
                        "Phoneme": "k", 
                        "AccuracyScore": 89.0, 
                        "Offset": 39900000
                    }, 
                    {
                        "Duration": 800000, 
                        "Phoneme": "aa", 
                        "AccuracyScore": 86.0, 
                        "Offset": 41300000
                    }, 
                    {
                        "Duration": 1600000, 
                        "Phoneme": "n", 
                        "AccuracyScore": 74.0, 
                        "Offset": 42200000
                    }, 
                    {
                        "Duration": 2500000, 
                        "Phoneme": "t", 
                        "AccuracyScore": 89.0, 
                        "Offset": 43900000
                    }
                ]
            }, 
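A minimal sketch of how the expected phonemes and their scores can be read out of a word entry like the one above, using only the standard json module. The helper names phoneme_scores and weak_phonemes and the threshold of 80 are hypothetical and chosen for illustration; they are not part of the Speech SDK. Note that this only lists the phonemes of the reference word with their scores, not the phonemes that were actually spoken.

```python
import json

# One word entry copied from the service output shown above.
word_json = """
{
    "Word": "can't",
    "AccuracyScore": 85.0,
    "ErrorType": "None",
    "Offset": 39900000,
    "Duration": 6500000,
    "Phonemes": [
        {"Duration": 1300000, "Phoneme": "k", "AccuracyScore": 89.0, "Offset": 39900000},
        {"Duration": 800000, "Phoneme": "aa", "AccuracyScore": 86.0, "Offset": 41300000},
        {"Duration": 1600000, "Phoneme": "n", "AccuracyScore": 74.0, "Offset": 42200000},
        {"Duration": 2500000, "Phoneme": "t", "AccuracyScore": 89.0, "Offset": 43900000}
    ]
}
"""


def phoneme_scores(word):
    """Return (phoneme, accuracy score) pairs for one word entry."""
    return [(p["Phoneme"], p["AccuracyScore"]) for p in word["Phonemes"]]


def weak_phonemes(word, threshold=80.0):
    """Return the phonemes scoring below the (hypothetical) threshold."""
    return [name for name, score in phoneme_scores(word) if score < threshold]


word = json.loads(word_json)
scores = phoneme_scores(word)
weak = weak_phonemes(word)

print(scores)
print(weak)
```

With the sample entry, only the phoneme "n" (score 74.0) falls below the illustrative threshold.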

Thanks in advance


Solution

  • Going through the Pronunciation Assessment documentation and the sample code on GitHub, it seems we can get what the speaker said by printing reference_text.

    You can also call PronunciationAssessmentConfig.to_json() (for example, pronunciation_config.to_json()) to get all of the parameters, including reference_text.