Tags: azure, rest, text-to-speech, azure-speech

Azure speech to text REST API V3 binary data


I'm trying to use the Azure Speech to Text service. In the documentation I keep running into examples that use the V1 API version: https://$region.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1

Meanwhile, basically every link to the proper documentation points to the V3 API:

https://{endpoint}/speechtotext/v3.0

In this V1 example you can easily send your file as binary data:

curl --location --request POST \
"https://$region.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=en-US" \
--header "Ocp-Apim-Subscription-Key: $key" \
--header "Content-Type: audio/wav" \
--data-binary "@$audio_file"

But I could not figure out how to provide the wordLevelTimestampsEnabled=true parameter to get word-level timestamps.

On the other hand, with the V3 API I can easily provide the wordLevelTimestampsEnabled=true parameter, but I couldn't figure out how to send binary file data.

curl -L -X POST "https://northeurope.api.cognitive.microsoft.com/speechtotext/v3.0/transcriptions" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "Ocp-Apim-Subscription-Key: $key" \
  --data-raw '{
  "contentUrls": [
    "https://url-to-file.dev/test-file.wav"
  ],
  "properties": {
    "diarizationEnabled": false,
    "wordLevelTimestampsEnabled": true,
    "punctuationMode": "DictatedAndAutomatic",
    "profanityFilterMode": "Masked"
  },
  "locale": "pl-PL",
  "displayName": "Transcription using default model for pl-PL"
}'

Is there a way to pass a binary file and also get word-level timestamps with the wordLevelTimestampsEnabled=true parameter?


Solution

  • Is there a way to pass a binary file and also get word-level timestamps with the wordLevelTimestampsEnabled=true parameter?

    As suggested by Code Different, converting the comment into a community wiki answer to help community members who might face a similar issue.

    As per the documentation, a binary file can't be uploaded directly to the v3.0 transcriptions endpoint; you have to provide a URL to the audio via the contentUrls property (for example, a blob in Azure Storage shared with a SAS token). A rough end-to-end sketch of that workflow follows the references below.

    For example:

    {
      "contentUrls": [
        "<URL to an audio file to transcribe>",
      ],
      "properties": {
        "diarizationEnabled": false,
        "wordLevelTimestampsEnabled": true,
        "punctuationMode": "DictatedAndAutomatic",
        "profanityFilterMode": "Masked"
      },
      "locale": "en-US",
      "displayName": "Transcription of file using default model for en-US"
    }
    

    You can refer to Speech-to-text REST API v3.0, cognitive-services-speech-sdk, and Azure Speech Recognition - use binary / hexadecimal data instead of WAV file path for more details.
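
    Below is the rough sketch mentioned above (not from the original answer, so treat every name as a placeholder): upload the local WAV to an Azure Storage container, generate a read-only SAS URL, create the v3.0 transcription with that URL in contentUrls, poll the asynchronous job until it finishes, then download the result file, which includes per-word offsets when wordLevelTimestampsEnabled is true. The storage account, container, keys, region, and file names are assumed values, and jq is used for JSON parsing.

    account="mystorageaccount"     # assumed storage account name
    container="audio"              # assumed container name
    storage_key="<storage-account-key>"
    region="northeurope"
    speech_key="<speech-resource-key>"

    # 1. Upload the binary audio file to blob storage
    az storage blob upload \
      --account-name "$account" --account-key "$storage_key" \
      --container-name "$container" --name test-file.wav --file ./test-file.wav

    # 2. Generate a read-only SAS URL for the blob (placeholder expiry date)
    sas_url=$(az storage blob generate-sas \
      --account-name "$account" --account-key "$storage_key" \
      --container-name "$container" --name test-file.wav \
      --permissions r --expiry "2030-01-01T00:00Z" \
      --full-uri --output tsv)

    # 3. Create the transcription, pointing contentUrls at the SAS URL
    transcription=$(curl -s -X POST \
      "https://$region.api.cognitive.microsoft.com/speechtotext/v3.0/transcriptions" \
      -H "Ocp-Apim-Subscription-Key: $speech_key" \
      -H "Content-Type: application/json" \
      -d "{
            \"contentUrls\": [\"$sas_url\"],
            \"properties\": { \"wordLevelTimestampsEnabled\": true },
            \"locale\": \"pl-PL\",
            \"displayName\": \"Transcription with word-level timestamps\"
          }" | jq -r .self)

    # 4. The transcription runs asynchronously; poll until it finishes
    while :; do
      status=$(curl -s -H "Ocp-Apim-Subscription-Key: $speech_key" "$transcription" | jq -r .status)
      if [ "$status" = "Succeeded" ] || [ "$status" = "Failed" ]; then break; fi
      sleep 10
    done

    # 5. Download the result file; with wordLevelTimestampsEnabled=true it
    #    contains per-word offsets and durations
    result_url=$(curl -s -H "Ocp-Apim-Subscription-Key: $speech_key" "$transcription/files" \
      | jq -r '.values[] | select(.kind == "Transcription") | .links.contentUrl')
    curl -s "$result_url" -o transcription-result.json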