I am trying to run speech-to-text on a batch of audio files that are each over 10 minutes long. I don't want to waste storage in the cloud bucket by uploading WAV files directly, so I am using ffmpeg
to convert the files to either ogg or mp3, like this:
ffmpeg -y -i audio.wav -ar 12000 -r 16000 audio.mp3
ffmpeg -y -i audio.wav -ar 12000 -r 16000 audio.ogg
For testing purposes I ran the speech-to-text service on a dummy wav file and it worked: I got the text as expected. But for some reason it doesn't detect any speech when I use the ogg or mp3 file. I could not get amr files to work either.
My code:
from google.cloud import speech

def transcribe_gcs(gcs_uri):
    client = speech.SpeechClient()
    audio = speech.RecognitionAudio(uri=gcs_uri)
    config = speech.RecognitionConfig(
        encoding="OGG_OPUS",  # replace with "LINEAR16" for wav, "OGG_OPUS" for ogg, "AMR" for amr
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    print("starting operation")
    # Asynchronous recognition; result() blocks until the operation finishes.
    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result()
    print(response)
I have set up the authentication properly, so that is not a problem.
When I run the speech-to-text service on the same audio in ogg or mp3 format (for mp3 I just comment out the encoding setting in the config), it returns an empty response: it prints a blank line and exits.
What can I do to fix this?
FLAC is compressed but lossless, so it should give the best speech-to-text results.
ffmpeg -i input.wav -vn output.flac
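On the API side, the config for a FLAC file could look roughly like the sketch below. It assumes the same google-cloud-speech client as in the question, and flac_config_kwargs is a hypothetical helper name, not part of the library. The sample rate is left out on purpose: as far as I know, the API treats sample_rate_hertz as optional for FLAC and WAV and reads it from the file header, which avoids a mismatch between the file and the config.

```python
def flac_config_kwargs(language_code="en-US"):
    """Keyword arguments for speech.RecognitionConfig when the audio is FLAC.

    Hypothetical helper for illustration; not part of google-cloud-speech.
    """
    return {
        # Encoding passed as a string, the same way the question passes
        # "LINEAR16"; the enum form would be
        # speech.RecognitionConfig.AudioEncoding.FLAC.
        "encoding": "FLAC",
        "language_code": language_code,
        # sample_rate_hertz is intentionally omitted so that it is taken
        # from the FLAC header instead of being hard-coded here.
    }
```

In transcribe_gcs this would be used as config = speech.RecognitionConfig(**flac_config_kwargs()).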
If file size matters a lot, use Opus in an OGG container. It produces small files with excellent quality.
ffmpeg -i input.wav -vn -c:a libopus output.ogg