I am using Python 3 to transcribe an audio file with Google Speech-to-Text via the provided Python packages (google-speech).
There is an option to define custom phrases to be used for transcription, as stated in the docs: https://cloud.google.com/speech-to-text/docs/speech-adaptation
For testing purposes I am using a small audio file containing the text:
[..] in this lecture we'll talk about the Burrows wheeler transform and the FM index [..]
And I am passing the following phrases to see the effect if, for example, I want a specific name to be recognized with a particular spelling. In this example I want to change "burrows" to "barrows":
config = speech.RecognitionConfig(dict(
    encoding=speech.RecognitionConfig.AudioEncoding.ENCODING_UNSPECIFIED,
    sample_rate_hertz=24000,
    language_code="en-US",
    enable_word_time_offsets=True,
    speech_contexts=[
        speech.SpeechContext(dict(
            phrases=["barrows", "barrows wheeler", "barrows wheeler transform"]
        ))
    ]
))
Unfortunately this does not seem to have any effect: the output is still the same as without the context phrases.
Am I using the phrases wrong, or is the model so confident that the word it hears is indeed "burrows" that it ignores my phrases?
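One thing I have not been able to verify yet: the SpeechContext message also seems to accept a boost field, so assigning a weight to the context might be worth a try (the value 20.0 here is just a guess):

speech.SpeechContext(dict(
    phrases=["barrows", "barrows wheeler", "barrows wheeler transform"],
    boost=20.0,
))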
PS: I also tried using speech_v1p1beta1.AdaptationClient and speech_v1p1beta1.SpeechAdaptation instead of putting the phrases into the config, but this only gives me an internal server error with no additional information on what is going wrong: https://cloud.google.com/speech-to-text/docs/adaptation
I have created an audio file to recreate your scenario, and I was able to improve the recognition using model adaptation. To better understand this feature, I suggest taking a look at this example and this post on the adaptation model.
Now, to improve the recognition of your phrase
in this lecture we'll talk about the Burrows wheeler transform and the FM index
I created a PhraseSet and a CustomClass that includes the word you would like to improve, in this case the word "barrows". You can also create/update/delete the phrase set and the custom class using the Speech-to-Text GUI. Below is the code I used for the improvement.
from google.cloud import speech_v1p1beta1 as speech
import argparse
def transcribe_with_model_adaptation(
    project_id="[PROJECT-ID]",
    location="global",
    speech_file=None,
    custom_class_id="[CUSTOM-CLASS-ID]",
    phrase_set_id="[PHRASE-SET-ID]",
):
    """
    Create `PhraseSet` and `CustomClass` resources to create custom lists of
    similar items that are likely to occur in your input data.
    """
    import io

    # Create the adaptation client
    adaptation_client = speech.AdaptationClient()

    # The parent resource where the custom class and phrase set will be created.
    parent = f"projects/{project_id}/locations/{location}"

    # Create the custom class resource
    adaptation_client.create_custom_class(
        {
            "parent": parent,
            "custom_class_id": custom_class_id,
            "custom_class": {
                "items": [
                    {"value": "barrows"}
                ]
            },
        }
    )
    custom_class_name = (
        f"projects/{project_id}/locations/{location}/customClasses/{custom_class_id}"
    )

    # Create the phrase set resource, referencing the custom class by name
    phrase_set_response = adaptation_client.create_phrase_set(
        {
            "parent": parent,
            "phrase_set_id": phrase_set_id,
            "phrase_set": {
                "boost": 0,
                "phrases": [
                    {"value": f"${{{custom_class_name}}}", "boost": 10},
                    {
                        "value": f"talk about the ${{{custom_class_name}}} wheeler transform",
                        "boost": 15,
                    },
                ],
            },
        }
    )
    phrase_set_name = phrase_set_response.name
    # print(u"Phrase set name: {}".format(phrase_set_name))

    # The next section shows how to use the newly created custom
    # class and phrase set to send a transcription request with speech adaptation

    # Speech adaptation configuration
    speech_adaptation = speech.SpeechAdaptation(
        phrase_set_references=[phrase_set_name])

    # Speech configuration object
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=24000,
        language_code="en-US",
        adaptation=speech_adaptation,
        enable_word_time_offsets=True,
        model="phone_call",
        use_enhanced=True,
    )

    # The name of the audio file to transcribe
    # storage_uri URI for audio file in Cloud Storage, e.g. gs://[BUCKET]/[FILE]
    with io.open(speech_file, "rb") as audio_file:
        content = audio_file.read()

    audio = speech.RecognitionAudio(content=content)
    # audio = speech.RecognitionAudio(uri="gs://biasing-resources-test-audio/call_me_fionity_and_ionity.wav")

    # Create the speech client
    speech_client = speech.SpeechClient()
    response = speech_client.recognize(config=config, audio=audio)

    for result in response.results:
        # The first alternative is the most likely one for this portion.
        print(u"Transcript: {}".format(result.alternatives[0].transcript))


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter
    )
    parser.add_argument("path", help="Path for audio file to be recognized")
    args = parser.parse_args()
    transcribe_with_model_adaptation(speech_file=args.path)
Note that you will get an element already exists message if you try to re-create the custom class and the phrase set. Here is the output from my test runs:

(python_speech2text) user@penguin:~/replication/python_speech2text$ python speech_model_adaptation_beta.py audio.flac
Transcript: in this lecture will talk about the Burrows wheeler transform and the FM index
(python_speech2text) user@penguin:~/replication/python_speech2text$ python speech_model_adaptation_beta.py audio.flac
Transcript: in this lecture will talk about the barrows wheeler transform and the FM index
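If you need to re-run the script end to end, one option is to catch that error and keep using the resources created earlier. A minimal sketch, assuming the error surfaces as google.api_core.exceptions.AlreadyExists (the other names come from the code above):

from google.api_core.exceptions import AlreadyExists

# Re-run guard: reuse the custom class if a previous run already created it.
try:
    adaptation_client.create_custom_class(
        {
            "parent": parent,
            "custom_class_id": custom_class_id,
            "custom_class": {"items": [{"value": "barrows"}]},
        }
    )
except AlreadyExists:
    # The resource name is deterministic, so custom_class_name as built
    # above still points at the existing custom class.
    pass

The same pattern applies to create_phrase_set (its resource name is projects/{project_id}/locations/{location}/phraseSets/{phrase_set_id}); alternatively, delete both resources first with adaptation_client.delete_custom_class and adaptation_client.delete_phrase_set.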
Finally, I would like to add some notes about the improvement and the code:

- I used a FLAC audio file, as it is recommended for optimal results.
- I used model="phone_call" and use_enhanced=True, as this was the model recognized by Cloud Speech-to-Text for my own audio file. The enhanced model can also provide better results; see the documentation for more details. Note that this configuration might vary for your audio file.
- Consider enabling data logging so that Google can collect data from your audio transcription requests; Google then uses this data to improve the machine-learning models it uses for recognizing speech audio.
- Once the custom class and the phrase set are created, you can use the Speech-to-Text UI to update them and run your tests quickly.
- I used the boost parameter in the phrase set. When you use boost, you assign a weighted value to phrase items in a PhraseSet resource. Speech-to-Text refers to this weighted value when selecting a possible transcription for words in your audio data: the higher the value, the higher the likelihood that Speech-to-Text chooses that word or phrase from the possible alternatives. See the sketch below for adjusting the boost on an existing phrase set.
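If you want to experiment with different boost values without re-creating everything, the beta client can also update an existing phrase set in place. A minimal sketch, reusing adaptation_client, phrase_set_name and custom_class_name from the code above (the boost value of 15 is just illustrative):

from google.cloud import speech_v1p1beta1 as speech
from google.protobuf import field_mask_pb2

# Hypothetical tweak: raise the boost on the custom-class reference to 15.
adaptation_client.update_phrase_set(
    phrase_set=speech.PhraseSet(
        name=phrase_set_name,
        phrases=[{"value": f"${{{custom_class_name}}}", "boost": 15}],
    ),
    # Replace only the phrases field; leave the rest of the resource untouched.
    update_mask=field_mask_pb2.FieldMask(paths=["phrases"]),
)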
I hope this information helps you improve your recognition results.