google-cloud-platform, speech-recognition, video-intelligence-api

Google Cloud Speech Transcription for Video Intelligence


I intend to use Google Cloud Speech Transcription for Video Intelligence. The following code only analyses a partial segment of the video.

# Imports for the pre-2.0 google-cloud-videointelligence interface.
from google.cloud import videointelligence
from google.cloud.videointelligence import enums, types


def transcribe_speech(video_uri, language_code, segments=None):
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [enums.Feature.SPEECH_TRANSCRIPTION]
    config = types.SpeechTranscriptionConfig(
        language_code=language_code,
        enable_automatic_punctuation=True,
    )
    context = types.VideoContext(
        segments=segments,
        speech_transcription_config=config,
    )

    print(f'Processing video "{video_uri}"...')
    operation = video_client.annotate_video(
        input_uri=video_uri,
        features=features,
        video_context=context,
    )
    return operation.result()


video_uri = "gs://cloudmleap/video/next/JaneGoodall.mp4"
language_code = "en-GB"

# Restrict the analysis to the 55s-80s segment of the video.
segment = types.VideoSegment()
segment.start_time_offset.FromSeconds(55)
segment.end_time_offset.FromSeconds(80)
response = transcribe_speech(video_uri, language_code, [segment])

How can I automatically analyse the whole video rather than defining a particular segment?


Solution

  • You can follow this tutorial in the Video Intelligence documentation, which shows how to transcribe a whole video. The input must be stored in a GCS bucket, and since your sample video is already in one, you should not have any issues with this.
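
    If you ever need to stage a local file in Cloud Storage first, a gsutil copy is enough (the bucket and file names here are placeholders, not from your setup):

    gsutil cp your_video.mp4 gs://your_bucket/your_video.mp4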

    Just make sure that you have installed the latest Video Intelligence library.

    pip install --upgrade google-cloud-videointelligence
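
    Note that the doc snippet further below uses the 2.x interface, where the request is passed to annotate_video as a single dict; your question's enums/types imports belong to the pre-2.0 library. You can confirm which version you have with:

    pip show google-cloud-videointelligence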
    

    Here is the code snippet from the Video Intelligence doc for transcribing the audio of a whole video:

    """Transcribe speech from a video stored on GCS."""
    from google.cloud import videointelligence
    
    path="gs://your_gcs_bucket/your_video.mp4"
    video_client = videointelligence.VideoIntelligenceServiceClient()
    features = [videointelligence.Feature.SPEECH_TRANSCRIPTION]
    
    config = videointelligence.SpeechTranscriptionConfig(
        language_code="en-US", enable_automatic_punctuation=True
    )
    video_context = videointelligence.VideoContext(speech_transcription_config=config)
    
    operation = video_client.annotate_video(
        request={
            "features": features,
            "input_uri": path,
            "video_context": video_context,
        }
    )
    
    print("\nProcessing video for speech transcription.")
    
    result = operation.result(timeout=600)
    
    # There is only one annotation_result since only
    # one video is processed.
    annotation_results = result.annotation_results[0]
    for speech_transcription in annotation_results.speech_transcriptions:
    
        # The number of alternatives for each transcription is limited by
        # SpeechTranscriptionConfig.max_alternatives.
        # Each alternative is a different possible transcription
        # and has its own confidence score.
        for alternative in speech_transcription.alternatives:
            print("Alternative level information:")
    
            print("Transcript: {}".format(alternative.transcript))
            print("Confidence: {}\n".format(alternative.confidence))
    
            print("Word level information:")
            for word_info in alternative.words:
                word = word_info.word
                start_time = word_info.start_time
                end_time = word_info.end_time
                print(
                    "\t{}s - {}s: {}".format(
                        start_time.seconds + start_time.microseconds * 1e-6,
                        end_time.seconds + end_time.microseconds * 1e-6,
                        word,
                    )
                )
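
    Note that the transcribe_speech helper from your question already handles this case: segments defaults to None, and a VideoContext without segments makes the API treat the whole video as a single segment. So you can transcribe the entire video simply by dropping the segment argument:

    # No segment list passed: segments stays None and the whole video is analysed.
    response = transcribe_speech(video_uri, language_code)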