Tags: google-cloud-platform, zsh, jq, video-intelligence-api, google-speech-to-text-api

How do I extract transcript with multiple speakers from Google Video Intelligence API Speech Transcription JSON output using jq?


I'm testing out Google Video Intelligence speech-to-text for transcribing podcast episodes with multiple speakers.

I've extracted an example and published it as a gist: output.json.

cat file.json | jq '.response.annotationResults[].speechTranscriptions[].alternatives[] | {startTime: .words[0].startTime, segment: .transcript }'

The command above prints the startTime of each segment, along with the segment itself. jq-output.json

{
  "startTime": "6.400s",
  "segment": "Hi, my name is Melinda Smith from Noble works. ...snip"
}
{
  "startTime": "30s",
  "segment": " Any Graham as a tool for personal and organizational ...snip"
}

What I'm aiming for is to have the speakerTag for each segment included in my jq output.

This is where I'm stuck. To start, each object within .alternatives[] contains .transcript (a string containing that segment), .confidence, and .words[] (an array with each word of that segment and the time it was spoken).

That part of the JSON is how I get the first part of the output. Then, after it has gone through each segment of the transcript, at the bottom there is one last .alternatives[] array containing (again) every word from the entire transcript, one at a time, along with its startTime, endTime, and speakerTag.

Here's a simplified example of what I mean:

speechTranscriptions:
  alternatives:
    transcript: "Example transcript segment"
    words:
      word: "Example"; startTime: 0s;
      word: "transcript"; startTime: 1s;
      word: "segment"; startTime: 2s;
  alternatives:
    transcript: "Another transcript segment"
    words:
      word: "Another"; startTime: 3s;
      word: "transcript"; startTime: 4s;
      word: "segment"; startTime: 5s;
  alternatives:
    words:
      word: "Example"; startTime: 0s; speakerTag: 1;
      word: "transcript"; startTime: 1s; speakerTag: 1;
      word: "segment"; startTime: 2s; speakerTag: 1;
      word: "Another"; startTime: 3s; speakerTag: 2;
      word: "transcript"; startTime: 4s; speakerTag: 2;
      word: "segment"; startTime: 5s; speakerTag: 2;

What I was thinking is to somehow go through the jq-output.json and match each startTime with its corresponding speakerTag found in the original Video Intelligence API output.

.response.annotationResults[].speechTranscriptions[].alternatives[] | ( if .words[].speakerTag then {time: .words[].startTime, speaker: .words[].speakerTag} else empty end)

I tried a few variations of this, with the idea of printing out only the startTime and speakerTag, then matching the values in the next step. My problem was not understanding how to only print the startTime if it has a corresponding speakerTag.

As mentioned in the comments, it would be preferable to generate this result in one command, but I was just trying to break the problem down into parts I could attempt to understand.


Solution

  • My problem was not understanding how to only print the startTime if it has a corresponding speakerTag.

    This could be accomplished using the filter:

    .response.annotationResults[].speechTranscriptions[].alternatives[].words[]
     | select(.speakerTag)
     | {time: .startTime, speaker: .speakerTag}
    

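As a quick check, here is how that select filter behaves on a tiny stand-in file. The file contents are an assumption based on the simplified example in the question, not the actual API output:

```shell
# Build a minimal stand-in for the API output (structure assumed from
# the simplified example in the question).
cat > sample.json <<'EOF'
{"response":{"annotationResults":[{"speechTranscriptions":[{"alternatives":[
  {"transcript":"Example transcript segment",
   "words":[{"word":"Example","startTime":"0s"},
            {"word":"transcript","startTime":"1s"},
            {"word":"segment","startTime":"2s"}]},
  {"words":[{"word":"Example","startTime":"0s","speakerTag":1},
            {"word":"transcript","startTime":"1s","speakerTag":1},
            {"word":"segment","startTime":"2s","speakerTag":1}]}
]}]}]}}
EOF

# select(.speakerTag) drops any word object that has no speakerTag
# field, so only the tagged copy of each word survives.
jq -c '.response.annotationResults[].speechTranscriptions[].alternatives[].words[]
       | select(.speakerTag)
       | {time: .startTime, speaker: .speakerTag}' sample.json
```

which prints one object per tagged word:

```
{"time":"0s","speaker":1}
{"time":"1s","speaker":1}
{"time":"2s","speaker":1}
```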
    So perhaps the following is a solution (or at least close to a solution) to the main problem:

    .response.annotationResults[].speechTranscriptions[].alternatives[]
    | (INDEX(.words[] | select(.speakerTag); .startTime) | map_values(.speakerTag)) as $dict
    | {startTime: .words[0].startTime, segment: .transcript}
    | . + {speaker: $dict[.startTime]}
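Note that the filter above builds $dict separately for each alternative, so the dictionary is empty while the transcript-bearing alternatives are being processed. A variant that builds one dictionary over the whole document first might look like the following sketch. It assumes jq 1.6+ (for the INDEX/2 builtin) and a sample file mirroring the simplified structure from the question, with the speaker-tagged words in a final speechTranscription:

```shell
# Two-speaker stand-in file (structure assumed from the question's
# simplified example, not real API output).
cat > sample.json <<'EOF'
{"response":{"annotationResults":[{"speechTranscriptions":[
 {"alternatives":[{"transcript":"Example transcript segment",
   "words":[{"word":"Example","startTime":"0s"},
            {"word":"transcript","startTime":"1s"},
            {"word":"segment","startTime":"2s"}]}]},
 {"alternatives":[{"transcript":"Another transcript segment",
   "words":[{"word":"Another","startTime":"3s"},
            {"word":"transcript","startTime":"4s"},
            {"word":"segment","startTime":"5s"}]}]},
 {"alternatives":[{"words":[
   {"word":"Example","startTime":"0s","speakerTag":1},
   {"word":"transcript","startTime":"1s","speakerTag":1},
   {"word":"segment","startTime":"2s","speakerTag":1},
   {"word":"Another","startTime":"3s","speakerTag":2},
   {"word":"transcript","startTime":"4s","speakerTag":2},
   {"word":"segment","startTime":"5s","speakerTag":2}]}]}
]}]}}
EOF

# Build startTime -> speakerTag once over all words, then annotate
# each transcript segment by looking up its first word's startTime.
jq -c '
  .response.annotationResults[].speechTranscriptions as $st
  | (INDEX($st[].alternatives[].words[] | select(.speakerTag); .startTime)
     | map_values(.speakerTag)) as $dict
  | $st[].alternatives[]
  | select(.transcript)
  | {startTime: .words[0].startTime, segment: .transcript,
     speaker: $dict[.words[0].startTime]}
' sample.json
```

On this sample it emits one annotated object per segment:

```
{"startTime":"0s","segment":"Example transcript segment","speaker":1}
{"startTime":"3s","segment":"Another transcript segment","speaker":2}
```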