python web-scraping nlp youtube video-streaming

How to extract subtitles from YouTube videos in different languages

I have used the code below to extract subtitles from YouTube videos, but it only works for videos in English. I also have some videos in Spanish. How can I modify the code to extract Spanish subtitles as well?

from pytube import YouTube
from youtube_transcript_api import YouTubeTranscriptApi

# Define the video URL or ID of the YouTube video you want to extract text from
video_url = 'https://www.youtube.com/watch?v=xYgoNiSo-kY'

# Download the video using pytube
youtube = YouTube(video_url)
video = youtube.streams.get_highest_resolution()
video.download()

# Get the downloaded video file path
video_path = video.default_filename

# Get the video ID from the URL
video_id = video_url.split('v=')[-1]

# Get the transcript for the specified video ID
transcript = YouTubeTranscriptApi.get_transcript(video_id)

# Extract the text from the transcript
captions_text = ''
for segment in transcript:
    caption = segment['text']
    captions_text += caption + ' '

# Print the extracted text
print(captions_text)

Solution

Use list_transcripts to get the list of languages available for the video:

Example:

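from youtube_transcript_api import YouTubeTranscriptApi

# The ID is the value of the "v" parameter in the video URL
video_id = 'xYgoNiSo-kY'
transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)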
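
The question derives the ID with video_url.split('v=')[-1], which returns the wrong value as soon as the URL carries extra parameters such as &t=30s. A minimal sketch of a more tolerant approach using only the standard library is shown here; extract_video_id is a hypothetical helper written for illustration, and it only handles the two URL shapes named in its comments.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url):
    """Return the YouTube video ID from a watch or youtu.be URL, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or '').lower()
    # Short links carry the ID in the path: https://youtu.be/<id>
    if host == 'youtu.be':
        return parsed.path.lstrip('/') or None
    # Standard links carry the ID in the "v" query parameter
    if host in ('youtube.com', 'www.youtube.com', 'm.youtube.com'):
        values = parse_qs(parsed.query).get('v')
        return values[0] if values else None
    return None


print(extract_video_id('https://www.youtube.com/watch?v=xYgoNiSo-kY&t=30s'))
```

The returned string can be passed wherever the examples here use video_id.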

Then loop over transcript_list to see which languages are available:

Example:

for tr in transcript_list:
    print(tr.language_code)

In this case, the result is:

es
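
Once the available codes are known, choosing which transcript to request comes down to a priority lookup. The sketch below shows that logic in plain Python; pick_language is a hypothetical helper written for illustration and is not part of youtube_transcript_api. As far as I know, find_transcript also accepts a list of codes and tries them in order, so treat this as an explanation of the idea rather than something you must write yourself.

```python
def pick_language(available_codes, preferred_codes):
    """Return the first preferred code that is available, else None."""
    available = set(available_codes)
    for code in preferred_codes:
        if code in available:
            return code
    return None


# With the single Spanish transcript listed above, English is skipped
print(pick_language(['es'], ['en', 'es']))
```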

Modify your code to loop over the languages available for the video and download the captions for each one:

Example:

# List that will hold one entry per available language:
all_captions = []

# Loop over every transcript available for this video and download its captions:
for tr in transcript_list:
    print("Downloading captions in " + tr.language + "...")
    # Each item is already a transcript object, so it can be fetched directly
    segments = tr.fetch()
    # Join the text of all segments into a single string
    captions_text = ' '.join(segment['text'] for segment in segments)
    all_captions.append({"language": tr.language_code + " - " + tr.language, "captions": captions_text})
    print("=" * 20)
print("Done")

The all_captions variable then holds the captions and the language of each transcript obtained for the given video ID.
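
If you only need Spanish (with English as a fallback) rather than every language, a shorter route may be to pass a languages list to get_transcript. As far as I can tell that call looks for English by default, which would explain why the original code only worked for English videos. The sketch below assumes an installed release of youtube_transcript_api that still provides the static get_transcript method used in the question; newer releases may expose a different interface. join_segments and fetch_captions_text are hypothetical helper names, and the sample segments are made up for the offline check.

```python
def join_segments(segments):
    """Join the 'text' field of every transcript segment into one string."""
    return ' '.join(segment['text'].strip() for segment in segments)


def fetch_captions_text(video_id, languages=('es', 'en')):
    """Fetch the first transcript found among `languages` and return its text."""
    # Imported here so join_segments stays usable without the library installed
    from youtube_transcript_api import YouTubeTranscriptApi
    # Assumption: the installed release still provides the static get_transcript
    # method from the question, including its `languages` argument
    segments = YouTubeTranscriptApi.get_transcript(video_id, languages=list(languages))
    return join_segments(segments)


# Offline check with hand-written segments in the same shape as the question's
sample = [{'text': 'Hola a todos'}, {'text': 'bienvenidos al canal'}]
print(join_segments(sample))
```

With network access, fetch_captions_text('xYgoNiSo-kY') would request the Spanish transcript first and fall back to English if none is found.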