I have used the code below to extract subtitles from YouTube videos, but it only works for videos in English. I also have some videos in Spanish. How can I modify the code to extract Spanish subtitles as well?
from pytube import YouTube
from youtube_transcript_api import YouTubeTranscriptApi
# Define the video URL or ID of the YouTube video you want to extract text from
video_url = 'https://www.youtube.com/watch?v=xYgoNiSo-kY'
# Download the video using pytube
youtube = YouTube(video_url)
video = youtube.streams.get_highest_resolution()
video.download()
# Get the downloaded video file path
video_path = video.default_filename
# Get the video ID from the URL
video_id = video_url.split('v=')[-1]
# Get the transcript for the specified video ID
transcript = YouTubeTranscriptApi.get_transcript(video_id)
# Extract the text from the transcript
captions_text = ''
for segment in transcript:
    caption = segment['text']
    captions_text += caption + ' '
# Print the extracted text
print(captions_text)
Use list_transcripts to get the list of languages available for the video:
Example:
video_id = 'xYgoNiSo-kY'
transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
Then loop over the transcript_list variable to see which languages are available:
Example:
for tr in transcript_list:
    print(tr.language_code)
In this case, the result is:
es
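If only Spanish is needed, a shorter route is to pass a language priority list to get_transcript. Its languages parameter defaults to English only, which is why the original code fails on Spanish videos. The following is a minimal sketch, assuming the same version of youtube_transcript_api as in the question (static get_transcript, segments returned as dicts with a 'text' key); the helper names fetch_captions_text and segments_to_text are made up for illustration.

```python
def segments_to_text(segments):
    """Join the 'text' field of each transcript segment into one string."""
    return " ".join(segment["text"] for segment in segments)


def fetch_captions_text(video_id, languages=("es", "en")):
    """Return the caption text in the first language from `languages` that exists."""
    # Imported here so segments_to_text can be used without the package installed.
    from youtube_transcript_api import YouTubeTranscriptApi

    # `languages` is a priority list: Spanish is tried first, then English.
    transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=list(languages))
    return segments_to_text(transcript)
```

Calling fetch_captions_text('xYgoNiSo-kY') should then return the Spanish captions for the video in the question, provided a Spanish transcript exists.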
Modify your code to loop over the languages available for the video and download the captions for each one:
Example:
# Variables to store the downloaded captions:
all_captions = []
caption = None
captions_text = ''
# Loop over all languages available for this video and download the captions:
for tr in transcript_list:
    print("Downloading captions in " + tr.language + "...")
    transcript_obtained_in_language = transcript_list.find_transcript([tr.language_code]).fetch()
    for segment in transcript_obtained_in_language:
        caption = segment['text']
        captions_text += caption + ' '
    all_captions.append({"language": tr.language_code + " - " + tr.language, "captions": captions_text})
    # Reset the accumulators before the next language
    caption = None
    captions_text = ''
    print("="*20)
print("Done")
The all_captions variable will then hold the captions, together with their language, for every transcript available for the given video ID.
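One related detail: the question derives the video ID with video_url.split('v=')[-1], which would also pick up any extra query parameters (for example &t=10s). A more defensive sketch using only the standard library is shown below; extract_video_id is a hypothetical helper name, and it only covers the watch?v= and youtu.be URL shapes.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url):
    """Return the video ID from a watch?v=... or youtu.be/... URL, or None."""
    parsed = urlparse(url)
    # Short links carry the ID in the path: https://youtu.be/<id>
    if parsed.hostname in ("youtu.be", "www.youtu.be"):
        return parsed.path.lstrip("/") or None
    # Regular links carry the ID in the 'v' query parameter.
    return parse_qs(parsed.query).get("v", [None])[0]


video_id = extract_video_id("https://www.youtube.com/watch?v=xYgoNiSo-kY&t=10s")
print(video_id)
```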