I have installed pytube to extract captions from some YouTube videos. Both of the following snippets give me the XML captions.
from pytube import YouTube
yt = YouTube('https://www.youtube.com/watch?v=4ZQQofkz9eE')
caption = yt.captions['a.en']
print(caption.xml_captions)
and also, as mentioned in the docs:
yt = YouTube('http://youtube.com/watch?v=2lAe1cqCOXo')
caption = yt.captions.get_by_language_code('en')
caption.xml_captions
But in both cases I get the XML output, and when I use
print(caption.generate_srt_captions())
I get the following error. How can I extract the captions in SRT format?
KeyError
~/anaconda3/envs/myenv/lib/python3.6/site-packages/pytube/captions.py in generate_srt_captions(self)
49 recompiles them into the "SubRip Subtitle" format.
50 """
51 return self.xml_caption_to_srt(self.xml_captions)
52
53 @staticmethod
~/anaconda3/envs/myenv/lib/python3.6/site-packages/pytube/captions.py in xml_caption_to_srt(self, xml_captions)
81 except KeyError:
82 duration = 0.0
83 start = float(child.attrib["start"])
84 end = start + duration
85 sequence_number = i + 1 # convert from 0-indexed to 1.
KeyError: 'start'
This is a bug in the library itself: the parser looks for a start attribute that the caption XML no longer contains. Everything below was done with pytube 11.01. In captions.py, on line 76, replace:
for i, child in enumerate(list(root)):
with:
for i, child in enumerate(list(root.findall('body/p'))):
Then on line 83, replace:
duration = float(child.attrib["dur"])
with:
duration = float(child.attrib["d"])
Then on line 86, replace:
start = float(child.attrib["start"])
with:
start = float(child.attrib["t"])
If the output then shows only the sequence numbers and timestamps but no subtitle text, replace line 77:
text = child.text or ""
with:
text = ''.join(child.itertext()).strip()
if not text:
    continue
This worked for me with Python 3.9 and pytube 11.01. Good luck!
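
If you would rather not edit the installed library, the same conversion can be sketched as a standalone function that takes the XML string and returns SRT text. This is only a rough sketch under a few assumptions: the XML uses the body/p layout with t and d attributes described above, and those values are in milliseconds (if they turn out to be seconds, the timestamps will be off by a factor of 1000 and the formatting helper needs adjusting). The names xml_to_srt and ms_to_srt_time are my own and are not part of pytube.

```python
import xml.etree.ElementTree as ET


def ms_to_srt_time(ms: int) -> str:
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def xml_to_srt(xml_captions: str) -> str:
    """Convert caption XML with <body><p t=... d=...> cues into SRT text.

    Hypothetical helper, not a pytube function.
    """
    root = ET.fromstring(xml_captions)
    blocks = []
    for p in root.findall("body/p"):
        # Gather text from the cue and any nested elements, collapsing whitespace.
        text = " ".join("".join(p.itertext()).split())
        if not text:
            continue  # skip cues that carry no text
        start = int(p.attrib["t"])  # assumption: milliseconds
        end = start + int(p.attrib.get("d", 0))  # a missing duration counts as 0
        number = len(blocks) + 1  # number only the cues that are kept
        blocks.append(
            f"{number}\n{ms_to_srt_time(start)} --> {ms_to_srt_time(end)}\n{text}\n"
        )
    return "\n".join(blocks)
```

With the caption object from the question, it would be called as print(xml_to_srt(caption.xml_captions)). Numbering the cues after the empty ones are skipped keeps the SRT sequence free of gaps.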