python web-scraping beautifulsoup m3u8 m3u

How to programmatically download an m3u8 video referenced in a blob in Python?

Note that this question is different from How do we download a blob url video [closed] in that it requires no human interaction with the browser.

I have the following problem:

I have a list of URLs. They point to HTML pages that have the same underlying structure.
There's an image in the middle of the page; when it's clicked, it loads a player.
The player, loaded as a blob, references an m3u8 playlist, though this is not visible in the HTML itself (it's visible in the Network tab of Chrome).
The player streams a short video.

What I need to do:

Programmatically access the various URLs. Get the HTML and click on the image-player.
Get the blob reference and use that one to get the m3u8 playlist.
Download the stream as a video (bonus points for downloading it as a gif).

Note that the solution should require no human interaction with the browser. API-wise, the input should be a list of URLs and the output a list of videos/gifs.

An example page can be found here in case you want to test your solution.

My understanding is that I can use Selenium to get the HTML and click on the image to start the player. However, I have no idea how to process the blob to get the m3u8 and then use that one for the actual video.

Solution

With a little digging, it turns out you don't need to click any buttons. Clicking the button just requests the master.m3u8 file, and using dev tools you can piece together the requested URL. That first file doesn't contain the links to the actual video, though; you have to piece together another request to get the final m3u8 file. From there, you can use the other SO links to download the video. It's segmented, so it's not a straightforward download. You can uncomment the print statements below to see what each m3u8 file contains. This will loop through the pages as well.

    import re

    import requests
    from bs4 import BeautifulSoup

    for i in range(6119, 6121):
        url = 'https://www2.nhk.or.jp/signlanguage/sp/enquete.cgi?dno={}'.format(i)
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')

        # locate the element whose onclick attribute has the data we need
        tag = soup.find(onclick=re.compile('signlanguage/movie'))
        print(tag)

        # the video id is the second argument of the onclick call
        video_id = tag.get('onclick').split(',')[1].replace("'", "")
        m3u8_url = 'https://nhks-vh.akamaihd.net/i/signlanguage/movie/v4/{}/{}.mp4/master.m3u8'.format(video_id[-1], video_id)
        # this m3u8 file doesn't contain download links, the next one does; so download and save that one
        r = requests.get(m3u8_url)
        # print(r.text)

        m3u8_url_2 = r.text.split('\n')[2]  # get first link; high bandwidth
        r2 = requests.get(m3u8_url_2)
        # print(r2.text)

        # there are other ways to download the file, i'm just creating a new one with the data read and writing to a file
        fn = video_id + '.m3u8'
        with open(fn, 'w') as f:  # the with block closes the file on exit
            f.write(r2.text)
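
The saved playlist only lists the segments; turning it into a video still means fetching each one. Below is a rough sketch of that last step, not a tested solution for this particular site. It assumes the segments are unencrypted MPEG-TS files (no `#EXT-X-KEY` line in the playlist), in which case writing them back to back usually gives a playable `.ts` file. The helper names `best_variant`, `segment_urls` and `download_stream` are made up for this example. `best_variant` is an optional, sturdier replacement for `r.text.split('\n')[2]` that picks the variant by its `BANDWIDTH` attribute instead of by line position.

```python
import re
from urllib.parse import urljoin


def best_variant(master_text, base_url):
    """Return the URL of the highest-BANDWIDTH variant in a master playlist."""
    lines = [ln.strip() for ln in master_text.splitlines() if ln.strip()]
    best_bw, best_url = -1, None
    for i, line in enumerate(lines[:-1]):
        if not line.startswith('#EXT-X-STREAM-INF'):
            continue
        # match BANDWIDTH but not AVERAGE-BANDWIDTH
        m = re.search(r'[:,]BANDWIDTH=(\d+)', line)
        bw = int(m.group(1)) if m else 0
        if bw > best_bw:
            # the variant URI is the line right after the tag
            best_bw, best_url = bw, urljoin(base_url, lines[i + 1])
    return best_url


def segment_urls(media_text, base_url):
    """Return absolute URLs of the segments listed in a media playlist."""
    urls = []
    for ln in media_text.splitlines():
        ln = ln.strip()
        if ln and not ln.startswith('#'):  # non-tag lines are segment URIs
            urls.append(urljoin(base_url, ln))
    return urls


def download_stream(media_url, out_path, fetch=None):
    """Fetch every segment of a media playlist and concatenate them.

    Assumption: segments are plain (unencrypted) MPEG-TS.
    `fetch` maps a URL to bytes; it defaults to a requests-based getter.
    """
    if fetch is None:
        import requests
        session = requests.Session()

        def fetch(url):
            resp = session.get(url, timeout=30)
            resp.raise_for_status()
            return resp.content

    playlist = fetch(media_url).decode('utf-8')
    with open(out_path, 'wb') as out:
        for url in segment_urls(playlist, media_url):
            out.write(fetch(url))
```

Inside the loop above this could be called as `download_stream(m3u8_url_2, video_id + '.ts')`. If ffmpeg is installed, it can then convert the result to another container or to a gif (for example `ffmpeg -i in.ts out.gif`), which may be simpler than doing the conversion in Python.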