python json web-scraping beautifulsoup imdb

How to scrape IMDB id / link from JustWatch?


Can we get the IMDb link behind the IMDb image on the JustWatch page https://www.justwatch.com/in/movie/oppenheimer?

When I inspected the IMDb image element, it did not contain any IMDb link.

[Screenshot: JustWatch page]

However, clicking on it opens the IMDb link https://www.imdb.com/title/tt15398776/?ref_=justwatch.

[Screenshot: IMDb page]

Is there a way to scrape a link that doesn't show up in the inspect view using Python?

Thank you in advance.

This is my code, which only gets the rating:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = "https://www.justwatch.com/in/movie/oppenheimer"

req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

soup = BeautifulSoup(webpage, 'html.parser')
# The second matching span holds the rating; print its text
print(soup.select('div.jw-scoring-listing__rating span span')[1].get_text(strip=True))

Solution

  • urllib and requests are not browsers: they do not execute JavaScript or render dynamic content. However, if the information is already included in the static response, it can still be extracted.

    You could search the content of the script elements with a regex for the external imdbId:

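    from urllib.request import Request, urlopen
    import re

    url = "https://www.justwatch.com/in/movie/oppenheimer"
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    # Decode the raw bytes so the regex runs on the HTML text itself
    webpage = urlopen(req).read().decode('utf-8')

    # Look for the "imdbId" key embedded in the page's script data
    match = re.search(r'"imdbId":\s*"([^"]+)"', webpage)

    if match:
        imdb_id_value = match.group(1)
        print(f'https://www.imdb.com/title/{imdb_id_value}/?ref_=justwatch')
    else:
        print('no imdbId found')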
    

    If an imdbId is found, this prints the following link:

    https://www.imdb.com/title/tt15398776/?ref_=justwatch
    

    Alternatively, convert the content into JSON and treat it like a dict:

    ...
    # Keep only the JSON that follows 'window.__APOLLO_STATE__='
    json.loads(soup.select_one('script:-soup-contains("APOLLO_STATE")').text.split('=', 1)[1])
    ...
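
The JSON approach above can be sketched as follows, using only the standard library so it runs offline. This is a minimal sketch and not a definitive implementation: it assumes the page embeds its data as `window.__APOLLO_STATE__={...}`, as the answer suggests, and it makes no assumption about where `imdbId` sits inside that object, so it searches every nesting level. The function names and `SAMPLE_HTML` are hypothetical and made up for illustration; the real page layout may differ or change.

```python
import json
import re
from typing import Any, Iterator, Optional

# Marker taken from the answer above; the exact page layout is an assumption.
STATE_PREFIX = re.compile(r"window\.__APOLLO_STATE__\s*=\s*")


def load_apollo_state(html: str) -> Optional[Any]:
    """Return the decoded state object, or None if the marker is missing."""
    found = STATE_PREFIX.search(html)
    if found is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (for example a trailing ';' or '</script>').
    state, _end = json.JSONDecoder().raw_decode(html, found.end())
    return state


def find_values(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key`, at any nesting depth."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from find_values(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


def imdb_link(html: str) -> Optional[str]:
    """Build the IMDb URL from the first string "imdbId" found, if any."""
    state = load_apollo_state(html)
    if state is None:
        return None
    imdb_id = next(
        (v for v in find_values(state, "imdbId") if isinstance(v, str)), None
    )
    return f"https://www.imdb.com/title/{imdb_id}/" if imdb_id else None


# Made-up miniature page, only to exercise the functions offline.
SAMPLE_HTML = (
    "<html><script>window.__APOLLO_STATE__="
    '{"Movie:1": {"content": {"externalIds": {"imdbId": "tt15398776"}}}};'
    "</script></html>"
)

if __name__ == "__main__":
    print(imdb_link(SAMPLE_HTML))
```

To use it on the real page, pass the decoded response text (for example `urlopen(req).read().decode('utf-8')`) to `imdb_link`. Searching recursively trades a little speed for not having to hard-code the key path inside the state object.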