Tags: python, html, web-scraping, beautifulsoup, imdb

IMDb web scraping for the top 250 movies using BeautifulSoup


I know that there are many similar questions here already, but none of them answers my problem satisfactorily. So here it is:

For an assignment, we need to create a dataframe of the top 250 movies on IMDb, so we first have to scrape the data using BeautifulSoup.

These are the attributes that we need to scrape:

IMDb id (0111161)
Movie name (The Shawshank Redemption)
Year (1994)
Director (Frank Darabont)
Stars (Tim Robbins, Morgan Freeman, Bob Gunton)
Rating (9.3)
Number of reviews (2.6M)
Genres (Drama)
Country (USA)
Language (English)
Budget ($25,000,000)
Gross box office revenue ($28,884,504)

So far, I have managed to get only a few of them. I have collected the URLs of all the movies, and now I loop over them. This is the loop so far:

# Assumes requests, pandas (pd) and BeautifulSoup (bs) are imported,
# and that top_250_links and df were created earlier.
for x, url in enumerate(top_250_links):
    req = requests.get(url)
    page = req.text
    soup = bs(page, 'html.parser')
    
    # ID
    
    # Movie Name
    Movie_name=(soup.find("div",{'class':"sc-dae4a1bc-0 gwBsXc"}).get_text(strip=True).split(': ')[1])
    
    # Year
    year =(soup.find("a",{'class':"ipc-link ipc-link--baseAlt ipc-link--inherit-color sc-8c396aa2-1 WIUyh"}).get_text())
    
    # Length
    
    
    # Director
    director = (soup.find("a",{'class':"ipc-metadata-list-item__list-content-item"}).get_text())
    
    # Stars
    stars = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
    
    
    # Rating
    rating = (soup.find("span",{'class':"sc-7ab21ed2-1 jGRxWM"}).get_text())
    rating = float(rating)
        
    # Number of Reviews
    reviews = (soup.find("span",{'class':"score"}).get_text())
    reviews = reviews.split('K')[0]
    reviews = float(reviews)*1000
    reviews = int(reviews)
    
    # Genres
    genres = (soup.find("span",{'class':"ipc-chip__text"}).get_text())

    # Language
    
    
    # Country
    
    
    # Budget
    meta = (soup.find("div" ,{'class':"ipc-metadata-list-item__label ipc-metadata-list-item__label--link"}))
    
    
    # Gross box Office Revenue
    gross = (soup.find("span",{'class':"ipc-metadata-list-item__list-content-item"}).get_text())
    
    # Combine
    movie_dict={
        'Rank':x+1,
        'ID': 0,
        'Movie Name' : Movie_name,
        'Year' : year,
        'Length' : 0,
        'Director' : director,
        'Stars' : stars,
        'Rating' : rating,
        'Number of Reviews' : reviews,
        'Genres' : genres,
        'Language': 0,
        'Country': 0,
        'Budget' : 0,
        'Gross box Office Revenue' :0}
    
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, pd.DataFrame([movie_dict])], ignore_index=True)

I can't find a way to obtain the missing information. If anybody here has experience with this kind of task and can share their thoughts, it would help a lot of people. I think the task is not new and has been solved hundreds of times, but IMDb has since changed the classes and the structure of its HTML.

Thanks in advance.


Solution

  • BeautifulSoup has many functions for searching elements, so it is worth reading the full documentation.

    You can build more complex lookups by chaining several .find() calls with .parent, etc.

    soup.find(string='Language').parent.parent.find('a').text
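
    To see what the .parent chain does without hitting the network, here is a small offline sketch. The HTML snippet is made up and only imitates a label/value list; the real IMDb markup is different and may change. It uses string=, which is the current name of the older text= argument.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: it only imitates a label/value list, not IMDb's real HTML.
html = """
<ul>
  <li><span>Country of origin</span><div><a href="#">United States</a></div></li>
  <li><span>Language</span><div><a href="#">English</a></div></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: find the text node whose content is exactly "Language".
label = soup.find(string="Language")

# Step 2: climb two levels, first to the <span>, then to the enclosing <li>.
row = label.parent.parent

# Step 3: search downwards again, but only inside that <li>.
language = row.find("a").get_text(strip=True)
print(language)
```

    How many .parent steps you need depends on how deeply the label is nested, so check the page source before copying the chain.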
    

    For some elements you can also search by the data-testid="...." attribute.

    soup.find('li', {'data-testid': 'title-details-languages'}).find('a').text
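
    A chained .find() raises AttributeError as soon as one step returns None, which can happen when a page lacks a field. A possible way around this is a small helper like the one below. text_by_testid is a hypothetical name, the HTML is a made-up stand-in for the real page, and 'title-details-missing' is an invented id used only to show the fallback.

```python
from bs4 import BeautifulSoup


def text_by_testid(soup, testid, default=None):
    """Return the text of the first <a> inside <li data-testid=testid>, or default."""
    item = soup.find("li", {"data-testid": testid})
    if item is None:
        return default
    link = item.find("a")
    if link is None:
        return default
    return link.get_text(strip=True)


# Hypothetical markup that reuses the two data-testid values shown above.
html = """
<ul>
  <li data-testid="title-details-origin"><span>Country of origin</span><a href="#">United States</a></li>
  <li data-testid="title-details-languages"><span>Language</span><a href="#">English</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
print(text_by_testid(soup, "title-details-languages"))
print(text_by_testid(soup, "title-details-missing", default="n/a"))
```

    Returning a default instead of raising keeps a loop over many pages running, and the missing values can be handled later in the dataframe.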
    

    Minimal working code (for The Shawshank Redemption):

    import requests
    from bs4 import BeautifulSoup as BS
    
    url = 'https://www.imdb.com/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=1a264172-ae11-42e4-8ef7-7fed1973bb8f&pf_rd_r=A453PT2BTBPG41Y0HKM8&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1'
    
    response = requests.get(url)
    soup = BS(response.text, 'html.parser')
    
    print('Language:', soup.find(string='Language').parent.parent.find('a').get_text(strip=True))
    print('Country of origin:', soup.find(string='Country of origin').parent.parent.find('a').get_text(strip=True))
    
    for name in ('Language', 'Country of origin'):
        value = soup.find(string=name).parent.parent.find('a').get_text(strip=True)
        print(name, ':', value)
    
    print('Language:', soup.find('li', {'data-testid':'title-details-languages'}).find('a').get_text(strip=True))
    print('Country of origin:', soup.find('li', {'data-testid':'title-details-origin'}).find('a').get_text(strip=True))
    
    for name, testid in (('Language', 'title-details-languages'), ('Country of origin', 'title-details-origin')):
        value = soup.find('li', {'data-testid':testid}).find('a').get_text(strip=True)
        print(name, ':', value)
    

    Result:

    Language: English
    Country of origin: United States
    
    Language : English
    Country of origin : United States
    
    Language: English
    Country of origin: United States
    
    Language : English
    Country of origin : United States
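
    Two of the other fields from the question may not need a selector of their own. The IMDb id is already part of each movie URL, and a short count such as 2.6M can be converted with a small parser. The sketch below assumes that title URLs contain /title/tt<digits> and that counts use the suffixes K, M or B. Both function names are hypothetical.

```python
import re


def imdb_id_from_url(url):
    """Return the digits of the 'tt' identifier in a title URL, or None."""
    match = re.search(r"/title/tt(\d+)", url)
    return match.group(1) if match else None


def parse_count(text):
    """Convert a short count such as '2.6M' or '12K' to an int."""
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    text = text.strip().replace(",", "")
    suffix = text[-1:].upper()
    if suffix in multipliers:
        return int(round(float(text[:-1]) * multipliers[suffix]))
    return int(float(text))


print(imdb_id_from_url("https://www.imdb.com/title/tt0111161/?ref_=chttp_tt_1"))
print(parse_count("2.6M"))
```

    The id is kept as a string so that the leading zero in 0111161 is preserved.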