python web-scraping imdb

How do I scrape the names of the production companies from the IMDb website?


I need to scrape the names of the production companies of some movies. I keep trying to select them by the anchor tag `a` and the class that encloses the names, but it does not return the production companies.

URL : https://www.imdb.com/title/tt0473553/?ref_=fn_al_tt_1

Here's the part of the page's HTML that I want to scrape:

<section class="ipc-page-section ipc-page-section--base">
  <div data-testid="title-details-section" class="styles__MetaDataContainer-sc-12uhu9s-0 cgqHBf">
    <ul>
      <li role="presentation" class="ipc-metadata-list__item ipc-metadata-list-item--link" data-testid="title-details-companies"><a class="ipc-metadata-list-item__label ipc-metadata-list-item__label--link" rel="" href="/title/tt0473553/companycredits?ref_=tt_dt_co" target="">Production companies</a>
        <div class="ipc-metadata-list-item__content-container">
          <ul class="ipc-inline-list ipc-inline-list--show-dividers ipc-inline-list--inline ipc-metadata-list-item__list-content base" role="presentation">
            <li role="presentation" class="ipc-inline-list__item">
                <a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0136980?ref_=tt_dt_co_1">IDT Entertainment</a>
            </li>
            <li role="presentation" class="ipc-inline-list__item">
                <a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0142161?ref_=tt_dt_co_2">New Arc Entertainment</a>
            </li>
          </ul>
        </div>
      </li>
    </ul>
  </div>
</section>

Here's what I have tried:

import requests
from bs4 import BeautifulSoup

movie_url="https://www.imdb.com/title/tt0473553/?ref_=fn_al_tt_1"
movie_page = requests.get(movie_url)
soup = BeautifulSoup(movie_page.text, 'html.parser')

#movies_comp = soup.find_all("li", class_="ipc-inline-list__item")
movies_comp = soup.find_all("a", class_="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link")

print(movies_comp)

I am not getting the desired output. I expect it to return something like:

['IDT Entertainment', 'New Arc Entertainment']

Solution

  • Here's what you can try:

    import requests
    
    from bs4 import BeautifulSoup
    
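    # On the live site you would fetch the HTML first, for example:
    #     page = requests.get("https://www.imdb.com/title/tt0473553/?ref_=fn_al_tt_1").text
    # The snippet from the question is used below so the example is reproducible.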
    
    page="""
    <section class="ipc-page-section ipc-page-section--base">
      <div data-testid="title-details-section" class="styles__MetaDataContainer-sc-12uhu9s-0 cgqHBf">
        <ul>
          <li role="presentation" class="ipc-metadata-list__item ipc-metadata-list-item--link" data-testid="title-details-companies"><a class="ipc-metadata-list-item__label ipc-metadata-list-item__label--link" rel="" href="/title/tt0473553/companycredits?ref_=tt_dt_co" target="">Production companies</a>
            <div class="ipc-metadata-list-item__content-container">
              <ul class="ipc-inline-list ipc-inline-list--show-dividers ipc-inline-list--inline ipc-metadata-list-item__list-content base" role="presentation">
                <li role="presentation" class="ipc-inline-list__item">
                    <a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0136980?ref_=tt_dt_co_1">IDT Entertainment</a>
                </li>
                <li role="presentation" class="ipc-inline-list__item">
                    <a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0142161?ref_=tt_dt_co_2">New Arc Entertainment</a>
                </li>
              </ul>
            </div>
          </li>
        </ul>
      </div>
    </section>
    """
    
    soup=BeautifulSoup(page,"lxml")
    
    # This is the structure of the data you want to extract:
    # <li role="presentation" class="ipc-metadata-list__item ipc-metadata-list-item--link" data-testid="title-details-companies">
        # <ul class="ipc-inline-list ipc-inline-list--show-dividers ipc-inline-list--inline ipc-metadata-list-item__list-content base" role="presentation">
            # <a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0136980?ref_=tt_dt_co_1">IDT Entertainment</a>
            # <a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0142161?ref_=tt_dt_co_2">New Arc Entertainment</a>
    
    print([a.text for a in soup.find("li",attrs={'class':r'ipc-metadata-list__item ipc-metadata-list-item--link','data-testid':r'title-details-companies'})
                                    .find("ul",class_="ipc-inline-list ipc-inline-list--show-dividers ipc-inline-list--inline ipc-metadata-list-item__list-content base")
                                        .find_all("a")])
    

    Output:

    ['IDT Entertainment', 'New Arc Entertainment']
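
    The chained `.find(...)` calls above raise an `AttributeError` if any step finds nothing, because `find` returns `None` in that case. Below is a minimal sketch of the same lookup wrapped in a function that returns an empty list instead. The function name `production_companies` and the trimmed-down `sample` markup are assumptions made for illustration; only the `data-testid` value and the company links come from the question.

```python
from bs4 import BeautifulSoup


def production_companies(html):
    """Return the company names found in the 'title-details-companies' item."""
    soup = BeautifulSoup(html, "html.parser")
    # find() returns None when nothing matches, so check before chaining.
    item = soup.find("li", attrs={"data-testid": "title-details-companies"})
    if item is None:
        return []
    # The company links sit in the nested <ul>; the "Production companies"
    # label link is outside it and is therefore skipped.
    inner_list = item.find("ul")
    if inner_list is None:
        return []
    return [a.get_text(strip=True) for a in inner_list.find_all("a")]


# Trimmed-down version of the markup from the question (hypothetical sample).
sample = """
<li data-testid="title-details-companies">
  <a href="/title/tt0473553/companycredits?ref_=tt_dt_co">Production companies</a>
  <div>
    <ul>
      <li><a href="/company/co0136980?ref_=tt_dt_co_1">IDT Entertainment</a></li>
      <li><a href="/company/co0142161?ref_=tt_dt_co_2">New Arc Entertainment</a></li>
    </ul>
  </div>
</li>
"""

print(production_companies(sample))
print(production_companies("<p>no details section here</p>"))
```

    Matching on `data-testid` alone, rather than on the full class string, is a design choice here: generated class names may change between site builds, so the attribute is likely the steadier hook.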
    

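    Several <a> tags on the page share that class, so searching by class alone returns all of them, and printing the result shows the tags themselves rather than their text. Restrict the search to the <li> with data-testid="title-details-companies" and read .text from each link inside it.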
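
To illustrate the difference between a class-only search and a scoped one, here is a small sketch. The second list item (`title-details-origin`, "United States") is invented for the demonstration and is not taken from the question; it only stands in for any other details row whose links carry the same class.

```python
from bs4 import BeautifulSoup

LINK_CLASSES = ("ipc-metadata-list-item__list-content-item "
                "ipc-metadata-list-item__list-content-item--link")

# Two detail rows whose links share the same classes.
# The "title-details-origin" row is made up for this demo.
html = f"""
<ul>
  <li data-testid="title-details-companies">
    <ul>
      <li><a class="{LINK_CLASSES}" href="/company/co0136980">IDT Entertainment</a></li>
      <li><a class="{LINK_CLASSES}" href="/company/co0142161">New Arc Entertainment</a></li>
    </ul>
  </li>
  <li data-testid="title-details-origin">
    <ul>
      <li><a class="{LINK_CLASSES}" href="/search/title/">United States</a></li>
    </ul>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Class-only search: matches every link that carries the class.
by_class = [
    a.get_text(strip=True)
    for a in soup.find_all("a", class_="ipc-metadata-list-item__list-content-item--link")
]

# Scoped search: only links nested under the production-companies row.
scoped = [
    a.get_text(strip=True)
    for a in soup.select('li[data-testid="title-details-companies"] ul a')
]

print(by_class)  # ['IDT Entertainment', 'New Arc Entertainment', 'United States']
print(scoped)    # ['IDT Entertainment', 'New Arc Entertainment']
```

`select` takes a CSS selector, so the scoping and the link lookup fit in one expression; the chained `find` calls in the answer do the same job step by step.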