
BeautifulSoup Scraping Results not showing


I am playing around with BeautifulSoup to scrape data from websites, so I decided to scrape empireonline's website for its list of the 100 greatest movies of all time.

Here's the link to the webpage: https://www.empireonline.com/movies/features/best-movies-2/

I fetched the HTML from the site without problems and was able to parse it with BeautifulSoup. But when I tried to get the list of the 100 movie titles, I got an empty list. Here's the code I wrote:

import requests
from bs4 import BeautifulSoup

URL = "https://www.empireonline.com/movies/features/best-movies-2/"

response = requests.get(URL)
top100_webpage = response.text

soup = BeautifulSoup(top100_webpage, "html.parser")
movies = soup.find_all(name="h3", class_="jsx-4245974604")
print(movies)

When I ran the code, the result was an empty list. I changed the parser to lxml and then to html5lib, but I still got the same empty list.

How can I resolve this issue?


Solution

  • This happens because the HTML tags you are looking for (the movie titles) are not in the original HTML that you request; they are added later by JavaScript. You can confirm this by loading the page in Chrome with JavaScript turned off: the page appears without the film titles.

    An alternative for this specific page could be to get the movie titles out of the review links, since the review links all seem to end with the movie title.

    BTW, the SO question mentioned by @hedgehog in the question comments addresses the exact same problem. One of its answers gives another solution: using Selenium to actually run the JavaScript and generate the page as we see it in the browser.
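The review-link idea above could be sketched roughly as follows. This is only a minimal sketch under assumptions: the function names are hypothetical, and both the "/reviews/" path marker and the trailing "-review" suffix are guesses about how the review URLs look, so check them against the actual links on the page. The capitalization of the result is also a guess derived from the URL slug, not the site's own title text.

```python
from urllib.parse import urlparse


def title_from_review_link(href, suffix="-review"):
    """Guess a movie title from the last path segment of a review URL."""
    # Take the last non-empty path segment, e.g. "the-godfather-review".
    segments = [s for s in urlparse(href).path.split("/") if s]
    if not segments:
        return None
    slug = segments[-1]
    # Drop the assumed trailing marker so only the title words remain.
    if suffix and slug.endswith(suffix):
        slug = slug[: -len(suffix)]
    # Hyphens become spaces; the capitalization is only an approximation.
    return slug.replace("-", " ").title()


def titles_from_review_links(hrefs, marker="/reviews/"):
    """Keep links that look like review pages and derive titles from them."""
    titles = []
    for href in hrefs:
        if marker in href:
            title = title_from_review_link(href)
            # Skip empty results and repeated links, preserving page order.
            if title and title not in titles:
                titles.append(title)
    return titles


# With the soup object from the question, the hrefs could be collected as:
#   hrefs = [a["href"] for a in soup.find_all("a", href=True)]
#   print(titles_from_review_links(hrefs))
```

This only works if the review links are present in the HTML that requests receives. If they are also added by JavaScript, rendering the page in a real browser (for example with Selenium, as mentioned above) is still needed.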