I need help understanding why my scraper only finds 25 movies instead of 250. My code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
url = "https://www.imdb.com/chart/top/?ref_=nv_mv_250"
response = requests.get(url, headers = headers)
html_doc = response.content
soup = BeautifulSoup(html_doc, "html.parser")
ls = soup.find_all("div", class_="sc-b189961a-0 hBZnfJ cli-children")
print(len(ls))
The result is 25. The page https://www.imdb.com/chart/top/?ref_=nv_mv_250 lists 250 movies, and I am using BeautifulSoup, so len(ls) should be 250. Please explain why this happens and how I can fix it. Thank you very much!
I would like to scrape the complete data from this page.
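The HTML that requests receives appears to contain only the first 25 movies as div elements; the rest are presumably added by JavaScript in the browser, which requests does not run. You can get the full list from the page's JSON-LD (JSON for Linking Data) script element instead. It holds a JSON object from which the required information can be easily extracted.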
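To make the key path concrete before touching the network, here is a minimal sketch that runs on a hand-written sample. The sample string, the helper name rows_from_json_ld, and the second entry's missing "duration" are assumptions made for illustration; only the keys itemListElement, item, name, url and duration come from this answer. Unlike plain indexing, the helper uses .get so that an entry lacking a field yields None rather than raising KeyError.

```python
import json

# Hand-written sample mirroring only the keys this answer relies on.
# The real payload has 250 entries and more fields per entry.
# The second entry deliberately omits "duration" to show the fallback.
SAMPLE_JSON_LD = """
{
  "itemListElement": [
    {"item": {"name": "The Shawshank Redemption",
              "url": "https://www.imdb.com/title/tt0111161/",
              "duration": "PT2H22M"}},
    {"item": {"name": "The Godfather",
              "url": "https://www.imdb.com/title/tt0068646/"}}
  ]
}
"""


def rows_from_json_ld(data, keys=("name", "url", "duration")):
    """Return one dict per list entry; a key missing from an entry becomes None."""
    rows = []
    for entry in data.get("itemListElement", []):
        item = entry.get("item", {})
        rows.append({k: item.get(k) for k in keys})
    return rows


rows = rows_from_json_ld(json.loads(SAMPLE_JSON_LD))
print(len(rows))
print(rows[1]["duration"])
```

Against the live page, the same key path is used: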
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) Chrome/126.0.0.0 Safari/537.36',
    'accept': 'text/html',
}
response = requests.get('https://www.imdb.com/chart/top/?ref_=nv_mv_250', headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

# The JSON-LD script element holds the whole chart as a single JSON object.
script = soup.select_one("script[type='application/ld+json']")
data = json.loads(script.text)

# Keep only the fields of interest from each list entry.
movies = []
for movie in data["itemListElement"]:
    movies.append({k: movie["item"][k] for k in ["name", "url", "duration"]})

movies = pd.DataFrame(movies)
print(movies)
Sample output:
                                             name                                     url  duration
0                        The Shawshank Redemption   https://www.imdb.com/title/tt0111161/   PT2H22M
1                                   The Godfather   https://www.imdb.com/title/tt0068646/   PT2H55M
2                                 The Dark Knight   https://www.imdb.com/title/tt0468569/   PT2H32M
3                           The Godfather Part II   https://www.imdb.com/title/tt0071562/   PT3H22M
4                                    12 Angry Men   https://www.imdb.com/title/tt0050083/   PT1H36M
..                                            ...                                     ...       ...
245                         It Happened One Night   https://www.imdb.com/title/tt0025316/   PT1H45M
246                                       Aladdin   https://www.imdb.com/title/tt0103639/   PT1H30M
247                                      Drishyam   https://www.imdb.com/title/tt4430212/   PT2H43M
248                            Dances with Wolves   https://www.imdb.com/title/tt0099348/    PT3H1M
249  Gekijôban Kimetsu no yaiba: Mugen Ressha hen  https://www.imdb.com/title/tt11032374/   PT1H57M

[250 rows x 3 columns]
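The duration values above (PT2H22M and so on) are ISO 8601 duration strings. If you would rather have plain minutes, something like the following sketch may help. The helper name duration_to_minutes is hypothetical, and the pattern assumes only the hour/minute forms seen in the sample output; anything else raises ValueError.

```python
import re

# Matches the hour/minute forms seen in the sample output, e.g. "PT2H22M" or "PT45M".
# Assumption: no days or seconds components appear in this data.
_DURATION = re.compile(r"PT(?:(\d+)H)?(?:(\d+)M)?")


def duration_to_minutes(value):
    """Convert a string such as 'PT2H22M' to a whole number of minutes."""
    match = _DURATION.fullmatch(value)
    if match is None or (match.group(1) is None and match.group(2) is None):
        raise ValueError(f"unsupported duration: {value!r}")
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes


print(duration_to_minutes("PT2H22M"))  # 142
print(duration_to_minutes("PT3H1M"))   # 181
```

With the DataFrame from the script above, this could be applied as movies["minutes"] = movies["duration"].map(duration_to_minutes).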