I was working on a web scraping project using Python with the Requests and bs4 libraries.
I was trying to scrape the IPL results page, where I want to get the details of every match in every season.
Attached is a snippet for your reference.
Expected: the tag list length should be 60 because 60 matches were played! Actual: 0
import requests
from bs4 import BeautifulSoup as bs

# Main web page URL
ipl_url = "https://www.iplt20.com/matches/results/2008"
response = requests.get(ipl_url)

if response.status_code == 200:
    html_content = response.text
    soup = bs(html_content, 'html.parser')
else:
    print(f'Failed to retrieve the web page. Status code: {response.status_code}')

# HERE THE PROBLEM STARTS
match_center = soup.find_all('div', {'class': 'vn-shedule-desk col-100 floatLft'})
len(match_center)  # ==> Expected: 60, Actual: 0
# I parsed the HTML with 'bs', but when I try to find
# 'div' with {'class': 'vn-shedule-desk col-100 floatLft'} I get an empty list
As mentioned by @RobbyCornelissen, the content is loaded dynamically via JavaScript and is not present in the static response from the server. selenium is an option to mimic a browser's behavior, interact with the website like a human would, and get your results.

But selenium is not mandatory here; there is also a solution using python-requests.

Use the JavaScript files that the page loads, which contain all the details. Iterate the results from the first one to get the seasons/competitions, then use those IDs to load the matches for each. Finally, create a dataframe and analyse it for your needs.
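These feeds are JSONP: the JSON payload is wrapped in a callback call, so the wrapper has to be stripped before `json.loads` can parse it. A minimal sketch of just that unwrapping step, using a made-up payload (the field names here are illustrative, not taken from the real feed):

```python
import json
import re

# hypothetical feed content mimicking the JSONP shape of the real files
data = 'MatchSchedule({"Matchsummary": [{"MatchID": 1, "MatchName": "Team A vs Team B"}]})'

# capture the JSON object between the callback's parentheses
match = re.search(r'MatchSchedule\((\{.*\})\)', data, re.DOTALL)
if match:
    payload = json.loads(match.group(1))
    print(payload['Matchsummary'][0]['MatchName'])  # Team A vs Team B
```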
import requests
import json
import re
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', str)
pd.set_option('max_colwidth', 500)

# JSONP feed listing all competitions/seasons
url_competitions = 'https://scores.iplt20.com/ipl/mc/competition.js'

response = requests.get(url_competitions)
data = response.text

# strip the JSONP callback ('oncomptetion' is spelled this way in the feed itself)
match = re.search(r'oncomptetion\((\{.*\})\)', data, re.DOTALL)

if match:
    json_data = match.group(1)
    data_dict = json.loads(json_data).get('competition')

if data_dict:
    competitions = [d.get('CompetitionID') for d in data_dict]

matchsummary_data = []

for c in competitions:
    response = requests.get(f'https://ipl-stats-sports-mechanic.s3.ap-south-1.amazonaws.com/ipl/feeds/{c}-matchschedule.js')
    data = response.text
    # strip the 'MatchSchedule(...)' JSONP wrapper
    match = re.search(r'MatchSchedule\((\{.*\})\)', data, re.DOTALL)
    if match:
        json_data = match.group(1)
        data_dict = json.loads(json_data).get('Matchsummary')
        matchsummary_data.extend(data_dict)

df = pd.DataFrame(matchsummary_data)
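Once the dataframe is built you can slice it per season, e.g. to count matches per competition. A sketch on sample rows, assuming the records contain `CompetitionID` and `MatchID` keys (the exact columns depend on what the feed returns):

```python
import pandas as pd

# sample rows shaped like the feed's 'Matchsummary' records (columns assumed)
matchsummary_data = [
    {'CompetitionID': 8, 'MatchID': 1, 'MatchName': 'Team A vs Team B'},
    {'CompetitionID': 8, 'MatchID': 2, 'MatchName': 'Team B vs Team C'},
    {'CompetitionID': 9, 'MatchID': 3, 'MatchName': 'Team A vs Team C'},
]
df = pd.DataFrame(matchsummary_data)

# number of matches per season/competition
matches_per_season = df.groupby('CompetitionID')['MatchID'].count()
print(matches_per_season)
```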