python web-scraping beautifulsoup python-requests

Why does finding an element by a specific class return an empty result?


I am working on a web scraping project using Python with the Requests and bs4 libraries.

I am trying to scrape the IPL web page to get the details of every match in every season.

Here is a snippet for reference.

Expected: the number of matching tags should be 60, because 60 matches were played. Actual: 0.

Actual Result Snippet

from flask import Flask, render_template, request,jsonify
from flask_cors import CORS,cross_origin
import requests
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen as uReq

# Main web page URL
ipl_url = "https://www.iplt20.com/matches/results/2008"
response = requests.get(ipl_url)
if response.status_code == 200:
    html_content = response.text
    soup = bs(html_content, 'html.parser')

    # HERE THE PROBLEM STARTS
    # The page parses fine with 'bs', but searching for
    # 'div', {'class': 'vn-shedule-desk col-100 floatLft'} returns an empty list
    match_center = soup.find_all('div', {'class': 'vn-shedule-desk col-100 floatLft'})
    print(len(match_center))  # ==> Expected: 60, Actual: 0
else:
    print(f'Failed to retrieve the web page. Status code: {response.status_code}')

Solution

  • As mentioned by @RobbyCornelissen, the content is loaded dynamically via JavaScript and is not present in the static response from the server. selenium is an option: it mimics a browser's behavior, interacts with the website the way a human would, and gets you the rendered results.

    However, selenium is not mandatory; there is also a solution that uses only python-requests.

    Use the JavaScript files that the page loads, which contain all the details, from the two URLs requested in the code below.

    Iterate over the results from the first one to get the seasons/competitions, then use each competition ID to load its matches.

    Finally, create a dataframe and analyse it for your needs.

    import requests
    import json
    import re
    import pandas as pd
    pd.set_option('display.max_columns', None)
    pd.set_option('display.float_format', str)
    pd.set_option('max_colwidth', 500)
    
    url_competitions = 'https://scores.iplt20.com/ipl/mc/competition.js'
    response = requests.get(url_competitions)
    data = response.text
    
    # Initialise up front so the loop and the DataFrame work even if parsing fails
    competitions = []
    matchsummary_data = []
    
    # The file is JSONP: a JSON object wrapped in an oncomptetion(...) call
    match = re.search(r'oncomptetion\((\{.*\})\)', data, re.DOTALL)
    if match:
        json_data = match.group(1)
        data_dict = json.loads(json_data).get('competition')
    
        if data_dict:
            competitions = [d.get('CompetitionID') for d in data_dict]
    
    # Load the match schedule for each competition (season)
    for c in competitions:
        response = requests.get(f'https://ipl-stats-sports-mechanic.s3.ap-south-1.amazonaws.com/ipl/feeds/{c}-matchschedule.js')
        data = response.text
        match = re.search(r'MatchSchedule\((\{.*\})\)', data, re.DOTALL)
        if match:
            json_data = match.group(1)
            # Fall back to an empty list if the key is missing
            data_dict = json.loads(json_data).get('Matchsummary') or []
            matchsummary_data.extend(data_dict)
    
    df = pd.DataFrame(matchsummary_data)
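
The two regex steps above do the same job: strip a JavaScript callback wrapper (JSONP) and parse the JSON inside it. As a rough sketch, that step could be pulled into one helper. The name `parse_jsonp` is made up for illustration, and the sample strings below are invented payloads that only imitate the shape of the feeds (the `MatchID` key in particular is an assumption, not taken from the real data), so the example runs without network access.

```python
import json
import re


def parse_jsonp(text, callback):
    """Return the JSON wrapped in `callback(...)`, or None if it cannot be parsed.

    Hypothetical helper: it assumes the payload is a single callback call,
    optionally followed by a semicolon.
    """
    pattern = re.escape(callback) + r'\s*\((.*)\)\s*;?\s*$'
    match = re.search(pattern, text.strip(), re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


# Invented sample payloads shaped like the feeds used above (not real data).
sample_competitions = 'oncomptetion({"competition": [{"CompetitionID": 1}, {"CompetitionID": 2}]});'
sample_schedule = 'MatchSchedule({"Matchsummary": [{"MatchID": 10}]});'

competitions = parse_jsonp(sample_competitions, 'oncomptetion')
competition_ids = [c.get('CompetitionID') for c in competitions.get('competition', [])]

schedule = parse_jsonp(sample_schedule, 'MatchSchedule')
matches = schedule.get('Matchsummary') or []

print(competition_ids)
print(matches)
```

With real responses you would pass `response.text` instead of the sample strings. Returning `None` on a non-matching or malformed payload lets the caller skip a competition instead of crashing partway through the loop.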