I am learning how to scrape websites using the module Beautiful Soup 4. I am trying to scrape a cricket league table and so far have used the following code.
#We want to scrape the cricinfo website for the league table
import requests
from bs4 import BeautifulSoup as bs
r = requests.get("https://www.espncricinfo.com/table/series/8048/season/2020/indian-premier-league")
soup = bs(r.content, "html.parser")  # name the parser explicitly to avoid a bs4 warning
headers = soup.find_all('h5')
print(headers)
This is the output I get when I run the code:
[<h5 class="header-title label ">Indian Premier League 2020</h5>,
<h5 class="header-title label ">Mumbai Indians</h5>,
<h5 class="header-title label ">Royal Challengers Bangalore</h5>,
<h5 class="header-title label ">Delhi Capitals</h5>,
<h5 class="header-title label ">Sunrisers Hyderabad</h5>,
<h5 class="header-title label ">Kings XI Punjab</h5>,
<h5 class="header-title label ">Rajasthan Royals</h5>,
<h5 class="header-title label ">Kolkata Knight Riders</h5>,
<h5 class="header-title label ">Chennai Super Kings</h5>,
<h5 class="gray600">Standings are updated with the completion of each game</h5>]
What I would like to do now is take this further: get a list containing only the team names, without the first and last lines.
E.g. I would like the final list to be something like:
teams = ['Mumbai Indians', 'Royal Challengers Bangalore', 'Delhi Capitals', 'Sunrisers Hyderabad', 'Kings XI Punjab', 'Rajasthan Royals', 'Kolkata Knight Riders', 'Chennai Super Kings']
Any help would be greatly appreciated. Thank you.
You can use .string to get the text content of an HTML element, and slice the list with [1:-1] to drop the first and last headers. Try this:
teams = [header.string for header in headers[1:-1]]
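If you want to try this without hitting the site, here is a minimal sketch that runs against a small inline HTML sample. SAMPLE_HTML and extract_teams are hypothetical names made up for illustration, and the sample is a shortened copy of the output in the question, not the real page. It uses get_text(strip=True) instead of .string, since .string returns None when a tag has more than one child, and it drops the first and last headers as the question asks.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a shortened copy of the <h5> tags shown in the question.
SAMPLE_HTML = """
<h5 class="header-title label ">Indian Premier League 2020</h5>
<h5 class="header-title label ">Mumbai Indians</h5>
<h5 class="header-title label ">Royal Challengers Bangalore</h5>
<h5 class="gray600">Standings are updated with the completion of each game</h5>
"""


def extract_teams(html):
    """Return the text of every <h5> except the first and the last."""
    soup = BeautifulSoup(html, "html.parser")
    headers = soup.find_all("h5")
    # Assumes the first header is the series title and the last is the
    # footnote, as in the question's output.
    return [h.get_text(strip=True) for h in headers[1:-1]]


teams = extract_teams(SAMPLE_HTML)
print(teams)  # ['Mumbai Indians', 'Royal Challengers Bangalore']
```

The slice depends on the page keeping the title first and the footnote last, so it is worth checking the result if the site's layout changes.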