[SOLVED] ESPN Gamecast Python Webscraping

I am having trouble scraping ESPN Gamecast links from the ESPN scoreboard page. I have tried:

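import requests
from bs4 import BeautifulSoup

site = "https://www.espn.com/mlb/scoreboard"
html = requests.get(site).text

# Collect the href of every <a> tag found in the static HTML
soup = BeautifulSoup(html, 'html.parser').find_all('a')
links = [link.get('href') for link in soup]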

but the Gamecast links do not show up in the result.

Solution

The page content is loaded dynamically, so you need to either a) use something like Selenium, which lets the page render before you parse it with bs4, or b) go straight to the data source/API. The API is often the best option:

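import requests

# ESPN's JSON scoreboard endpoint (the data source behind the scoreboard page)
api = 'http://site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard'

jsonData = requests.get(api).json()
events = jsonData['events']

# Each event carries a list of links; keep only the ones labelled 'Gamecast'
links = []
for event in events:
    event_links = event['links']
    for each in event_links:
        if each['text'] == 'Gamecast':
            links.append(each['href'])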
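
The same filtering can also be written as a small helper that skips events or links with missing keys instead of raising a KeyError. This is only a sketch: the name extract_gamecast_links and the sample payload are made up for illustration, and the payload copies just the events -> links -> text/href nesting used in the loop above, not the full API response.

```python
def extract_gamecast_links(scoreboard):
    """Return the href of every link labelled 'Gamecast' in a scoreboard payload."""
    return [
        link["href"]
        for event in scoreboard.get("events", [])
        for link in event.get("links", [])
        if link.get("text") == "Gamecast" and "href" in link
    ]


# Hypothetical payload: only the nesting mirrors the API response, and the
# 'Box Score' entry is invented to show that other link types are ignored.
sample = {
    "events": [
        {
            "links": [
                {"text": "Gamecast", "href": "http://www.espn.com/mlb/game/_/gameId/401228229"},
                {"text": "Box Score", "href": "http://www.espn.com/mlb/boxscore/_/gameId/401228229"},
            ]
        },
        {"name": "event without a links list"},
    ]
}

print(extract_gamecast_links(sample))
# ['http://www.espn.com/mlb/game/_/gameId/401228229']
```

With live data it would be called as links = extract_gamecast_links(requests.get(api).json()).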

Output:

print(links)
['http://www.espn.com/mlb/game/_/gameId/401228229', 'http://www.espn.com/mlb/game/_/gameId/401228235', 'http://www.espn.com/mlb/game/_/gameId/401228242', 'http://www.espn.com/mlb/game/_/gameId/401228240', 'http://www.espn.com/mlb/game/_/gameId/401228233', 'http://www.espn.com/mlb/game/_/gameId/401228234', 'http://www.espn.com/mlb/game/_/gameId/401228239', 'http://www.espn.com/mlb/game/_/gameId/401228237', 'http://www.espn.com/mlb/game/_/gameId/401228231', 'http://www.espn.com/mlb/game/_/gameId/401228232', 'http://www.espn.com/mlb/game/_/gameId/401228236', 'http://www.espn.com/mlb/game/_/gameId/401228230', 'http://www.espn.com/mlb/game/_/gameId/401228238', 'http://www.espn.com/mlb/game/_/gameId/401228243', 'http://www.espn.com/mlb/game/_/gameId/401228241']
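
For completeness, option a) could look roughly like the sketch below. It assumes Selenium 4 with a local Chrome install, and it assumes the rendered page links to games using the same /game/_/gameId/ path seen in the output above; neither is confirmed by the answer. The names rendered_gamecast_links and is_gamecast_href are hypothetical.

```python
from urllib.parse import urljoin

# Assumption: Gamecast URLs contain this path, as in the output above.
GAME_PATH = "/game/_/gameId/"


def is_gamecast_href(href):
    """True if href looks like a Gamecast URL (an assumption about the URL shape)."""
    return bool(href) and GAME_PATH in href


def rendered_gamecast_links(url, timeout=15):
    """Render url in headless Chrome, then parse the resulting HTML with bs4."""
    # Third-party imports live here so the helper above works without a browser.
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until at least one game link has been added by the page's JavaScript.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "a[href*='gameId']"))
        )
        html = driver.page_source
    finally:
        driver.quit()

    soup = BeautifulSoup(html, "html.parser")
    # Resolve relative hrefs against the page URL before filtering.
    hrefs = (urljoin(url, a["href"]) for a in soup.find_all("a", href=True))
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(h for h in hrefs if is_gamecast_href(h)))
```

Calling rendered_gamecast_links("https://www.espn.com/mlb/scoreboard") starts a real browser, so it is slower and more fragile than the API request, which is why the API route is usually preferable.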