python-3.x web-scraping lxml html

How to scrape an HTML page that seems to load more content while scrolling down, using Python lxml


I am scraping text from https://www.basketball-reference.com/players/p/parsoch01.html, but I cannot scrape the content in the lower part of the page, starting at the "Total" table. I want to get the numbers from the "Total" and "Advanced" tables, but the code returns nothing. It seems that the page loads additional information as the user scrolls down.

I ran the code below and succeeded in getting the data from the player's profile section and the "Per Game" table, but I cannot get any value from the "Total" table.

from lxml import html
import urllib.request

playerURL = urllib.request.urlopen("https://www.basketball-reference.com/players/p/parsoch01.html")
# Parse the downloaded bytes into an element tree that supports XPath.
playerPage = html.fromstring(playerURL.read())
# Use XPath to parse points per game.
ppg = playerPage.xpath('//tr[@id="per_game.2019"]//td[@data-stat="pts_per_g"]//text()')[0]  # succeeds in getting the value
total = playerPage.xpath('//tr[@id="totals.2019"]//td[@data-stat="fga"]//text()')  # I expect 182 to be returned, but nothing is returned.

Is there any way to get data from the lower part of this page?


Solution

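  • It's because the content you want to extract from that site is inside HTML comments in the page source. A parser exposes commented-out markup as a comment node rather than as elements, so neither BeautifulSoup selectors nor lxml XPath queries can match it. To get the result, you need to strip the comment markers first so that BeautifulSoup can access the tables. The following script does exactly that: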

    import requests
    from bs4 import BeautifulSoup
    
    URL = "https://www.basketball-reference.com/players/p/parsoch01.html"
    
    r = requests.get(URL).text
    #kick out the comment signs from html elements so that BeautifulSoup can access them
    comment = r.replace("-->", "").replace("<!--", "")
    soup = BeautifulSoup(comment,"lxml")
    total = soup.select_one("[id='totals.2019'] > [data-stat='fga']").text
    print(total)
    

    Output:

    182
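
Since the question asks about lxml, the same uncommenting idea can be sketched with lxml and XPath. This is only a minimal sketch: the helper names `uncomment` and `stat_for_row` are hypothetical, and `SAMPLE` is a made-up stand-in for the real page source (the `2.8` value is invented; only the `182` comes from the output above), so the snippet runs without a network request.

```python
from typing import Optional

from lxml import html

# Made-up stand-in for the page source: the "totals" table is wrapped
# in an HTML comment, the "per_game" table is not.
SAMPLE = """
<html><body>
<table id="per_game"><tbody>
<tr id="per_game.2019"><th>2018-19</th><td data-stat="pts_per_g">2.8</td></tr>
</tbody></table>
<div id="all_totals">
<!--
<table id="totals"><tbody>
<tr id="totals.2019"><th>2018-19</th><td data-stat="fga">182</td></tr>
</tbody></table>
-->
</div>
</body></html>
"""


def uncomment(page_source: str) -> str:
    """Remove HTML comment delimiters so commented-out markup is parsed as elements."""
    return page_source.replace("<!--", "").replace("-->", "")


def stat_for_row(page_source: str, row_id: str, stat: str) -> Optional[str]:
    """Return the text of the <td data-stat=...> cell in the row with the given id."""
    tree = html.fromstring(uncomment(page_source))
    # XPath variables ($row_id, $stat) are filled from the keyword arguments.
    values = tree.xpath(
        "//tr[@id=$row_id]/td[@data-stat=$stat]/text()",
        row_id=row_id,
        stat=stat,
    )
    return values[0] if values else None


if __name__ == "__main__":
    # Without uncommenting, the row is hidden inside a comment node.
    print(html.fromstring(SAMPLE).xpath('//tr[@id="totals.2019"]'))
    # After uncommenting, the cell is reachable.
    print(stat_for_row(SAMPLE, "totals.2019", "fga"))
```

To use it on the live page, the downloaded source (for example `requests.get(URL).text`) would be passed in place of `SAMPLE`. Stripping every comment marker is a blunt approach; it assumes the page contains no comments whose contents would break the markup once exposed.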