web-scraping beautifulsoup css-selectors yahoo-finance

Yahoo Finance BeautifulSoup artifact parsing


The following correct and useful answer was provided to the question "How to filter on this artifact in the HTML?":

from bs4 import BeautifulSoup
import requests  
page = requests.get("https://finance.yahoo.com/quote/GOOGL?p=GOOGL")  
soup = BeautifulSoup(page.content, 'html.parser')  
soup.select_one('fin-streamer[data-symbol="GOOGL"]')['value']

I find the same fin-streamer[data-symbol="GOOGL"] element on https://finance.yahoo.com/quote/GOOGL/key-statistics, but the code adapted for that page (below) does not work:

from bs4 import BeautifulSoup
import requests
page = requests.get("https://finance.yahoo.com/quote/GOOGL/key-statistics")
soup = BeautifulSoup(page.content, 'html.parser')
soup.select_one('fin-streamer[data-symbol="GOOGL"]')['value']

I get this:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-3b1c0b9e0480> in <module>
      3 page = requests.get("https://finance.yahoo.com/quote/GOOGL/key-statistics")
      4 soup = BeautifulSoup(page.content, 'html.parser')
----> 5 soup.select_one('fin-streamer[data-symbol="GOOGL"]')['value']

TypeError: 'NoneType' object is not subscriptable

Could you help me find out why?

The reason for my request is that I want to avoid loading and parsing the quote page, which I do not otherwise need, and work only with the key statistics page.


Solution

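  • First of all, always take a look at your soup to check that the expected elements are actually in it. Here select_one() found no match and returned None, which is why subscripting the result with ['value'] raised the TypeError.

    You have to add a user-agent to your request headers to get the right source back from the server: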

    page = requests.get("https://finance.yahoo.com/quote/GOOGL/key-statistics", headers={'user-agent':'some-agent'})
    

    Example

    from bs4 import BeautifulSoup
    import requests
    page = requests.get("https://finance.yahoo.com/quote/GOOGL/key-statistics", headers={'user-agent':'some-agent'})
    soup = BeautifulSoup(page.content, 'html.parser')
    soup.select_one('fin-streamer[data-symbol="GOOGL"]')['value']
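
The advice to check the soup first can be turned into a small guard, so that a missing element yields None instead of a TypeError. The following is only a minimal sketch: the helper names extract_value and fetch_value, the HEADERS constant, and the timeout value are assumptions made for illustration, and 'some-agent' is the same placeholder used above.

```python
from typing import Optional

import requests
from bs4 import BeautifulSoup

# Placeholder header from the answer above; substitute any suitable user-agent string.
HEADERS = {"user-agent": "some-agent"}


def extract_value(html: str, symbol: str) -> Optional[str]:
    """Return the 'value' attribute of the first matching fin-streamer tag, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one(f'fin-streamer[data-symbol="{symbol}"]')
    if tag is None:
        # The element is not in the HTML that was received,
        # so report "not found" instead of subscripting None.
        return None
    return tag.get("value")


def fetch_value(url: str, symbol: str) -> Optional[str]:
    """Request the page with a user-agent header and extract the value (hypothetical helper)."""
    page = requests.get(url, headers=HEADERS, timeout=10)
    page.raise_for_status()  # surface HTTP errors early
    return extract_value(page.text, symbol)
```

Calling fetch_value("https://finance.yahoo.com/quote/GOOGL/key-statistics", "GOOGL") would then return either the value string or None. A None result is a hint to print the soup and inspect what the server actually sent back.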