python, selenium, web-scraping

How to use Selenium+BeautifulSoup to get data from dynamically created elements


This is my first question on Stack Overflow. I am trying to scrape fxstreet.com/news. It seems that the news feed produces its articles dynamically, so BeautifulSoup alone cannot gather that information, and I have decided to use Selenium. However, I am having trouble using Selenium to access the articles that are displayed.

import requests
from bs4 import BeautifulSoup
import re
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('https://www.fxstreet.com/news?q=&hPP=17&idx=FxsIndexPro&p=0')

# this is where I get stuck: find_element_by_link_text() doesn't find the article links
article = driver.find_element_by_link_text('/news')
for post in article:
    print(post.text)

I would like to make a scraper that checks periodically for new articles; these articles have URLs of the form https://www.fxstreet.com/news...(endpoint).

However, when I look up hrefs / 'a' tags, I get many links from across the website, but none of them are the news articles featured in the live feed. When I look up every single 'div', I get the whole HTML laid out for me:

<article class="fxs_entriesList_article_with_image ">
    <h3 class="fxs_entryHeadline">
        <a href="https://www.fxstreet.com/news/gbp-usd-upside-potential-limited-in-covid-19-uncertainties-202004021808" title="GBP/USD upside potential limited in COVID-19 uncertainties">GBP/USD upside potential limited in COVID-19 uncertainties</a>
    </h3>
    <address class="fxs_entry_metaInfo">
        <span class="fxs_article_author">
            By <a href="/author/ross-j-burland" rel="nofollow">Ross J Burland</a>
        </span> | <time pubdate="" datetime="">18:08 GMT</time>
    </address>
</article>

which tells me that the article exists somewhere, but that I am unable to interact with it. So how do I access the links I need when Selenium cannot find them by 'a' tag or by partial link text? I have also tried to look for the link directly using:

elem = driver.find_elements_by_partial_link_text("news")

for element in elem:
    print(element.get_attribute("innerHTML"))
To no avail. I have also tried adding explicit and implicit waits. Thanks.


Solution

  • Use the CSS selector below to get all the news-related links.

    h4.fxs_headline_tiny a
    

    Additional imports needed for the explicit wait:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

    Your code should then look like this:

    url = "https://www.fxstreet.com/news?q=&hPP=17&idx=FxsIndexPro&p=0"
    driver.get(url)
    # wait up to 120 seconds for the dynamically rendered headlines to appear
    WebDriverWait(driver, 120).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "h4.fxs_headline_tiny a"))
    )
    # collect every headline link and print its URL
    news_elems = driver.find_elements_by_css_selector("h4.fxs_headline_tiny a")
    for ele in news_elems:
        print(ele.get_attribute('href'))
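
    If you also want the periodic check for new articles that the question mentions, a minimal sketch along the lines below should work. It is only an illustration: the 60-second polling interval and the in-memory seen set are arbitrary choices, and BeautifulSoup is used purely to parse the already-rendered driver.page_source with the same h4.fxs_headline_tiny a selector.

    import time
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from webdriver_manager.chrome import ChromeDriverManager

    url = "https://www.fxstreet.com/news?q=&hPP=17&idx=FxsIndexPro&p=0"
    driver = webdriver.Chrome(ChromeDriverManager().install())

    seen = set()  # URLs already reported (kept in memory only, for illustration)

    while True:
        driver.get(url)
        # wait until the dynamically rendered headlines are present
        WebDriverWait(driver, 120).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "h4.fxs_headline_tiny a"))
        )
        # hand the rendered DOM to BeautifulSoup and pull the headline links
        soup = BeautifulSoup(driver.page_source, "html.parser")
        for a in soup.select("h4.fxs_headline_tiny a"):
            href = a.get("href")
            if href and href not in seen:
                seen.add(href)
                print(a.get_text(strip=True), href)
        time.sleep(60)  # polling interval is arbitrary; adjust as needed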