
New York Times news scraping using pure Python and Selenium (via rpaframework)


I'm trying to scrape New York Times search results using pure Python and Selenium (via rpaframework), but I can't get it to work. I need to get the title, date, and description of each result.

When I print the title, I get the error below. My code so far follows the error message.

selenium.common.exceptions.InvalidArgumentException: Message: unknown variant //h4[@class='css-2fgx4k'], expected one of css selector, link text, partial link text, tag name, xpath at line 1 column 37

from RPA.Browser.Selenium import Selenium

# Search term
search_term = "climate change"

# Open the NY Times search page and search for the term
browser = Selenium()
browser.open_available_browser("https://www.nytimes.com/search?query=" + search_term)

# Find all the search result articles
articles = browser.find_elements("//ol[@data-testid='search-results']/li")


# Extract title, date, and description for each article and add to the list
for article in articles:
    # Extract the title
    title = article.find_element("//h4[@class='css-2fgx4k']")
    print(title)


# Close the browser window
browser.close_all_browsers()

Any assistance would be appreciated.


Solution

  • Full disclosure: I'm the author of the Browserist package. Browserist is a lightweight, less verbose extension of the Selenium web driver that makes browser automation easier. Install the package with pip install browserist and try this:

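    from browserist import Browser
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By
    
    search_term = "climate"
    
    with Browser() as browser:
        browser.open.url("https://www.nytimes.com/search?query=" + search_term)
        # Each <li> under the results list is one search result
        search_result_elements = browser.get.elements("//ol[@data-testid='search-results']/li")
        for element in search_result_elements:
            try:
                # Look for the title inside the current result only
                title = element.find_element(By.TAG_NAME, "h4").text
                print(title)
            except NoSuchElementException:
                # Skip list items that have no <h4> title
                pass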
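
The error in the question appears to come from calling find_element on a raw Selenium WebElement with only an XPath. That method takes the locator strategy first and the selector second, so the XPath was read as the strategy, which is what "unknown variant" complains about. Here is a minimal sketch of the same lookup in XPath form. The helper name extract_title is hypothetical, and it assumes each article is a Selenium WebElement, as in the loop above.

```python
# By.XPATH from selenium.webdriver.common.by is the plain string "xpath",
# one of the strategies listed in the error message.
XPATH = "xpath"


def extract_title(article):
    """Return the text of the first <h4> inside one search-result element.

    `article` is assumed to be a Selenium WebElement (hypothetical helper,
    not part of rpaframework or Browserist).
    """
    # The leading "." keeps the search inside this article; a bare "//h4"
    # would be evaluated from the document root instead.
    return article.find_element(XPATH, ".//h4").text
```

Inside the question's loop this would be used as print(extract_title(article)). Matching on the tag rather than on a generated class name such as css-2fgx4k should also be less brittle if the site's styles change.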
    

    Notes: to run in a different browser, for example Firefox, pass a BrowserSettings object:

    from browserist import Browser, BrowserType, BrowserSettings
    
    ...
    
    with Browser(BrowserSettings(type=BrowserType.FIREFOX)) as browser:
    

    Here's what I get, and I hope you find it useful. Let me know if you have any questions.

    [Image: titles printed in the terminal]

    [Image: results from NY Times]
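
One more detail from the question: the search term "climate change" contains a space, and concatenating it straight into the URL leaves the encoding up to the browser. Below is a small sketch of building the query string explicitly with the standard library. The function name build_search_url is hypothetical, and the query parameter name is taken from the URL used in the question.

```python
from urllib.parse import urlencode

# Base search URL as used in the question
SEARCH_URL = "https://www.nytimes.com/search"


def build_search_url(search_term):
    """Return the search URL with the term safely URL-encoded."""
    # urlencode escapes spaces and reserved characters such as "&"
    return SEARCH_URL + "?" + urlencode({"query": search_term})
```

For example, build_search_url("climate change") gives https://www.nytimes.com/search?query=climate+change, which can be passed to browser.open.url or open_available_browser unchanged.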