I'm trying to scrape New York Times search results using pure Python and Selenium (via rpaframework), but I'm not getting it right. I need to get the title, date, and description of each result. My code so far is shown below the error message.
When I print the title, I get this error:
selenium.common.exceptions.InvalidArgumentException: Message: unknown variant //h4[@class='css-2fgx4k'], expected one of css selector, link text, partial link text, tag name, xpath at line 1 column 37
from RPA.Browser.Selenium import Selenium
# Search term
search_term = "climate change"
# Open the NY Times search page and search for the term
browser = Selenium()
browser.open_available_browser("https://www.nytimes.com/search?query=" + search_term)
# Find all the search result articles
articles = browser.find_elements("//ol[@data-testid='search-results']/li")
# Extract title, date, and description for each article and add to the list
for article in articles:
    # Extract the title
    title = article.find_element("//h4[@class='css-2fgx4k']")
    print(title)
# Close the browser window
browser.close_all_browsers()
Any assistance would be appreciated.
In full disclosure, I'm the author of the Browserist package. Browserist is a lightweight, less verbose extension of the Selenium web driver that makes browser automation even easier. Simply install the package with pip install browserist and try this:
from browserist import Browser
from selenium.webdriver.common.by import By

search_term = "climate"

with Browser() as browser:
    browser.open.url("https://www.nytimes.com/search?query=" + search_term)
    search_result_elements = browser.get.elements("//ol[@data-testid='search-results']/li")
    for element in search_result_elements:
        try:
            # Look up the title by tag name, relative to this search result
            title = element.find_element(By.TAG_NAME, "h4").text
            print(title)
        except Exception:
            # Skip list items where the h4 lookup fails
            pass
Notes:
- The search term climate will yield more, yet still relevant, results, e.g. climate crisis, but that's up to you to change.
- I select the h4 header tag instead of the CSS token value, which might change over time.
- The title lookup is wrapped in a try and except clause, so list items where the lookup fails are skipped.
- To pick a specific browser, e.g. Firefox:
from browserist import Browser, BrowserType, BrowserSettings
...
with Browser(BrowserSettings(type=BrowserType.FIREFOX)) as browser:
Here's what I get, and I hope you find it useful. Let me know if you have any questions.
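For comparison, here is a minimal sketch of the same idea with plain Selenium rather than Browserist. The error in the question lists the locator strategy names, which suggests the XPath string ended up where the strategy was expected: a Selenium WebElement's find_element takes the strategy first and the selector second. The helper names text_or_none, extract_titles, and scrape_titles are made up for illustration, and the sketch assumes Selenium 4 with a local Chrome install; the NYT markup may of course change.

```python
def text_or_none(parent, by, value):
    """Return the stripped text of the first matching child, or None if the lookup fails."""
    try:
        # Two arguments: locator strategy first, then the selector.
        return parent.find_element(by, value).text.strip()
    except Exception:
        # Broad on purpose for this sketch; Selenium raises
        # NoSuchElementException when no child matches.
        return None


def extract_titles(items, by="tag name", value="h4"):
    """Collect non-empty titles from result items, skipping items without a match.

    "tag name" is the string value behind Selenium's By.TAG_NAME.
    """
    titles = []
    for item in items:
        title = text_or_none(item, by, value)
        if title:
            titles.append(title)
    return titles


def scrape_titles(search_term="climate"):
    """Open the NYT search page and return the result titles (needs Selenium and Chrome)."""
    from urllib.parse import quote_plus

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get("https://www.nytimes.com/search?query=" + quote_plus(search_term))
        items = driver.find_elements(By.XPATH, "//ol[@data-testid='search-results']/li")
        return extract_titles(items, By.TAG_NAME, "h4")
    finally:
        # Always close the browser, even if the page lookup fails.
        driver.quit()
```

Calling scrape_titles("climate change") runs it against the live site. The lookup helpers are kept separate from the browser setup so they can be tried on any object that offers find_element(by, value).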