I am trying to get the HTML from CNN for a personal project. I am new to the requests library. I have followed basic tutorials to fetch CNN's HTML with requests, but the response keeps differing from the HTML I see when I inspect the page in my browser. Here is my code:
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.cnn.com/'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())
I am trying to get article titles from CNN, but this is my first issue. Thanks for the help!
Update: It seems that I know even less than I had initially assumed. My real question is: how do I extract titles from the CNN homepage? I've tried both answers, but the HTML from requests does not contain title information. How can I get the title information shown in this picture? (Screenshot of my browser: a CNN article title with its accompanying HTML side by side.)
You can use Selenium ChromeDriver to scrape https://cnn.com.
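import bs4 as bs
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
# Selenium 4 style: pass the driver path via Service
# (older 3.x releases took it as the first positional argument).
driver = webdriver.Chrome(service=Service("---CHROMEDRIVER-PATH---"), options=chrome_options)
driver.get('https://cnn.com/')
# Parse the page source returned by the browser.
soup = bs.BeautifulSoup(driver.page_source, 'lxml')
# Get titles from the HTML.
titles = soup.find_all('span', {'class': 'cd__headline-text'})
print(titles)
# quit() closes every window and ends the ChromeDriver session.
driver.quit()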
Output:
[<span class="cd__headline-text"><strong>The West turned Aung San Suu Kyi into a saint. She was always going to disappoint </strong></span>, <span class="cd__headline-text"><strong>In Hindu-nationalist India, Muslims risk being branded infiltrators</strong></span>, <span class="cd__headline-text">Johnson may have stormed to victory, but he's got a problem</span>, <span class="cd__headline-text">Impeachment heads to full House after historic vote</span>, <span class="cd__headline-text">Supreme Court to decide on Trump's financial records</span>, <span class="cd__headline-text">Michelle Obama's message for Thunberg after Trump mocks her</span>, <span class="cd__headline-text">Actor Danny Aiello dies at 86</span>, <span class="cd__headline-text">The biggest risk at the North Pole isn't what you think</span>, <span class="cd__headline-text">US city declares state of emergency after cyberattack </span>, <span class="cd__headline-text">Reality TV show host arrested</span>, <span class="cd__headline-text">Big names in 2019 you may have mispronounced</span>, <span class="cd__headline-text"><strong>Morocco has Africa's 'first fully solar village'</strong></span>]
You can download ChromeDriver from here.
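The output above is a list of Tag objects rather than plain strings. If you only want the headline text, something like the following may work. It is a minimal sketch, assuming the headlines still sit in span elements with the cd__headline-text class (CNN's markup may have changed since). The SAMPLE_HTML string and the extract_headlines helper are made up for illustration; in practice you would pass driver.page_source instead.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, shaped like the output shown above.
# A live page will differ; pass driver.page_source in real use.
SAMPLE_HTML = (
    '<span class="cd__headline-text"><strong>In Hindu-nationalist India, '
    'Muslims risk being branded infiltrators</strong></span>'
    '<span class="cd__headline-text">Actor Danny Aiello dies at 86</span>'
    '<span class="cd__headline-text">US city declares state of emergency '
    'after cyberattack </span>'
)


def extract_headlines(html):
    """Return the plain text of every headline span found in the HTML."""
    # "html.parser" ships with Python, so lxml is not needed here.
    soup = BeautifulSoup(html, "html.parser")
    spans = soup.find_all("span", {"class": "cd__headline-text"})
    # get_text(strip=True) drops nested tags such as <strong>
    # and trims surrounding whitespace.
    return [span.get_text(strip=True) for span in spans]


headlines = extract_headlines(SAMPLE_HTML)
for headline in headlines:
    print(headline)
```

Keeping the parsing in a small function means it can be checked against saved HTML without starting a browser each time.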