Okay, I thought I was going crazy because I kept failing at this, but maybe something is happening with the HTML that I don't understand.
I have been trying to scrape the 'articles' from cnn.com.
But no matter what I tried, soup.find_all('articles'), soup.find('body').div('div'), etc., with class tags, ids, and so on, it FAILS.
I found this reference: Webscraping from React web application after componentDidMount.
I suspect that content being injected into the HTML is why I am having issues.
I know nothing about injection other than 'HTML injection attacks' from cybersecurity reading.
I want the articles, but I assume I will need a tactic similar to the one in the Stack Overflow question linked above, and I do not know how. Links to help documents, or anything specific to scraping CNN, would be appreciated.
Alternatively, does someone know how I could get the 'full data' of the HTML body element, so that I could do some rearranging early in the function below and then just reassign body?
Or just tell me I'm an idiot and on the wrong track.
import requests
from bs4 import BeautifulSoup

def build_art_d(site):
    url = site
    main_l = len(url)
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'lxml')
    print(soup.prettify())
    art_dict = {}
    body = soup.find('body')
    print(body.prettify())
    div1 = body.find('div', {'class': 'pg-no-rail pg-wrapper'})
    section = div1.find('section', {'id': 'homepage1-zone-1'})
    div2 = section.find('div', {'class': 'l-container'})
    div3 = div2.find('div', {'class': 'zn__containers'})
    articles = div3.find_all('article')
    for art in articles:
        # art.href would search for a child <href> tag, so read the
        # href attribute from the first <a> inside the article instead
        link = art.find('a', href=True)
        art_dict[art.text] = link['href'] if link else None
    # test print
    for article in art_dict:
        print('Article :: {}'.format(article), 'Link :: {}'.format(art_dict[article]))
You can use Selenium so that the site's JavaScript fills in the data, then use your existing bs4 code to scrape the articles.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.cnn.com/')
soup = BeautifulSoup(driver.page_source, 'html.parser')
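If page_source still comes back without the articles, the script may simply not have finished yet. Below is a rough sketch of one way to handle that, not a tested CNN scraper: it waits until at least one article tag is present before grabbing the HTML, and keeps the parsing in a separate function. The names fetch_rendered_html and extract_articles, the 10 second timeout, and the assumption that each story sits in an article tag containing a link are all mine; adjust them to whatever markup you actually see.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def fetch_rendered_html(url, timeout=10):
    """Load url in Chrome and return the HTML after JavaScript has run.

    Hypothetical helper. Selenium is imported here so that the parsing
    function below can be used without a browser installed.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Assumption: the stories are rendered as <article> elements.
        # Block until at least one exists, or raise after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "article"))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()


def extract_articles(html, base_url):
    """Map each article's text to the absolute URL of its first link.

    Hypothetical helper. Articles without an <a href=...> are skipped.
    """
    soup = BeautifulSoup(html, "html.parser")
    art_dict = {}
    for art in soup.find_all("article"):
        link = art.find("a", href=True)
        if link is None:
            continue
        title = art.get_text(" ", strip=True)
        # Relative links such as /2019/... are resolved against base_url.
        art_dict[title] = urljoin(base_url, link["href"])
    return art_dict
```

You would then call something like extract_articles(fetch_rendered_html('https://www.cnn.com/'), 'https://www.cnn.com/') and loop over the resulting dict as in your test print.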