Tags: python, html, web-scraping, beautifulsoup, html-injections

Webscrape CNN, injection, beautiful soup, python, requests, HTML


Okay, I thought I was going crazy because I repeatedly failed at this, but maybe something is happening with the HTML that I don't understand.

I have been trying to scrape the 'articles' from cnn.com.

But no matter which way I tried, whether soup.find_all('articles') or soup.find('body').div('div') and so on, with class tags, ids, etc., it failed.

I found this reference: Webscraping from React web application after componentDidMount.

I suspect that content being injected into the HTML after the page loads is why I am having issues.

I know 0 about injection other than 'html injection attacks' from cyber security reading.

I want the articles, but I am assuming I will need to use a tactic similar to the one in the Stack Overflow question linked above. I do not know how. Links to help documents, or to anything specifically about scraping CNN, would be appreciated.

Alternatively, does anyone know how I could get the 'full data' of the HTML body element, so that I could do some rearranging early in this function and then just reassign body?

'Or just tell me I'm an idiot and on the wrong track'

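import requests
from bs4 import BeautifulSoup

def build_art_d(site):
    # Fetch the static HTML; anything added later by JavaScript is not included.
    html = requests.get(site).text
    soup = BeautifulSoup(html, 'lxml')
    print(soup.prettify())

    art_dict = {}

    # Walk down to the container that should hold the articles.
    # Each find() returns None when the element is missing from the static HTML.
    body = soup.find('body')
    print(body.prettify())
    div1 = body.find('div', {'class': 'pg-no-rail pg-wrapper'})
    section = div1.find('section', {'id': 'homepage1-zone-1'})
    div2 = section.find('div', {'class': 'l-container'})
    div3 = div2.find('div', {'class': 'zn__containers'})
    articles = div3.find_all('article')

    for art in articles:
        # art.href looks for a child <href> tag and returns None;
        # read the href attribute of the first <a> in the article instead.
        link = art.find('a', href=True)
        art_dict[art.text] = link['href'] if link else None

    # test print
    for article in art_dict:
        print('Article :: {}'.format(article), 'Link :: {}'.format(art_dict[article]))

    return art_dict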

Solution

  • You can use Selenium to load the page in a real browser, so that the site's JavaScript fills in the data. Then use your existing bs4 code to scrape the articles.

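    from bs4 import BeautifulSoup
    from selenium import webdriver
    
    # Load the page in a real browser so the site's JavaScript can run.
    driver = webdriver.Chrome()
    driver.get('https://www.cnn.com/')
    
    # page_source reflects the page after the browser has loaded it.
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    driver.quit()  # close the browser once the HTML has been captured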
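
If the articles have not appeared yet when page_source is read, an explicit wait may help. Below is a rough sketch of that idea, not a tested CNN scraper. It assumes the rendered page contains <article> elements that each wrap an <a href=...> link, which is what the question's code expects but which I have not verified. The function names fetch_rendered_html and extract_articles and the 15 second timeout are made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def fetch_rendered_html(url, timeout=15):
    """Load url in Chrome and return the HTML once an <article> tag exists."""
    # Imported here so extract_articles() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Wait until at least one <article> element is present in the DOM;
        # raises TimeoutException if none shows up within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "article"))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()


def extract_articles(html, base_url):
    """Map each article's link text to its absolute URL."""
    soup = BeautifulSoup(html, "html.parser")
    art_dict = {}
    for art in soup.find_all("article"):
        # Assumption: the first <a> with an href is the article's main link.
        link = art.find("a", href=True)
        if link is None:
            continue  # skip articles that contain no link
        title = link.get_text(strip=True)
        if title:
            # Resolve relative links such as "/2020/story" against the site URL.
            art_dict[title] = urljoin(base_url, link["href"])
    return art_dict
```

To use it, call fetch_rendered_html('https://www.cnn.com/') and pass the result, together with the same URL, to extract_articles. Keeping the parsing separate from the browser step means you can try the bs4 part on a saved HTML file without launching Chrome each time.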