python-3.x, csv, web-scraping, newspaper3k

Newspaper3k export to csv on first row only


With the help of 'Life is complex' I have managed to scrape data from the CNN news website. The extracted data (URLs) are saved in a .csv file (test1). Note this was done manually, as it was easier to do!

from newspaper import Config
from newspaper import Article
from newspaper import ArticleException
import csv

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

with open('test1.csv', 'r') as file:
    csv_file = file.readlines()
    for url in csv_file:
        try:
            article = Article(url.strip(), config=config)
            article.download()
            article.parse()
            print(article.title)
            article_text = article.text.replace('\n', ' ')
            print(article.text)
        except ArticleException:
            print('***FAILED TO DOWNLOAD***', article.url)
            
with open('test2.csv', 'a', newline='') as csvfile:
    headers = ['article title', 'article text']
    writer = csv.DictWriter(csvfile, lineterminator='\n', fieldnames=headers)
    writer.writeheader()
    writer.writerow({'article title': article.title,
                     'article text': article.text})

With the code above I manage to scrape the actual news information (title and content) from the URLs and also export it to a .csv file. The only issue with the export is that it only exports the last title and text (so I think it keeps overwriting the info on the first row).

How can I get all the titles and content in the .csv file?


Solution

  • Thanks for giving me a shout-out.

    The code below should help you solve your CSV write issue. If it doesn't, just let me know and I will rework my answer.

    P.S. I will update my Newspaper3k overview document with more details on writing CSV files.

    P.P.S. I'm currently writing a new news scraper, because development of Newspaper3k is dead. I'm unsure of the release date of my code.

    import csv
    from newspaper import Config
    from newspaper import Article
    from os.path import exists
    
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
    
    config = Config()
    config.browser_user_agent = USER_AGENT
    config.request_timeout = 10
    
    urls = ['https://www.cnn.com/2021/10/25/tech/facebook-papers/index.html', 'https://www.cnn.com/entertainment/live-news/rust-shooting-alec-baldwin-10-25-21/h_257c62772a2b69cb37db397592971b58']
    for url in urls:
        article = Article(url, config=config)
        article.download()
        article.parse()
        article_meta_data = article.meta_data
    
        # pull the publication date out of the page's meta tags: the set
        # comprehension collects the value stored under the 'pubdate' key
        published_date = {value for (key, value) in article_meta_data.items() if key == 'pubdate'}
        article_published_date = " ".join(str(x) for x in published_date)
    
        # write the header only the first time the results file is created
        file_exists = exists('cnn_extraction_results.csv')
        if not file_exists:
            with open('cnn_extraction_results.csv', 'w', newline='') as file:
                headers = ['date published', 'article title', 'article text']
                writer = csv.DictWriter(file, delimiter=',', lineterminator='\n', fieldnames=headers)
                writer.writeheader()
                writer.writerow({'date published': article_published_date,
                                 'article title': article.title,
                                 'article text': article.text})
        else:
            with open('cnn_extraction_results.csv', 'a', newline='') as file:
                headers = ['date published', 'article title', 'article text']
                writer = csv.DictWriter(file, delimiter=',', lineterminator='\n', fieldnames=headers)
                writer.writerow({'date published': article_published_date,
                                 'article title': article.title,
                                 'article text': article.text})
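
    A minimal variant of the same idea (just a sketch, assuming test1.csv from the question holds one URL per line) keeps the results file open for the whole run, writes the header at most once, and writes one row per article inside the loop; it also keeps the question's ArticleException handling:

    import csv
    from os.path import exists
    from newspaper import Article
    from newspaper import ArticleException
    from newspaper import Config

    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

    config = Config()
    config.browser_user_agent = USER_AGENT
    config.request_timeout = 10

    results_file = 'cnn_extraction_results.csv'
    headers = ['article title', 'article text']

    # decide before opening in append mode whether the header still needs to be written
    write_header = not exists(results_file)

    with open('test1.csv', 'r') as url_file, \
            open(results_file, 'a', newline='') as csv_file:
        writer = csv.DictWriter(csv_file, lineterminator='\n', fieldnames=headers)
        if write_header:
            writer.writeheader()
        for url in url_file:
            try:
                article = Article(url.strip(), config=config)
                article.download()
                article.parse()
                # one row per article, written while the loop is still running
                writer.writerow({'article title': article.title,
                                 'article text': article.text})
            except ArticleException:
                print('***FAILED TO DOWNLOAD***', url.strip())

    Because the header is written only when the results file does not exist yet, re-running the script appends new rows without repeating the header line.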