Tags: python, newspaper, newspaper3k

Newspaper3k: scraping several websites


I want to get articles from several websites. I tried this, but I don't know what I have to do next:

lm_paper = newspaper.build('https://www.lemonde.fr/')
parisien_paper = newspaper.build('https://www.leparisien.fr/')

papers = [lm_paper, parisien_paper]
news_pool.set(papers, threads_per_source=2) # (2*2) = 4 threads total
news_pool.join()

Solution

  • Below is how to use newspaper's news_pool. Note that news_pool is slow to get going: it can take minutes before the first titles print, because the articles are downloaded in the background first. I'm not aware of a way to speed this up within Newspaper.

    import newspaper
    from newspaper import Config
    from newspaper import news_pool
    
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
    
    config = Config()
    config.browser_user_agent = USER_AGENT
    config.request_timeout = 10
    
    lm_paper = newspaper.build('https://www.lemonde.fr/', config=config, memoize_articles=False)
    parisien_paper = newspaper.build('https://www.leparisien.fr/', config=config, memoize_articles=False)
    french_papers = [lm_paper, parisien_paper]
    
    # this setting is adjustable 
    news_pool.config.number_threads = 2
    
    # this setting is adjustable 
    news_pool.config.thread_timeout_seconds = 1
    
    news_pool.set(french_papers)
    news_pool.join()
    
    for source in french_papers:
        for article_extract in source.articles:
            if article_extract.html:  # skip articles whose download failed
                article_extract.parse()
                print(article_extract.title)
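If you want to see what the threads-per-source fan-out is actually doing, the same pattern can be sketched with the standard library alone. The snippet below is a network-free illustration, not newspaper itself: `fake_download` is a hypothetical stand-in for `article.download()`, and the URLs are made up. Swap in real newspaper calls when adapting it.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_download(url):
    # Hypothetical stand-in for article.download(); it just echoes the
    # URL so the sketch runs without touching the network.
    return f"<html>{url}</html>"

# Two sources, mirroring the lemonde/leparisien setup above.
sources = {
    "lemonde": ["https://www.lemonde.fr/a1", "https://www.lemonde.fr/a2"],
    "leparisien": ["https://www.leparisien.fr/b1"],
}

results = {}
# Two worker threads per source, analogous to threads_per_source=2.
with ThreadPoolExecutor(max_workers=2 * len(sources)) as pool:
    for name, urls in sources.items():
        # pool.map fans the downloads out across the worker threads
        # and collects the results in input order.
        results[name] = list(pool.map(fake_download, urls))

for name, pages in results.items():
    print(name, len(pages))
```

Because `ThreadPoolExecutor` exposes the futures directly, this style also lets you print each title as soon as its download finishes, instead of waiting for the whole pool to join as news_pool does.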