I want to get articles from several websites. I tried this, but I don't know what to do next:
lm_paper = newspaper.build('https://www.lemonde.fr/')
parisien_paper = newspaper.build('https://www.leparisien.fr/')
papers = [lm_paper, parisien_paper]
news_pool.set(papers, threads_per_source=2) # (2 sources * 2) = 4 threads total
news_pool.join()
Below is how you can use newspaper's news_pool. Note that news_pool is time-intensive: it can take minutes before titles start printing, because all the articles are downloaded in the background first. I'm not sure how to speed that part up within Newspaper itself.
import newspaper
from newspaper import Config
from newspaper import news_pool
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10
lm_paper = newspaper.build('https://www.lemonde.fr/', config=config, memoize_articles=False)
parisien_paper = newspaper.build('https://www.leparisien.fr/', config=config, memoize_articles=False)
french_papers = [lm_paper, parisien_paper]
# this setting is adjustable
news_pool.config.number_threads = 2
# this setting is adjustable
news_pool.config.thread_timeout_seconds = 1
news_pool.set(french_papers)
news_pool.join()
for source in french_papers:
    for article_extract in source.articles:
        if article_extract:
            article_extract.parse()
            print(article_extract.title)
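If the sequential parse() loop above is the bottleneck rather than the downloads, you can also parse articles concurrently with the standard library's ThreadPoolExecutor. This is a minimal sketch of that pattern, not part of the Newspaper API; the FakeArticle class is a stand-in for Newspaper's Article objects so the example runs without network access, and you would pass source.articles instead:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_and_title(article):
    """Parse one article and return its title."""
    article.parse()
    return article.title

# Hypothetical stand-in for newspaper.Article, used only so this
# sketch is runnable offline; replace with the real articles list.
class FakeArticle:
    def __init__(self, title):
        self._title = title
        self.title = None

    def parse(self):
        self.title = self._title

articles = [FakeArticle(f"story {i}") for i in range(4)]

# map() preserves input order, so titles come back in article order.
with ThreadPoolExecutor(max_workers=4) as pool:
    titles = list(pool.map(parse_and_title, articles))

print(titles)  # ['story 0', 'story 1', 'story 2', 'story 3']
```

With the real library you would iterate over each source in french_papers and submit source.articles to the pool; since parse() mostly waits on I/O and string processing, a small thread pool usually helps.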