python, multithreading, parsing, nlp, python-newspaper

Python: Newspaper Module - Any way to pool getting articles straight from URLs?


I'm using the Newspaper module for Python, found here.

The tutorial describes how you can pool the building of different newspapers so that they are generated concurrently (see "Multi-threading article downloads" in the link above).

Is there any way to do this for pulling articles straight from a LIST of URLs? That is, is there any way I can pump multiple URLs into the following setup and have it download and parse them concurrently?

from newspaper import Article
url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
a = Article(url, language='zh') # Chinese
a.download()
a.parse()
print(a.text[:150])

Solution

  • I was able to do this by creating a Source for each article URL. (disclaimer: not a python developer)

    import newspaper
    
    urls = [
      'http://www.baltimorenews.net/index.php/sid/234363921',
      'http://www.baltimorenews.net/index.php/sid/234323971',
      'http://www.atlantanews.net/index.php/sid/234323891',
      'http://www.wpbf.com/news/funeral-held-for-gabby-desouza/33874572',  
    ]
    
    class SingleSource(newspaper.Source):
        def __init__(self, articleURL):
            super(SingleSource, self).__init__("http://localhost")
            self.articles = [newspaper.Article(url=articleURL)]
    
    sources = [SingleSource(articleURL=u) for u in urls]
    
    newspaper.news_pool.set(sources)
    newspaper.news_pool.join()
    
    for s in sources:
        print(s.articles[0].html)
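If you only need the articles themselves and not a full Source object, another option is to skip news_pool entirely and run the same download/parse calls from the question through a standard-library thread pool. A minimal sketch (the fetch_article/fetch_all names are mine, not part of newspaper):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_article(url):
    # Imported here so the pooling logic below has no hard dependency
    # on newspaper being installed.
    from newspaper import Article
    a = Article(url)
    a.download()
    a.parse()
    return a

def fetch_all(urls, worker=fetch_article, max_workers=10):
    # map() preserves input order even though downloads finish out of order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, urls))
```

Usage would then be `articles = fetch_all(urls)` followed by `print(articles[0].text[:150])`. Downloads happen in parallel across threads, which is where the time is spent; parsing is CPU-bound but usually cheap by comparison.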