[python] [python-3.x] [web-scraping] [python-newspaper] [newspaper3k]

Python Newspaper3k library multithreading hangs indefinitely


I'm working on a project to extract articles from gaming media sites. In a basic test run against two sites, the script consistently hangs (according to VSCode's debugger) right after I set up the multi-threaded extraction; changing the number of threads does not help. I'm honestly not sure what I'm doing wrong here; I followed the examples that have been laid out. One of the sites, Gamespot, is even used in someone's tutorial, and removing the other (Polygon) doesn't seem to help either. I've created a virtual environment and tried this with both Python 3.8 and 3.7, and all dependencies appear to be satisfied. I also tested it on repl.it and got the same hang.

I would love to hear that I'm just doing something wrong so I can fix it; I really want to do some data science on these specific websites and their articles! But it seems as if, at least on OS X, there's some sort of bug in the multithreading. Here's my code:

#import system functions
import sys
import requests
sys.path.append('/usr/local/lib/python3.8/site-packages/')
#import basic HTTP handling processes
#import urllib
#from urllib.request import urlopen
#import scraping libraries

#import newspaper and BS dependencies

from bs4 import BeautifulSoup
import newspaper
from newspaper import Article 
from newspaper import Source 
from newspaper import news_pool

#import broad data libraries
import pandas as pd

#import gaming related news sources as newspapers
gamespot = newspaper.build('https://www.gamespot.com/news', memoize_articles=False)
polygon = newspaper.build('https://www.polygon.com/gaming', memoize_articles=False)

#organize the gaming related news sources using a list
gamingPress = [gamespot, polygon]
print("About to set the pool.")
#parallel process these articles using multithreading (store in mem)
news_pool.set(gamingPress, threads_per_source=4)
print("Setting the pool")
news_pool.join()
print("Pool set")
#create the interim pandas dataframe based on these sources
final_df = pd.DataFrame()

#cap the number of articles taken per source; no limit is placed on the sources themselves
limit = 10

for source in gamingPress:
    #these are temporary placeholder lists for elements to be extracted
    list_title = []
    list_text = []
    list_source = []

    count = 0

    for article_extract in source.articles:
        article_extract.parse()
        
        #further limit functionality could be placed here; not placed
        if count > limit:
            break

        list_title.append(article_extract.title)
        list_text.append(article_extract.text)
        list_source.append(article_extract.source_url)

        print(count)
        count += 1  #advance the article count for this source

    temp_df = pd.DataFrame({'Title': list_title, 'Text': list_text, 'Source': list_source})
    #Append this to the final DataFrame
    final_df = final_df.append(temp_df, ignore_index=True)

#export to CSV, placeholder for deeper analysis/more limited scope, may remain
final_df.to_csv('gaming_press.csv')

And here's what I get back when I finally give up and hit interrupt at the console:


About to set the pool.
Setting the pool
^X^X^CTraceback (most recent call last):
  File "scraper1.py", line 31, in <module>
    news_pool.join()
  File "/usr/local/lib/python3.8/site-packages/newspaper3k-0.3.0-py3.8.egg/newspaper/mthreading.py", line 103, in join
    self.pool.wait_completion()
  File "/usr/local/lib/python3.8/site-packages/newspaper3k-0.3.0-py3.8.egg/newspaper/mthreading.py", line 63, in wait_completion
    self.tasks.join()
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/queue.py", line 89, in join
    self.all_tasks_done.wait()
  File "/usr/local/Cellar/python@3.8/3.8.5/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 302, in wait
    waiter.acquire()
KeyboardInterrupt

Solution

  • I decided to look into the Newspaper multithreading issue. I looked at the source code for Newspaper on GitHub and devised this answer. In my testing I was able to obtain the article titles.

    This processing is time-intensive: in my testing it took about 6 minutes on average. After doing some more research, it looks like the lag comes directly from the articles being downloaded in the background. I'm unsure how to speed this up within Newspaper itself.

    import newspaper
    from newspaper import Config
    from newspaper import news_pool
    
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
    
    config = Config()
    config.browser_user_agent = USER_AGENT
    config.request_timeout = 10
    
    gamespot = newspaper.build('https://www.gamespot.com/news', config=config, memoize_articles=False)
    polygon = newspaper.build('https://www.polygon.com/gaming', config=config, memoize_articles=False)
    
    gamingPress = [gamespot, polygon]
    
    # this setting is adjustable 
    news_pool.config.number_threads = 2
    
    # this setting is adjustable 
    news_pool.config.thread_timeout_seconds = 2
    
    news_pool.set(gamingPress)
    news_pool.join()
    
    for source in gamingPress:
        for article_extract in source.articles:
            article_extract.parse()
            print(article_extract.title)
    

    To be honest, I'm still trying to determine the benefit of using news_pool. Judging from the comments in Newspaper's source code, news_pool's primary purpose is connection rate-limiting. I also noted that several attempts have been made to improve the threading model, but those changes haven't been merged into the production code.

    Nevertheless... the answer below starts processing in under 1 minute, and it doesn't use news_pool. More testing needs to be done to see whether a source rate-limits the connections or other issues arise; the hand-rolled thread-pool sketch after the code gives more direct control over the request rate for that kind of testing.

    import newspaper
    from newspaper import Config
    
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
    
    config = Config()
    config.browser_user_agent = USER_AGENT
    config.request_timeout = 10
    
    gamespot = newspaper.build('https://www.gamespot.com/news', config=config, memoize_articles=False)
    polygon = newspaper.build('https://www.polygon.com/gaming', config=config, memoize_articles=False)
    gamingPress = [gamespot, polygon]
    for source in gamingPress:
        source.download_articles()
        for article_extract in source.articles:
            article_extract.parse()
            print(article_extract.title)
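
    For more direct control over the request rate than news_pool or source.download_articles() expose, one option is to hand-roll the download pool with the standard library's concurrent.futures. This is a minimal sketch rather than Newspaper's own machinery: max_workers=4 and the skip-on-failure handling are my assumptions, and the worker count should be lowered if a source starts rate-limiting.

    import concurrent.futures

    import newspaper
    from newspaper import Config

    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

    config = Config()
    config.browser_user_agent = USER_AGENT
    config.request_timeout = 10

    gamespot = newspaper.build('https://www.gamespot.com/news', config=config, memoize_articles=False)
    polygon = newspaper.build('https://www.polygon.com/gaming', config=config, memoize_articles=False)
    gamingPress = [gamespot, polygon]

    def download_and_parse(article):
        #download() fetches the HTML; parse() extracts the title and text
        try:
            article.download()
            article.parse()
            return article.title
        except Exception:
            return None  #skip articles that fail to download or parse

    #max_workers=4 is an assumption; lower it if a source rate-limits you
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        for source in gamingPress:
            for title in pool.map(download_and_parse, source.articles):
                if title:
                    print(title)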
    

    Concerning the news_pool code section: for some reason I noted duplicate article titles in my limited testing of your target sources. A simple de-duplication pass, sketched below, works around this.
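
    If the duplicates matter for your analysis, one workaround is to de-duplicate on the article URL before parsing (swap in article_extract.title as the key if the same piece appears under different URLs). A minimal sketch, assuming gamingPress has been built and its articles downloaded as in the blocks above:

    seen_urls = set()
    unique_articles = []

    for source in gamingPress:
        for article_extract in source.articles:
            #keep only the first occurrence of each discovered article URL
            if article_extract.url not in seen_urls:
                seen_urls.add(article_extract.url)
                unique_articles.append(article_extract)

    for article_extract in unique_articles:
        article_extract.parse()
        print(article_extract.title)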