python-3.xbeautifulsouppython-requestspython-newspapernewspaper3k

News article extract using requests,bs4 and newspaper packages. why doesn't links=soup.select(".r a") find anything?. This code was working earlier


Objective: I am trying to download the news article based on the keywords to perform sentiment analysis.

This code was working a few months ago but now it returns a null value. I tried fixing the issue butlinks=soup.select(".r a") return null value.

import pandas as pd
import requests
from bs4 import BeautifulSoup
import string
import nltk
from urllib.request import urlopen
import sys
import webbrowser
import newspaper 
import time
from newspaper import Article

Company_name1 =[]
Article_number1=[]
Article_Title1=[]
Article_Authors1=[]
Article_pub_date1=[]
Article_Text1=[]
Article_Summary1=[]
Article_Keywords1=[]
Final_dataframe=[]

class Newspapr_pd:
    def __init__(self,term):
        self.term=term
        self.subjectivity=0
        self.sentiment=0
        self.url='https://www.google.com/search?q={0}&safe=active&tbs=qdr:w,sdb:1&tbm=nws&source=lnt&dpr=1'.format(self.term)
    
    def NewsArticlerun_pd(self):
        response=requests.get(self.url)
        response.raise_for_status()
        #print(response.text)
        soup=bs4.BeautifulSoup(response.text,'html.parser')
        links=soup.select(".r a")
       
        numOpen = min(5, len(links))
        Article_number=0
        for i in range(numOpen):
            response_links = webbrower.open("https://www.google.com" + links[i].get("href"))
            
            
            
        #For different language newspaper refer above table 
            article = Article(response_links, language="en") # en for English 
            Article_number+=1
            
            print('*************************************************************************************')
            
            Article_number1.append(Article_number)
            Company_name1.append(self.term)

        #To download the article 
            try:

                article.download() 
                 #To parse the article 
                article.parse() 
                #To perform natural language processing ie..nlp 
                article.nlp() 
  
        #To extract title
                Article_Title1.append(article.title)

  
        #To extract text
                Article_Text1.append(article.text)

  
        #To extract Author name
                Article_Authors1.append(article.authors)

                
        #To extract article published date
                Article_pub_date1.append(article.publish_date)
                

                
        #To extract summary
                Article_Summary1.append(article.summary)
                

  
        #To extract keywords 
                Article_Keywords1.append(article.keywords)

            except:
                print('Error in loading page')
                continue
  
        for art_num,com_name,title,text,auth,pub_dt,summaries,keywds in zip(Article_number1,Company_name1,Article_Title1,Article_Text1,Article_Authors1,Article_pub_date1,Article_Summary1,Article_Keywords1):
            Final_dataframe.append({'Article_link_num':art_num, 'Company_name':com_name,'Article_Title':title,'Article_Text':text,'Article_Author':auth,
                                   'Article_Published_date':pub_dt,'Article_Summary':summaries,'Article_Keywords':keywds})
        
list_of_companies=['Amazon','Jetairways','nirav modi']

for i in list_of_companies:
    comp = str('"'+ i + '"')
    a=Newspapr_pd(comp)
    a.NewsArticlerun_pd()

Final_new_dataframe=pd.DataFrame(Final_dataframe)
Final_new_dataframe.tail()    

Solution

  • This is a very complex issue, because Google News continually changes their class names. Additionally Google will add various prefixes to article urls and throw in some hidden ad or social media tags.

    The answer below only addresses scraping articles from Google news. More testing is needed to determine how it works with a large amount of keywords and with Google News changing page structure.

    The Newspaper3k extraction is even more complex, because each article can have a different structure. I would recommend looking at my Newspaper3k Usage Overview document for details on how to design that part of your code.

    P.S. I'm current writing a new news scraper, because the development for Newspaper3k is dead. I'm unsure of the release date of my code.

    import requests
    import re as regex
    from bs4 import BeautifulSoup
    
    
    def get_google_news_article(search_string):
        articles = []
        url = f'https://www.google.com/search?q={search_string}&safe=active&tbs=qdr:w,sdb:1&tbm=nws&source=lnt&dpr=1'
        response = requests.get(url)
        raw_html = BeautifulSoup(response.text, "lxml")
        main_tag = raw_html.find('div', {'id': 'main'})
        for div_tag in main_tag.find_all('div', {'class': regex.compile('xpd')}):
            for a_tag in div_tag.find_all('a', href=True):
                if not a_tag.get('href').startswith('/search?'):
                    none_articles = bool(regex.search('amazon.com|facebook.com|twitter.com|youtube.com|wikipedia.org', a_tag['href']))
                    if none_articles is False:
                        if a_tag.get('href').startswith('/url?q='):
                            find_article = regex.search('(.*)(&sa=)', a_tag.get('href'))
                            article = find_article.group(1).replace('/url?q=', '')
                            if article.startswith('https://'):
                                articles.append(article)
    
        return articles
    
                    
    
    list_of_companies = ['amazon', 'jet airways', 'nirav modi']
    for company_name in list_of_companies:
        print(company_name)
        search_results = get_google_news_article(company_name)
        for item in sorted(set(search_results)):
            print(item)
        print('\n')
    

    This is the output from the code above:

    amazon
    https://9to5mac.com/2021/11/15/amazon-releases-native-prime-video-app-for-macos-with-purchase-support-and-more/
    https://wtvbam.com/2021/11/15/india-police-to-question-amazon-executives-in-probe-over-marijuana-smuggling/
    https://www.cnet.com/home/smart-home/all-the-new-amazon-features-for-your-smart-home-alexa-disney-echo/
    https://www.cnet.com/tech/amazon-unveils-black-friday-deals-starting-on-nov-25/
    https://www.crossroadstoday.com/i/amazons-best-black-friday-deals-for-2021-2/
    https://www.reuters.com/technology/ibm-amazon-partner-extend-reach-data-tools-oil-companies-2021-11-15/
    https://www.theverge.com/2021/11/15/22783275/amazon-basics-smart-switches-price-release-date-specs
    https://www.tomsguide.com/news/amazon-echo-motion-detection
    https://www.usatoday.com/story/money/shopping/2021/11/15/amazon-black-friday-2021-deals-online/8623710002/
    https://www.winknews.com/2021/11/15/new-amazon-sortation-center-began-operations-monday-could-bring-faster-deliveries/
    
    jet airways
    https://economictimes.indiatimes.com/markets/expert-view/first-time-in-two-decades-new-airlines-are-starting-instead-of-closing-down-jyotiraditya-scindia/articleshow/87660724.cms
    https://menafn.com/1103125331/Jet-Airways-to-resume-operations-in-Q1-2022
    https://simpleflying.com/jet-airways-100-aircraft-5-years/
    https://simpleflying.com/jet-airways-q3-loss/
    https://www.business-standard.com/article/companies/defunct-carrier-jet-airways-posts-rs-306-cr-loss-in-september-quarter-121110901693_1.html
    https://www.business-standard.com/article/markets/stocks-to-watch-ril-aurobindo-bhel-m-m-jet-airways-idfc-powergrid-121110900189_1.html
    https://www.financialexpress.com/market/nykaa-hdfc-zee-media-jet-airways-power-grid-berger-paints-petronet-lng-stocks-in-focus/2366063/
    https://www.moneycontrol.com/news/business/earnings/jet-airways-standalone-september-2021-net-sales-at-rs-41-02-crore-up-313-51-y-o-y-7702891.html
    https://www.spokesman.com/stories/2021/nov/11/boeing-set-to-dent-airbus-india-dominance-with-737/
    https://www.timesnownews.com/business-economy/industry/article/times-now-summit-2021-jet-airways-will-make-a-comeback-into-indian-skies-akasa-to-take-off-next-year-says-jyotiraditya-scindia/831090
    
    
    nirav modi
    https://m.republicworld.com/india-news/general-news/piyush-goyal-says-few-rotten-eggs-destroyed-credibility-of-countrys-ca-sector.html
    https://www.bulletnews.net/akkad-bakkad-rafu-chakkar-review-the-story-of-robbing-people-by-making-fake-banks/
    https://www.daijiworld.com/news/newsDisplay%3FnewsID%3D893048
    https://www.devdiscourse.com/article/law-order/1805317-hc-seeks-centres-stand-on-bankers-challenge-to-dismissal-from-service
    https://www.geo.tv/latest/381560-arif-naqvis-extradition-case-to-be-heard-after-nirav-modi-case-ruling
    https://www.hindustantimes.com/india-news/cbiand-ed-appointments-that-triggered-controversies-101636954580012.html
    https://www.law360.com/articles/1439470/suicide-test-ruling-delays-abraaj-founder-s-extradition-case
    https://www.moneycontrol.com/news/trends/current-affairs-trends/nirav-modi-extradition-case-outcome-of-appeal-to-also-affect-pakistani-origin-global-financier-facing-16-charges-of-fraud-and-money-laundering-7717231.html
    https://www.thehansindia.com/hans/opinion/news-analysis/uniform-law-needed-for-free-exit-of-rich-businessmen-714566
    https://www.thenews.com.pk/print/908374-uk-judge-delays-arif-naqvi-s-extradition-to-us