python · pandas · web-scraping · scrapy · newspaper3k

News scraping multiple URLs inside a dataframe


So I am trying to use Newspaper3k to scrape content from a few websites. In the library, the function Article() only takes a single URL. Is it possible to iterate over a dataframe full of URLs to scrape them automatically? My df looks like this:

df = ['https://www.liputan6.com/bisnis/read/4661489/erick-thohir-apresiasi-transformasi-digital-pos-indonesia',
      'https://ekonomi.bisnis.com/read/20210918/98/1443952/pos-indonesia-gandeng-nujek-perluas-segmen-pengiriman',
      'https://www.republika.co.id/berita/qzkxdm380/perkuat-layanan-pt-pos-indonesia-gandeng-kurir-wanita']

I tried a few possible answers, like this:

for x in df.iterrows():
    print(x)
a = Article(x,language='id')
b = a.download()
c = a.parse()

But I get this error:

AttributeError: 'tuple' object has no attribute 'decode'
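For reference, `df.iterrows()` yields `(index, row)` tuples, so `x` is a tuple rather than a URL string, which is why `Article` fails on it. A minimal sketch (using a made-up one-column DataFrame) shows what `iterrows()` actually hands you:

```python
import pandas as pd

# Hypothetical one-column DataFrame of URLs
df = pd.DataFrame({'url': ['https://example.com/a', 'https://example.com/b']})

for x in df.iterrows():
    # x is an (index, row) tuple, not a string
    print(type(x))     # <class 'tuple'>
    index, row = x
    print(row['url'])  # the actual URL string
```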

I also tried:


a = Article(url=x in df.iterrows(),language='id')
b = a.download()
c = a.parse()
author = a.authors
date = a.publish_date
text = a.text

combine = {'author':author,'date':date,'text':text}
data = pd.DataFrame(data=combine)

but got this error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
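This error likely comes from the membership test `x in df.iterrows()`: Python compares `x` against each yielded `(index, Series)` tuple, and pandas refuses to collapse a Series comparison into a single True/False. A minimal reproduction of the underlying complaint:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
try:
    bool(s)  # pandas raises rather than guess between any() and all()
except ValueError as e:
    print(e)
```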

I tried a few more code snippets. I'd really appreciate some help. Thanks!


Solution

  • df is not a DataFrame, it's a list. Just iterate through the list:

    from newspaper import Article
    import pandas as pd
    
    urls = ['https://www.liputan6.com/bisnis/read/4661489/erick-thohir-apresiasi-transformasi-digital-pos-indonesia',
            'https://ekonomi.bisnis.com/read/20210918/98/1443952/pos-indonesia-gandeng-nujek-perluas-segmen-pengiriman',
            'https://www.republika.co.id/berita/qzkxdm380/perkuat-layanan-pt-pos-indonesia-gandeng-kurir-wanita']
    
    rows = []
    for url in urls:
        try:
            # Download and parse each article individually
            a = Article(url, language='id')
            a.download()
            a.parse()

            author = a.authors
            date = a.publish_date
            text = a.text

            print(author, date, text)
            row = {'url': url,
                   'author': author,
                   'date': date,
                   'text': text}

            rows.append(row)
        except Exception as e:
            # Keep a placeholder row so failed URLs are still visible
            print(e)
            row = {'url': url,
                   'author': 'N/A',
                   'date': 'N/A',
                   'text': 'N/A'}

            rows.append(row)

    df = pd.DataFrame(rows)
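If the URLs really did live in a DataFrame (say, in a hypothetical column named `url`), you could iterate the column directly; each item is then a plain string, which is exactly what `Article()` expects:

```python
import pandas as pd

# Hypothetical DataFrame with the URLs in a 'url' column
df_urls = pd.DataFrame({'url': ['https://example.com/a',
                                'https://example.com/b']})

# Iterating the column yields plain strings, one per row
for url in df_urls['url']:
    print(url)

# Or materialize the column as a list first
url_list = df_urls['url'].tolist()
```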