
Scraping Date of News


I am trying to scrape the publication date of a news article at https://finansial.bisnis.com/read/20210506/90/1391096/laba-bank-mega-tumbuh-dua-digit-kuartal-i-2021-ini-penopangnya. Here is my code:

news['tanggal'] = newsScrape['date']
dates = []
for x in news['tanggal']:
    x = listToString(x)
    x = x.strip()
    x = x.replace('\r', '').replace('\n', '').replace(' \xa0|\xa0', ',').replace('|', ', ')
    dates.append(x)
dates = listToString(dates)
dates = dates[0:20]
if len(dates) == 0:
    continue
news['tanggal'] = dt.datetime.strptime(dates, '%d %B %Y, %H:%M')

But I get this error:

ValueError: time data '06 Mei 2021, 11:32  ' does not match format '%d %B %Y, %H:%M'

My assumption is that this happens because Mei is Indonesian, while the format expects May, which is English. How can I change Mei to May? I have tried dates = dates.replace('Mei', 'May'), but it does not work for me. When I tried it, I got the error "ValueError: unconverted data remains:". The type of dates is string. Thanks.


Solution

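  • You can try the following. After slicing, the date string still ends with whitespace ('06 Mei 2021, 11:32  '), which is why strptime reports "unconverted data remains" even after Mei is replaced with May. Calling rstrip() before parsing removes it: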

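    import datetime as dt

    import requests
    from bs4 import BeautifulSoup

    url = "https://finansial.bisnis.com/read/20210506/90/1391096/laba-bank-mega-tumbuh-dua-digit-kuartal-i-2021-ini-penopangnya"
    # verify=False skips TLS certificate verification; drop it if the site's certificate validates.
    r = requests.get(url, verify=False)
    soup = BeautifulSoup(r.content, 'html.parser')
    # The date is in the first <span> inside the element with class "new-description".
    info_soup = soup.find(class_="new-description")
    x = info_soup.find('span').get_text(strip=True)
    x = x.replace('\r', '').replace('\n', '').replace(' \xa0|\xa0', ',').replace('|', ', ')
    # Keep only the date/time part, then drop the trailing whitespace that strptime rejects.
    x = x[0:20].rstrip()
    # %B expects the English month name under the default (C/English) locale, so translate "Mei" first.
    date = dt.datetime.strptime(x.replace('Mei', 'May'), '%d %B %Y, %H:%M')
    print(date)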
    

    result:

    2021-05-06 11:45:00
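
  • Replacing 'Mei' only covers articles published in May. As a rough sketch of a more general approach, the snippet below maps every Indonesian month name to its number and builds the datetime directly, without strptime or any locale setting. The names INDONESIAN_MONTHS and parse_indonesian_date are hypothetical helpers made up for this illustration, and the code assumes the string always has the form 'DD Month YYYY, HH:MM' seen in the error message above.

```python
import datetime as dt

# Indonesian month names mapped to month numbers.
# (Hypothetical helper for illustration; not part of the original answer.)
INDONESIAN_MONTHS = {
    'Januari': 1, 'Februari': 2, 'Maret': 3, 'April': 4,
    'Mei': 5, 'Juni': 6, 'Juli': 7, 'Agustus': 8,
    'September': 9, 'Oktober': 10, 'November': 11, 'Desember': 12,
}


def parse_indonesian_date(text):
    """Parse a string such as '06 Mei 2021, 11:32  ' into a datetime.

    Assumes the layout 'DD Month YYYY, HH:MM'; anything after the time is ignored.
    """
    # Collapse runs of whitespace (including non-breaking spaces) and trim both ends.
    cleaned = ' '.join(text.split())
    date_part, time_part = cleaned.split(',', 1)
    day, month_name, year = date_part.split()
    # Take only the first token after the comma, so trailing text does not break parsing.
    hour, minute = time_part.split()[0].split(':')
    return dt.datetime(int(year), INDONESIAN_MONTHS[month_name],
                       int(day), int(hour), int(minute))


print(parse_indonesian_date('06 Mei 2021, 11:32  '))
```

    An unknown month name raises KeyError here, which makes unexpected input easy to spot. In the scraper above, the cleaned string x could be passed to this function instead of calling strptime.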