python web-scraping beautifulsoup hyperlink imageurl

Python web scraper with BeautifulSoup: the link does not go to the headline story but redirects to the archives page


The link my scraper builds redirects me to an archives page with other top stories: https://www.coindesk.com/news/babel-finance-bets-on-longtime-fintech-hand-to-help-navigate-regulatory-landscape. The news segment between .com and babel should not be there; I believe that is what sends the headline link to another page.

from bs4 import BeautifulSoup
import requests


base_url ='https://www.coindesk.com/news'

source = requests.get(base_url).text

soup = BeautifulSoup(source, "html.parser")       
    
    
articles = soup.find_all(class_ = 'list-item-card post')
    
#print(len(articles))
#print(articles) 

    
for article in articles:
      
    headline = article.h4.text.strip()
    link = base_url + article.find_all("a")[1]["href"]
    text = article.find(class_="card-text").text.strip()
    img_url = base_url+article.picture.img['src']
            
    print(headline)
    print(link)
    print(text)
    print("Image " + img_url)


Solution

  • The error happens because you are concatenating your base URL (which already ends in /news) with an href that is a root-relative path (it starts with /), so /news ends up in front of the article path

    To prevent this, you can use urllib.parse.urljoin()

    In your example this should fix the issue:

    from urllib.parse import urljoin
    
    link = urljoin(base_url, article.find_all("a")[1]["href"])
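
    To see the difference, here is a minimal sketch that compares plain concatenation with urljoin(). The href values are made up for illustration: the first is taken from the URL in your question, and the example.com image URL and "some-story" are hypothetical.

```python
from urllib.parse import urljoin

base_url = "https://www.coindesk.com/news"

# Illustrative href values (assumptions, not scraped from the live page).
root_relative = "/babel-finance-bets-on-longtime-fintech-hand-to-help-navigate-regulatory-landscape"
absolute = "https://example.com/images/photo.jpg"  # hypothetical image URL
relative = "some-story"                            # hypothetical relative href

# Plain concatenation keeps the /news segment, which is the bug in the question.
naive_link = base_url + root_relative

# urljoin resolves the href against the base the way a browser would.
joined_link = urljoin(base_url, root_relative)

# An href that is already a full URL is returned unchanged.
joined_absolute = urljoin(base_url, absolute)

# A relative href replaces the last path segment of the base,
# because the base has no trailing slash.
joined_relative = urljoin(base_url, relative)

print(naive_link)
print(joined_link)
print(joined_absolute)
print(joined_relative)
```

    The same call is probably worth using for img_url as well: if the src is already a full URL, base_url + src would produce a broken address, while urljoin() leaves it untouched.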