Tags: web-scraping, beautifulsoup, python-requests, web-crawler, newspaper3k

How to get the final URL after redirection (the one shown by the browser) using Python


I'm working on a project that retrieves all the information from news articles on media websites. For this I'm using the newspaper3k library, which works quite well.

However, I have a problem with some URLs (redirect links): according to my research, newspaper3k does not follow the redirection; it only processes the URL passed to it as a parameter.

Here is an example of a link I would like to deal with:

url = "wtm.actualite.20minutes.fr/redirection.html?m=3e2b20a2f1f6dd3c60608f54d7ad4dc5&c=fr&u=https%3A%2F%2Fwww.20minutes.fr%2Fmonde%2F2943823-20210103-bahamas-disparition-bateau-20-personnes-bord%3Fxtor%3DEREC-182-%5Bactualite%5D&dc=yt0U%2FI8COMJyjwQQ1fA2kVEXpoP0nsZydMTZS6jTm2DdKasFuV%2FVA7rEphhqMfGAy%2FlztUlVN4MJt5tg%2FQXfJwmXMRQL8g3Gfwhl%2BsjkkYmd%2BDxDUhb%2BpPRL%2BNsiDETNQeP3MmrQ6ATGJT%2Blf46Zg4DHd%2FzaXy%2B7UAuxatp2UcVd39HKuuMfQHmyDV%2BAxSAJrd4x5CxHqy3uTtZoQEjwGdZ%2FRtoa7YLOWLKhN9tg4TM%3D"

So the goal with this URL is to get the final URL (after redirection) and then pass it to newspaper3k.

I have tried the following solutions, but they don't work for me:

1 - using the requests library as follows:

response = requests.get(url, verify=False, allow_redirects=True)

2 - using the mechanize library as follows:

br = mechanize.Browser()
resp = br.open(url)
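
One way to check why these attempts stop at the intermediate page is to look at the redirect chain that requests records. The sketch below is only an illustration: `redirect_chain` and `fetch_chain` are hypothetical helper names, and the 10-second timeout is an arbitrary choice. It relies on `response.history`, `response.status_code` and `response.url`, which requests provides. If the chain contains a single entry with status 200, the server sent no HTTP redirect, so the redirection is presumably done by the page itself (HTML or JavaScript), which requests does not execute.

```python
def redirect_chain(response):
    """Return (status_code, url) for every hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


def fetch_chain(url):
    """Hypothetical convenience wrapper: fetch `url` and return its redirect chain."""
    import requests  # imported here so redirect_chain() stays usable on its own

    response = requests.get(url, allow_redirects=True, timeout=10)
    return redirect_chain(response)
```

Calling `fetch_chain(url)` on the link above and printing the result shows whether any 3xx hop was followed at all.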

I would like the same behavior as when I use webbrowser, but without actually opening the browser:

import webbrowser
webbrowser.open_new(url)

and finally obtain the right URL:

https://www.20minutes.fr/monde/2943823-20210103-bahamas-disparition-bateau-20-personnes-bord?xtor=EREC-182-[actualite]

Thank you in advance for your reply :)


Solution

  • @James Thank you very much for your answer! It helped me a lot.

    I'm currently working on AWS Glue, so I can only use certain libraries (Selenium is not available, I guess). However, here is my way of finding the link (following your logic, of course):

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import unquote
    
    url = "https://wtm.actualite.20minutes.fr/redirection.html?m=3e2b20a2f1f6dd3c60608f54d7ad4dc5&c=fr&u=https%3A%2F%2Fwww.20minutes.fr%2Fmonde%2F2943823-20210103-bahamas-disparition-bateau-20-personnes-bord%3Fxtor%3DEREC-182-%5Bactualite%5D&dc=yt0U%2FI8COMJyjwQQ1fA2kVEXpoP0nsZydMTZS6jTm2DdKasFuV%2FVA7rEphhqMfGAy%2FlztUlVN4MJt5tg%2FQXfJwmXMRQL8g3Gfwhl%2BsjkkYmd%2BDxDUhb%2BpPRL%2BNsiDETNQeP3MmrQ6ATGJT%2Blf46Zg4DHd%2FzaXy%2B7UAuxatp2UcVd39HKuuMfQHmyDV%2BAxSAJrd4x5CxHqy3uTtZoQEjwGdZ%2FRtoa7YLOWLKhN9tg4TM%3D"
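    # verify=False disables TLS certificate verification, as in my first attempt
    response = requests.get(url, verify=False, allow_redirects=True)
    # stop here on an HTTP error instead of continuing with no page to parse
    response.raise_for_status()
    
    # parse the HTML of the intermediate redirection page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # take the first <link> tag that has an href attribute
    link = soup.find("link", href=True)
    if link is None:
        raise ValueError("no <link> tag with an href found in the page")
    
    # unquote twice in case the href is still percent-encoded after one pass
    new_url = unquote(unquote(link['href']))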
    

    Thanks again for your help, you are a hero :)
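
For this particular link, the target address also appears percent-encoded in the `u` query parameter, so it may be possible to recover it without any HTTP request. The following is a minimal sketch under that assumption: `extract_target` is a hypothetical helper name, the parameter name `u` is taken from the example link and may differ for other redirectors, and the URL is a shortened copy of the one in the question (the long `dc` parameter is left out).

```python
from urllib.parse import urlsplit, parse_qs


def extract_target(redirect_url, param="u"):
    """Return the decoded value of `param` from the query string, or None if absent."""
    query = urlsplit(redirect_url).query
    values = parse_qs(query).get(param)  # parse_qs percent-decodes the values
    return values[0] if values else None


# Shortened copy of the link from the question (the `dc` parameter is omitted)
url = (
    "https://wtm.actualite.20minutes.fr/redirection.html"
    "?m=3e2b20a2f1f6dd3c60608f54d7ad4dc5&c=fr"
    "&u=https%3A%2F%2Fwww.20minutes.fr%2Fmonde%2F2943823-20210103-bahamas-"
    "disparition-bateau-20-personnes-bord%3Fxtor%3DEREC-182-%5Bactualite%5D"
)

print(extract_target(url))
# → https://www.20minutes.fr/monde/2943823-20210103-bahamas-disparition-bateau-20-personnes-bord?xtor=EREC-182-[actualite]
```

This only works when the redirect link carries its destination in the query string; otherwise fetching and parsing the page as above is still needed.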