I'm working on a project that aims to retrieve all the information from a news article (media website). For this I'm using the newspaper3k library, which works quite well.
However, I have a problem with some URLs (redirect links): according to my research, newspaper3k does not follow the redirection; it only processes the URL that was passed to it as a parameter.
Here is an example of a link I would like to deal with:
url = "wtm.actualite.20minutes.fr/redirection.html?m=3e2b20a2f1f6dd3c60608f54d7ad4dc5&c=fr&u=https%3A%2F%2Fwww.20minutes.fr%2Fmonde%2F2943823-20210103-bahamas-disparition-bateau-20-personnes-bord%3Fxtor%3DEREC-182-%5Bactualite%5D&dc=yt0U%2FI8COMJyjwQQ1fA2kVEXpoP0nsZydMTZS6jTm2DdKasFuV%2FVA7rEphhqMfGAy%2FlztUlVN4MJt5tg%2FQXfJwmXMRQL8g3Gfwhl%2BsjkkYmd%2BDxDUhb%2BpPRL%2BNsiDETNQeP3MmrQ6ATGJT%2Blf46Zg4DHd%2FzaXy%2B7UAuxatp2UcVd39HKuuMfQHmyDV%2BAxSAJrd4x5CxHqy3uTtZoQEjwGdZ%2FRtoa7YLOWLKhN9tg4TM%3D"
The goal with this URL is to get the final URL (after redirection) and then pass it to newspaper3k.
I have tried the following solutions, but they don't work on my side:
1 - using the requests library as follows: response = requests.get(url, verify=False, allow_redirects=True)
2 - using the mechanize library as follows:
br = mechanize.Browser()
resp = br.open(url)
I would like to get the same behaviour as when I use webbrowser (but without actually opening the browser):
import webbrowser
webbrowser.open_new(url)
and finally have the right URL.
Thank you in advance for your reply :)
@James Thank you very much for your answer! It helped me a lot.
I'm currently working on AWS Glue, so I'm restricted to certain libraries (Selenium is not available, I guess). However, here is my way of finding the link (following your logic, of course):
import requests
from bs4 import BeautifulSoup
from urllib.parse import unquote
url = "https://wtm.actualite.20minutes.fr/redirection.html?m=3e2b20a2f1f6dd3c60608f54d7ad4dc5&c=fr&u=https%3A%2F%2Fwww.20minutes.fr%2Fmonde%2F2943823-20210103-bahamas-disparition-bateau-20-personnes-bord%3Fxtor%3DEREC-182-%5Bactualite%5D&dc=yt0U%2FI8COMJyjwQQ1fA2kVEXpoP0nsZydMTZS6jTm2DdKasFuV%2FVA7rEphhqMfGAy%2FlztUlVN4MJt5tg%2FQXfJwmXMRQL8g3Gfwhl%2BsjkkYmd%2BDxDUhb%2BpPRL%2BNsiDETNQeP3MmrQ6ATGJT%2Blf46Zg4DHd%2FzaXy%2B7UAuxatp2UcVd39HKuuMfQHmyDV%2BAxSAJrd4x5CxHqy3uTtZoQEjwGdZ%2FRtoa7YLOWLKhN9tg4TM%3D"
# note: verify=False disables TLS certificate verification
response = requests.get(url, verify=False, allow_redirects=True)
if response.status_code == 200:
    page = response.text
    # parse the HTML using BeautifulSoup
    soup = BeautifulSoup(page, 'html.parser')
    # take the first <link> tag that has an href attribute
    link = soup.find("link", href=True)
    # unquote twice, in case the href is percent-encoded twice
    new_url = unquote(unquote(link['href']))
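As a side note, the tracking link itself seems to carry the destination in its u query parameter, so it may be possible to skip the HTTP request entirely. This is only a minimal sketch: it assumes the u parameter always holds the final URL, extract_target is a hypothetical helper name, and the sample link is a shortened, made-up one in the same shape as mine.

```python
from urllib.parse import urlsplit, parse_qs


def extract_target(tracking_url, param="u"):
    """Return the decoded value of `param` from the query string, or None.

    Hypothetical helper: assumes the redirect target is stored in `param`.
    """
    query = urlsplit(tracking_url).query
    # parse_qs percent-decodes the values once
    values = parse_qs(query).get(param)
    return values[0] if values else None


# Shortened, made-up link with the same structure as the real tracking URL
sample = (
    "https://wtm.actualite.20minutes.fr/redirection.html"
    "?m=abc&c=fr"
    "&u=https%3A%2F%2Fwww.20minutes.fr%2Fmonde%2Fexample-article%3Fxtor%3DEREC-182"
)

target = extract_target(sample)
print(target)
```

This only uses the standard library, which is convenient on AWS Glue. If the parameter is missing, the helper returns None, so falling back to the requests + BeautifulSoup approach is still possible.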
Thanks again for your help, you are a hero :)