python, python-requests, phishing

Safely handling potentially malicious URLs using Requests


I'm building a tool that analyzes emails to determine whether they're phishing attempts. I'd like to see if any of the links in an email redirect, and if they do, how many times and to where. I'm currently using the requests library to handle all of that, and to get a link's redirect history you have to call .get(). Is this safe to do on potentially malicious URLs, and if not, is there any way to get the redirect information without putting my computer at risk?


Solution

  • You could send a HEAD request with allow_redirects=True:

    >>> import requests
    >>> url = "http://stackoverflow.com/q/57298432/7954504"
    >>> resp = requests.request(
    ...     "HEAD",
    ...     url,
    ...     allow_redirects=True
    ... )
    >>> resp.history
    [<Response [301]>, <Response [302]>]
    >>> [i.url for i in resp.history]
    ['http://stackoverflow.com/q/57298432/7954504', 'https://stackoverflow.com/q/57298432/7954504']
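
    One caveat: some servers don't handle HEAD requests properly. A fallback
    I'd suggest (my own addition, not something requests requires) is a
    streamed GET, which follows the redirect chain the same way but doesn't
    download the final response body until you explicitly read it:

    >>> resp = requests.get(
    ...     url,
    ...     allow_redirects=True,
    ...     stream=True,  # defer fetching the body until it's read
    ...     timeout=10,   # don't hang on a slow or hostile server
    ... )
    >>> resp.history
    [<Response [301]>, <Response [302]>]
    >>> resp.close()  # release the connection without reading the body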
    

    Not saying this is a cure-all. Something else to consider is adding some heuristics on the URL itself, in the spirit of "you know a crappy-looking URL when you see one." (I like yarl for analyzing URLs.) For instance:
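
    Here's a minimal sketch of the kind of checks I mean, using yarl's URL
    class. The specific heuristics and threshold values are illustrative
    guesses, not hard rules; tune them against your own data:

        from yarl import URL

        def looks_suspicious(raw_url: str) -> bool:
            """Flag URLs with traits commonly seen in phishing links."""
            url = URL(raw_url)
            host = url.host or ""

            # Raw IP address instead of a hostname
            if host.replace(".", "").isdigit():
                return True
            # Punycode-encoded labels (possible homograph attack)
            if host.startswith("xn--") or ".xn--" in host:
                return True
            # Unusually deep subdomain nesting, e.g. paypal.com.evil.example
            if host.count(".") >= 4:
                return True
            # Credentials embedded in the URL (the "user@host" trick)
            if url.user is not None:
                return True
            # Explicit non-standard port on a web URL
            if url.port not in (80, 443, None):
                return True
            return False

    For example:

        >>> looks_suspicious("http://198.51.100.7/login")
        True
        >>> looks_suspicious("https://stackoverflow.com/q/57298432/7954504")
        False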

    ...and so on.