python regex

Regex to extract all URLs from a string


I have a string like this:

http://example.com/path/topage.htmlhttp://twitter.com/p/xyanhshttp://httpget.org/get.zipwww.google.com/privacy.htmlhttps://goodurl.net/

I would like to extract all URLs / web addresses into an array, for example:

urls = ['http://example.com/path/topage.html','http://twitter.com/p/xyan',.....]

Here is my approach, which didn't work:

import re
strings = "http://example.com/path/topage.htmlhttp://twitter.com/p/xyanhshttp://httpget.org/get.zipwww.google.com/privacy.htmlhttps://goodurl.net/"
links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', strings)

print(links)
# result is always the same as strings (a single match spanning the whole input)

Solution

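  • The problem is that your regex pattern is too inclusive. Inside a character class, $-_ is a range that also covers ':' and '/', so one greedy match runs through every URL in the string. You can stop each match at the start of the next URL by using a lookahead, (?=...).

    Try this:

    re.findall(r"((www\.|http://|https://)(www\.)*.*?(?=(www\.|http://|https://|$)))", strings)

    Because the pattern contains capturing groups, re.findall returns one tuple per match; the full URL is the first element of each tuple.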
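The lookahead idea above can be sketched as a small, self-contained script. This is only an illustration, not the original answer's code: it swaps the capturing groups for non-capturing ones so that re.findall returns plain strings, and the names URL_START, URL_PATTERN and extract_urls are made up for this example. It assumes every URL begins with "www.", "http://" or "https://".

```python
import re

# The input from the question, split across lines for readability.
strings = (
    "http://example.com/path/topage.html"
    "http://twitter.com/p/xyanhs"
    "http://httpget.org/get.zip"
    "www.google.com/privacy.html"
    "https://goodurl.net/"
)

# Why the original pattern over-matches: inside a character class, "$-_" is a
# range (ASCII 0x24 through 0x5F), so it also accepts ":" and "/".
assert re.fullmatch(r"[$-_]", ":") and re.fullmatch(r"[$-_]", "/")

# Hypothetical helper names. A URL is assumed to start with one of these markers.
URL_START = r"(?:www\.|https?://)"

# Match a start marker, then lazily consume characters until the lookahead
# sees the next start marker or the end of the string. All groups are
# non-capturing, so findall returns the whole match as a string.
URL_PATTERN = re.compile(URL_START + r"(?:www\.)*.*?(?=" + URL_START + r"|$)")


def extract_urls(text):
    """Return the concatenated URLs found in text, in order of appearance."""
    return URL_PATTERN.findall(text)


urls = extract_urls(strings)
print(urls)
```

One limitation of this approach: a URL whose path or query itself contains "www." or "http://" would be split at that point, since the lookahead cannot tell an embedded marker from the start of the next URL.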