
URL parsing in Python - normalizing double-slash in paths


I am working on an app which needs to parse URLs (mostly HTTP URLs) in HTML pages - I have no control over the input and some of it is, as expected, a bit messy.

One problem I'm encountering frequently is that urlparse is very strict (and possibly even buggy?) when it comes to parsing and joining URLs that have double-slashes in the path part, for example:

import urlparse

testUrl = 'http://www.example.com//path?foo=bar'
# Re-join the parsed path onto the original URL to drop the query part
print(urlparse.urljoin(testUrl,
                       urlparse.urlparse(testUrl).path))

Instead of the expected result http://www.example.com//path (or even better, with a normalized single slash), I end up with http://path.

BTW, the reason I'm running such code is that it's the only way I've found so far to strip the query / fragment part off a URL. Maybe there is a better way to do it, but I couldn't find one.
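Spelled out, the trick I'm using looks roughly like this (the strip_query name is just for illustration; this uses Python 2's urlparse module, as above - on Python 3 the same functions live in urllib.parse):

import urlparse

def strip_query(url):
    # Re-join the parsed path onto the original URL, which discards
    # the query and fragment parts.
    return urlparse.urljoin(url, urlparse.urlparse(url).path)

print(strip_query('http://www.example.com/path?foo=bar'))   # http://www.example.com/path
print(strip_query('http://www.example.com//path?foo=bar'))  # http://path  <- the problem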

Can anyone recommend a way to avoid this, or should I just normalize the path myself using a (relatively simple, I know) regex?
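For reference, this is roughly the normalization I have in mind (the normalize_slashes name is made up; splitting the URL first keeps the '//' after the scheme out of the regex's reach):

import re
import urlparse

def normalize_slashes(url):
    # Collapse runs of '/' in the path component only; scheme, host,
    # query and fragment are passed through untouched.
    parts = urlparse.urlsplit(url)
    path = re.sub(r'/{2,}', '/', parts.path)
    return urlparse.urlunsplit((parts.scheme, parts.netloc, path,
                                parts.query, parts.fragment))

print(normalize_slashes('http://www.example.com//path?foo=bar'))
# http://www.example.com/path?foo=bar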


Solution

  • If you only want to get the url without the query part, I would skip the urlparse module and just do:

    testUrl.split('?', 1)
    

    The url will be at index 0 of the returned list and the query (if there is one) at index 1.

    Splitting only once, at the first '?', keeps this robust even if a literal '?' happens to appear again later in the query or fragment.
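
    A quick usage sketch (the extra split on '#' is my own addition, to also drop a fragment, which the question mentions):

    testUrl = 'http://www.example.com//path?foo=bar#section'

    print(testUrl.split('?', 1))
    # ['http://www.example.com//path', 'foo=bar#section']

    print(testUrl.split('#', 1)[0].split('?', 1)[0])
    # http://www.example.com//path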