html django urlize

urlize unescaping %23 in the href of linked text


I am trying to render user-supplied text that contains a URL like this: https://example.com/%20%23654

I pass the URL to urlize and get this:

In [1]: from django.utils.html import urlize

In [2]: urlize('https://example.com/%20%23654')
Out[2]: u'<a href="https://example.com/%20#654">https://example.com/%20%23654</a>'

I understand that %20 decodes to a space and %23 to a hash, but why is only the hash unescaped in the href? Is this a bug? If it is intended, why is the %20 not unescaped to a space as well?


Solution

  • I don't think this is a bug.

    I see two parts to this question:

    1. Why is it only unescaping the hash and not the space?
    2. Why is it only doing the unescaping in the href and not in the visible linked text?

    Here are my thoughts on the first:

    A hash is a legal character in a URL: it introduces the fragment identifier, which is most often used to jump to an anchor in an HTML page (an example and a link to the docs in one!):

    http://www.w3.org/TR/html4/struct/links.html#h-12.2

    urlize realizes this and unescapes the hash in the href. The same happens with any character that may legally appear unencoded in a URL. Here is an example with the letter f:

    >>> urlize('https://example.com/%66')
    u'<a href="https://example.com/f">https://example.com/%66</a>'
    

    A space, on the other hand, is not a legal URL character (although it is often tolerated). Therefore it remains encoded as %20 both in the href and in the visible link text.
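The behaviour described above can be reproduced with the standard library alone. This is only a sketch in Python 3, under the assumption that urlize normalizes the href by decoding every %XX escape and then re-encoding whatever is not allowed in a URL. The name normalize_href and the SAFE character set are hypothetical and chosen for illustration; they are not taken from Django's source.

```python
from urllib.parse import quote, unquote

# Assumed set of reserved characters that are left as-is when re-encoding.
# Letters, digits and "_.-~" are never encoded by quote().
SAFE = "!*'();:@&=+$,/?#[]~"

def normalize_href(url):
    """Decode all %XX escapes, then re-encode only the non-URL-legal characters."""
    return quote(unquote(url), safe=SAFE)

print(normalize_href("https://example.com/%66"))        # https://example.com/f
print(normalize_href("https://example.com/%20%23654"))  # https://example.com/%20#654
```

In this sketch the f and the # survive the round trip because they are allowed unencoded, while the space is encoded back to %20. Note that a literal # starts the fragment, so for the hash the two forms are not parsed identically, unlike %66 and f.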

    The second part of the question is why the unescaping happens only in the href and not in the visible text. That also makes sense. In the href, it does not matter whether you pass in https://example.com/%66 or https://example.com/f: both point to the same resource, and the href is "under the hood." So urlize uses the simplest form, without the unnecessary encoding. The visible part, on the other hand, is presented to the user, so urlize preserves the text exactly as it was passed in, as that is the least surprising thing to do.
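The split between a normalized href and untouched visible text can be sketched the same way. This is a stdlib-only illustration in Python 3, not Django's implementation: link_for and SAFE are hypothetical names, and the decode-then-re-encode step is an assumption about how the href is produced.

```python
from html import escape
from urllib.parse import quote, unquote

# Assumed set of reserved characters that stay unencoded in the href.
SAFE = "!*'();:@&=+$,/?#[]~"

def link_for(url):
    """Build an anchor whose href is normalized but whose text is the original input."""
    href = quote(unquote(url), safe=SAFE)  # simplest equivalent encoding
    # Both values are HTML-escaped before being placed in the markup.
    return '<a href="{}">{}</a>'.format(escape(href), escape(url))

print(link_for("https://example.com/%20%23654"))
# <a href="https://example.com/%20#654">https://example.com/%20%23654</a>
```

The printed markup has the same shape as the output in the question: the href carries the re-encoded form, and the link text is exactly what was passed in.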