url

What do % signs mean in a url?


When I copy paste this Wikipedia article it looks like this.

http://en.wikipedia.org/wiki/Gruy%C3%A8re_%28cheese%29

However if you paste this back into the URL address the percent signs disappear and what appears to be Unicode characters ( and maybe special URL characters ) take the place of the percent signs.

Are these abbreviations for Unicode and special URL characters?

I'm use to seeing \u00ff, etc. in JavaScript.


Solution

  • The reference you're looking for is RFC 3987: Internationalized Resource Identifiers, specifically the section on mapping IRIs to URIs.

    RFC 3986: Uniform Resource Identifiers specifies that reserved characters must be percent-encoded, but it also specifies that percent-encoded characters are decoded to US-ASCII, which does not include characters such as è.

    RFC 3987 specifies that non-ASCII characters should first be encoded as UTF-8 so they can be percent-encoded as per RFC 3986. If you'll permit me to illustrate in Python:

    >>> u'è'.encode('utf-8')
    '\xc3\xa8'
    

    Here I've asked Python to encode the Unicode è to a string of bytes using UTF-8. The bytes returned are 0xc3 and 0xa8. Percent-encoded, this looks like %C3%A8.

    The parentheses also appearing in your URL do fit in US-ASCII, so they are percent-escaped with their US-ASCII code points, which are also valid UTF-8.

    So, no, there is no simple 16×16 table—such a table could never represent the richness of Unicode. But there is a method to the apparent madness.