url browser unicode punycode iri

Why/how does the browser decide that ☃.net goes to xn--n3h.net?


If we type this into Firefox or Chrome:

http://☃.net/

It takes us to

http://xn--n3h.net/

Which is a mirror of unicodesnowmanforyou.com

What I don't understand is by what rules the Unicode snowman gets encoded as xn--n3h. It doesn't look anything like UTF-8 or URL encoding.

I think I found a hint while mucking around in Python 3, because:

>>> '☃'.encode('punycode')
b'n3h'

But I still don't understand the xn-- part. How are domain names internationalised, what is the standard and where is this stuff documented?


Solution

  • It uses an encoding scheme called Punycode (as you've already discovered from the Python testing you've done), capable of representing Unicode characters in ASCII-only format.

    Each label (delimited by dots, so get.me.a.coffee.com has five labels) that contains Unicode characters is encoded in Punycode and prefixed with the string xn--.

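    The label encoding first copies all the ASCII characters in their original order, then adds a - delimiter (only if the label contained any ASCII characters), then appends the encoded non-ASCII characters. The encoded part therefore always comes after the final - in the label. For example, bücher becomes bcher-kva, while ☃ contains no ASCII characters and so becomes just n3h, with no delimiter.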

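    More detail can be found in this page over at the w3 site, and in RFC 3987, which covers IRIs in general. The xn-- prefix itself comes from the IDNA standard (RFC 3490, later revised by RFC 5890 and RFC 5891), and Punycode is specified in RFC 3492. For details on how Punycode actually encodes labels, see the Wikipedia page.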
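
As a rough illustration of the per-label rule described above, here is a minimal Python sketch built on the same punycode codec used in the question. The function names to_ascii_hostname and to_unicode_hostname are made up for this example. It also deliberately skips the normalisation and validation steps that real IDNA processing performs (case mapping, disallowed characters, length limits), so treat it as a demonstration of the xn-- mechanics rather than what a browser actually runs.

```python
ACE_PREFIX = "xn--"  # marks a label as Punycode-encoded


def to_ascii_hostname(hostname: str) -> str:
    """Hypothetical helper: Punycode-encode each non-ASCII label.

    Simplification: no IDNA normalisation or validation is applied.
    """
    encoded = []
    for label in hostname.split("."):
        if label.isascii():
            # Pure-ASCII labels are left untouched.
            encoded.append(label)
        else:
            puny = label.encode("punycode").decode("ascii")
            encoded.append(ACE_PREFIX + puny)
    return ".".join(encoded)


def to_unicode_hostname(hostname: str) -> str:
    """Hypothetical helper: reverse the transformation above."""
    decoded = []
    for label in hostname.split("."):
        if label.lower().startswith(ACE_PREFIX):
            puny = label[len(ACE_PREFIX):]
            decoded.append(puny.encode("ascii").decode("punycode"))
        else:
            decoded.append(label)
    return ".".join(decoded)


print(to_ascii_hostname("☃.net"))             # xn--n3h.net
print(to_ascii_hostname("bücher.example"))    # xn--bcher-kva.example
print(to_unicode_hostname("xn--n3h.net"))     # ☃.net
```

Note how bücher keeps its ASCII letters in front of the - delimiter, while the snowman label has nothing to copy and so has no delimiter at all. Python also ships an idna codec that applies the prefix for you, but it implements the older IDNA 2003 rules.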