I've been looking into internationalised resource identifiers and there's one thing bugging me.
My understanding is that, for each label in a domain name (xyzzy.plugh.com has three labels: xyzzy, plugh and com), the following process is performed to translate it into an ASCII representation so that it can be processed by all legacy software:
- We output xn-- followed by all the ASCII characters (skipping non-ASCII).
- We then output - to separate the ASCII from the non-ASCII.

My question then is: how do we distinguish between the following two Unicode URIs?
http://aa☃.net/
http://☃aa.net/
It seems to me that both of these will encode to:
http://xn--aa-nfh.net/
simply because the sequencing information has been lost for the label as a whole.
Or am I missing something in the specification?
According to one Punycode encoder, they are encoded differently:
aa☃.net -> xn--aa-gsx.net
☃aa.net -> xn--aa-esx.net
Note that the two encodings differ only in the first character after the final hyphen (g versus e).
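These results can be reproduced without an online tool. The following is a minimal sketch, assuming Python 3 and its built-in punycode codec, which works on a single label and does not add the xn-- prefix itself (the prefix is prepended by hand here):

```python
# Encode each label with the standard-library "punycode" codec.
# The codec handles one label at a time and omits the "xn--" ACE prefix.
labels = ["aa☃", "☃aa"]

encoded = {
    label: "xn--" + label.encode("punycode").decode("ascii")
    for label in labels
}

# Print only the ASCII form so the output is safe on any terminal.
for ace in encoded.values():
    print(ace + ".net")
```

The two labels produce different ASCII forms, so the ordering of the characters is not lost.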
The relevant RFC 3492 details why this is the case. First, it provides clues in the introduction:
Uniqueness: There is at most one basic string that represents a given extended string.
Reversibility: Any extended string mapped to a basic string can be recovered from that basic string.
That means there must be a distinguishable one-to-one mapping for every basic/extended string pair.
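Both properties can be spot-checked for this example. The sketch below assumes Python 3's built-in punycode codec; the helper names to_ace and from_ace are hypothetical, introduced only for illustration. It places the snowman at every possible position in aa and checks that each result is distinct and decodes back to the original:

```python
def to_ace(label: str) -> str:
    """Hypothetical helper: Punycode-encode one label and add the xn-- prefix."""
    return "xn--" + label.encode("punycode").decode("ascii")


def from_ace(ace: str) -> str:
    """Hypothetical helper: strip the xn-- prefix and decode the label."""
    return ace[len("xn--"):].encode("ascii").decode("punycode")


# The snowman inserted at each of the three possible positions in "aa".
candidates = ["☃aa", "a☃a", "aa☃"]
aces = [to_ace(c) for c in candidates]

# Uniqueness: no two extended strings share a basic string.
assert len(set(aces)) == len(candidates)

# Reversibility: every basic string decodes back to its extended string.
assert [from_ace(a) for a in aces] == candidates
```

This is only a check on three strings, not a proof; the general guarantee comes from the RFC itself.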
Understanding how it differentiates the two possibilities requires understanding how the decoder (the thing that turns the basic string back into an extended one, in all its Unicode glory) works.
The decoder begins with just the basic (ASCII) characters of the label, aa, with a pointer to the first a, then applies a series of deltas, such as gsx or esx. Punycode is applied per label, so the .net part is handled separately.
The delta actually encodes two things. The first is the number of non-insertions to be done and the second is the actual insertion.
So, gsx (the delta in aa☃.net) would encode two non-insertions (to skip the aa) followed by an insertion of ☃. The esx delta (for ☃aa.net) would encode zero non-insertions followed by an insertion of ☃.
That is how position is encoded into the basic strings.
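The arithmetic behind those two deltas can be worked through by hand. The sketch below is a simplified illustration, not a full decoder: it decodes only the first delta of a label using the RFC 3492 parameters (base 36, tmin 1, tmax 26, initial bias 72, initial n 128) and skips bias adaptation, which is enough when there is a single non-ASCII character. The function names are hypothetical. The decoded integer is then split into a code point and an insertion position, since the decoder conceptually steps through every position for each successive code point:

```python
BASE, TMIN, TMAX, INITIAL_BIAS, INITIAL_N = 36, 1, 26, 72, 128


def digit_value(ch: str) -> int:
    """Map a Punycode digit to its value: a-z -> 0..25, 0-9 -> 26..35."""
    if ch.isdigit():
        return ord(ch) - ord("0") + 26
    return ord(ch.lower()) - ord("a")


def decode_first_delta(digits: str, bias: int = INITIAL_BIAS) -> int:
    """Decode one variable-length integer (first delta only, no bias adaptation)."""
    value, weight, k = 0, 1, BASE
    for ch in digits:
        d = digit_value(ch)
        value += d * weight
        # Threshold for this digit position, clamped to [TMIN, TMAX].
        t = min(max(k - bias, TMIN), TMAX)
        if d < t:  # a digit below the threshold ends the integer
            break
        weight *= BASE - t
        k += BASE
    return value


def split_delta(delta: int, basic_len: int) -> tuple[int, int]:
    """Split a first delta into (code point, insertion position)."""
    slots = basic_len + 1  # positions available in the current string
    return INITIAL_N + delta // slots, delta % slots


for digits in ("gsx", "esx"):
    delta = decode_first_delta(digits)
    code_point, position = split_delta(delta, basic_len=2)
    print(digits, delta, hex(code_point), position)
```

Both deltas yield the same code point, U+2603, but gsx gives position 2 (after aa) and esx gives position 0 (before aa), which matches the non-insertion counts described above.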