punycode iri

How does punycode distinguish similar IRIs?


I've been looking into internationalised resource identifiers and there's one thing bugging me.

My understanding is that, for each label in a domain name (xyzzy.plugh.com has three labels, xyzzy, plugh and com), the following process is performed to translate it into ASCII representation so that it can be processed okay by all legacy software:

My question then is: how do we distinguish between the following two Unicode URIs?

http://aa☃.net/
http://☃aa.net/

It seems to me that both of these will encode to:

http://xn--aa-nfh.net/

simply because the sequencing information has been lost for the label as a whole.

Or am I missing something in the specification?


Solution

  • According to one punycode encoder, they are encoded differently:

    aa☃.net -> xn--aa-gsx.net
    ☃aa.net -> xn--aa-esx.net
                      ^
                      see here
    

    RFC 3492, the relevant specification, details why this is the case. First, it provides clues in the introduction:

    Uniqueness: There is at most one basic string that represents a given extended string.

    Reversibility: Any extended string mapped to a basic string can be recovered from that basic string.

    That means there must be a distinct one-to-one mapping for every single basic/extended string pair.

    Understanding how it differentiates the two possibilities requires an understanding of how the decoder (the thing that turns the basic string back into an extended one, with all its Unicode glory) works.

    The decoder starts with just the basic code points of the label, aa (Punycode works on one label at a time, so .net is not part of the string), with a pointer to the first a. It then applies a series of deltas, here a single one: gsx or esx.

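    Each delta is a single integer that encodes two things: which code point to insert and where to insert it. The decoder steps through every possible (code point, position) pair in order, and the delta is the number of pairs to skip (the non-insertions) before the next insertion is made.

    So, gsx (the delta in aa☃.net) decodes to 28811. With three possible insertion positions around aa, that is 9603 full passes, which advance the code point from 128 to U+2603 (☃), plus two more non-insertions (to skip the aa), followed by an insertion of ☃. The esx delta (for ☃aa.net) decodes to 28809: the same 9603 passes, then zero further non-insertions, followed by an insertion of ☃.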

    That is how position is encoded into the basic strings.
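
To make the delta arithmetic concrete, here is a minimal Python sketch of the decoding loop for a single label (without the xn-- prefix). It follows my reading of RFC 3492 section 6.2, but it is an illustration rather than a validated implementation: it has no overflow or malformed-input checks, and the names decode_with_trace, digit_value and adapt are my own, not anything from the RFC or a library.

```python
# Sketch of the RFC 3492 decoding loop for one label (no "xn--" prefix).
# Parameter values are the ones the RFC specifies for Punycode.
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128


def adapt(delta, numpoints, firsttime):
    """Bias adaptation (RFC 3492 section 6.1)."""
    delta = delta // DAMP if firsttime else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)


def digit_value(ch):
    """Map a-z to 0-25 and 0-9 to 26-35."""
    ch = ch.lower()
    if "a" <= ch <= "z":
        return ord(ch) - ord("a")
    if "0" <= ch <= "9":
        return ord(ch) - ord("0") + 26
    raise ValueError(f"not a Punycode digit: {ch!r}")


def decode_with_trace(label):
    """Return (decoded string, list of (position, character) insertions)."""
    # Everything before the last "-" is the basic string; the rest is deltas.
    basic, _, extended = label.rpartition("-")
    output = list(basic)
    n, i, bias = INITIAL_N, 0, INITIAL_BIAS
    trace = []
    pos = 0
    while pos < len(extended):
        # Read one variable-length integer and add it to i.
        old_i, w, k = i, 1, BASE
        while True:
            digit = digit_value(extended[pos])
            pos += 1
            i += digit * w
            t = min(max(k - bias, TMIN), TMAX)
            if digit < t:
                break
            w *= BASE - t
            k += BASE
        slots = len(output) + 1          # possible insertion positions
        bias = adapt(i - old_i, slots, old_i == 0)
        n += i // slots                  # full passes advance the code point
        i %= slots                       # remainder is the insertion position
        output.insert(i, chr(n))
        trace.append((i, chr(n)))
        i += 1
    return "".join(output), trace


print(decode_with_trace("aa-gsx"))  # insertion recorded at position 2
print(decode_with_trace("aa-esx"))  # insertion recorded at position 0
```

The two labels differ only in the remainder left after dividing the delta by the number of insertion slots, which is exactly where the position lives. As a cross-check, Python's built-in codec should agree: "aa☃".encode("punycode") and "☃aa".encode("punycode") give the aa-gsx and aa-esx forms shown above.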