[SOLVED] Java IDN functions not reversible?

Why are some IDNs not reversible?

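import java.net.IDN;

String domain = "aŉċăwb7rňuħ.eu";
System.out.println(domain);       // original Unicode form
domain = IDN.toASCII(domain);
System.out.println(domain);       // ASCII (Punycode) form
domain = IDN.toUnicode(domain);
System.out.println(domain);       // converted back to Unicode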

It displays:

aŉċăwb7rňuħ.eu
xn--anwb7ru-93a5e8ozmq2m.eu
aʼnċăwb7rňuħ.eu

As you can see, the second character has been split!

Thanks

Solution

This is by design. From what I can tell, the 2nd character in your string is the code point \u0149 (LATIN SMALL LETTER N PRECEDED BY APOSTROPHE). According to the latest Unicode code charts:

this character is deprecated and its use is strongly discouraged

The Unicode code chart says that the deprecated code point is equivalent to \u02bc followed by \u006e.
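You can check this equivalence from Java without reading the chart. The following is only a sketch: it assumes that compatibility normalization (NFKC) via java.text.Normalizer is an acceptable way to look at the decomposition, and the class and method names are made up for illustration.

```java
import java.text.Normalizer;

final class DecompositionDemo {
    // Normalizes the input to NFKC and lists the resulting
    // UTF-16 code units in "U+XXXX" notation, separated by spaces.
    static String nfkcCodePoints(String s) {
        String normalized = Normalizer.normalize(s, Normalizer.Form.NFKC);
        StringBuilder sb = new StringBuilder();
        for (char c : normalized.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The deprecated character from the question, written as an escape.
        System.out.println(nfkcCodePoints("\u0149")); // prints "U+02BC U+006E"
    }
}
```

The single deprecated character comes back as two: the modifier letter apostrophe followed by a plain n.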

According to the javadocs, the first step that IDN.toASCII(String) performs is to process the characters in the input string with the nameprep algorithm (RFC 3491), which is a profile of stringprep (RFC 3454). The RFC 3491 abstract says:

This document describes how to prepare internationalized domain name (IDN) labels in order to increase the likelihood that name input and name comparison work in ways that make sense for typical users throughout the world. This profile of the stringprep protocol is used as part of a suite of on-the-wire protocols for internationalizing the Domain Name System (DNS).

(In other words, stringprep is designed to make it harder to create tricky domain names that look like one thing and mean something different.)

In fact, if you drill down, you will find that the prescribed mapping in stringprep tables for \u0149 is \u02bc \u006e ; i.e. the equivalent defined in the Unicode code charts.

And ... that is what is happening.
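If what you actually need is to compare or store domain names, one option is to treat the ASCII form as the canonical one, since it stays the same once the mapping has been applied. Here is a minimal sketch of that idea; the class and method names are hypothetical, and it uses IDN.toASCII / IDN.toUnicode with their default flags.

```java
import java.net.IDN;

final class IdnCompare {
    // Canonical key for a domain: its ASCII (Punycode) form.
    static String aceForm(String domain) {
        return IDN.toASCII(domain);
    }

    // True when converting to ASCII and back reproduces the input exactly.
    static boolean roundTrips(String domain) {
        return IDN.toUnicode(IDN.toASCII(domain)).equals(domain);
    }

    // Two spellings name the same domain when their ASCII forms match.
    static boolean sameDomain(String a, String b) {
        return aceForm(a).equals(aceForm(b));
    }

    public static void main(String[] args) {
        // Same string as in the question, written with Unicode escapes.
        String original = "a\u0149\u010b\u0103wb7r\u0148u\u0127.eu";
        String mapped = IDN.toUnicode(IDN.toASCII(original));

        System.out.println(roundTrips(original));
        System.out.println(roundTrips(mapped));
        System.out.println(sameDomain(original, mapped));
    }
}
```

Comparing with sameDomain instead of String.equals on the Unicode forms avoids depending on whether the input had already been through nameprep.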

Summary

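Your expectation that you can round-trip IDNs is ill-founded: the nameprep mapping applied by toASCII is lossy, so toUnicode(toASCII(x)) is not guaranteed to equal x.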
You shouldn't be using that character anyway, since it is deprecated. (Certainly, it is a bad idea to use it in an IDN!)