I am working with a list of file names in Java.
I observe that some single characters in the file names, like ä, ö and ü, actually consist of a sequence of two code points: ö is represented by o followed by ¨. I can see this by inspecting the string with codePointAt(). The German name "Rölli" is in fact "Ro¨lli":
...
20: R, 82
21: o, 111
22: ̈, 776
23: l, 108
24: l, 108
25: i, 105
...
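A dump like the one above can be produced with a loop along these lines (my assumption; the original inspection code is not shown). It walks the string code point by code point rather than char by char, so it would also handle characters outside the Basic Multilingual Plane:

```java
// Prints each code point of a string: index, the character, and its numeric value.
// Sketch only; the class and method names here are made up for illustration.
public class CodePointDump {

    public static void dump(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("%d: %s, %d%n", i, new String(Character.toChars(cp)), cp);
            i += Character.charCount(cp); // advance by 1 or 2 chars (surrogate pairs)
        }
    }

    public static void main(String[] args) {
        // "Rölli" with a decomposed ö: 'o' followed by U+0308 (776)
        dump("Ro\u0308lli");
    }
}
```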
The character ¨
in the log above has the value 776, a "Combining Diaeresis". This is a so-called combining mark, more precisely a combining diacritic. So it all makes sense, but I do not understand which software component combines the two characters into one umlaut, and where this behavior is specified.
print()
of the string already shows me the combined character, so it is not some UI layer above. What component causes combining characters to be displayed as single combined characters? How reliable is all this?
Does Java have a normalization method that turns such combined sequences into single code points? That would be a help for using regex...
Thanks a lot for any hint.
Answer 1: Specification and responsibility
The behavior you describe is defined in Unicode Standard Annex #15, "Unicode Normalization Forms". It covers the equivalence of combining character sequences and precomposed code points, and the composition and decomposition of code points. Many languages other than German rely heavily on combining graphemes.
Java internally represents strings as UTF-16. So all its String
class does is hand sequences of UTF-16 code units to other components. It is up to the surrounding software (e.g. any kind of text view component) to render those sequences correctly. You notice this in moments where e.g. a regex breaks your combined ö
apart, yet the string is shown correctly in some view.
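A small sketch of that regex pitfall: the pattern below contains ö as a single precomposed code point (U+00F6), so by default it does not match the visually identical decomposed sequence o + U+0308:

```java
import java.util.regex.Pattern;

// Default regex matching compares code points literally, so a precomposed
// pattern does not find a decomposed umlaut - even though both render as "ö".
public class RegexUmlaut {
    public static void main(String[] args) {
        String composed   = "R\u00F6lli";  // ö as one code point, U+00F6
        String decomposed = "Ro\u0308lli"; // o + combining diaeresis U+0308

        Pattern p = Pattern.compile("\u00F6");
        System.out.println(p.matcher(composed).find());   // true
        System.out.println(p.matcher(decomposed).find()); // false
    }
}
```

Java's Pattern class does offer the CANON_EQ flag, which makes matching use canonical equivalence, but normalizing the input first (see Answer 2) is usually the more robust approach.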
By the way, if you do some experiments with the Combining Diaeresis, be aware that there is also a "non-combining" code point 168 (U+00A8), a plain Latin-1 character called "Diaeresis" - a spacing character. Code point 168 does not cause any software to combine two code points into one. For this you need the combining mark 776 (U+0308).
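The difference between the two is visible in their Unicode general categories, which Java exposes via Character.getType(): U+00A8 is a spacing modifier symbol, while U+0308 is a non-spacing mark (the kind that attaches to the preceding character):

```java
// Distinguishes the spacing diaeresis U+00A8 from the combining
// diaeresis U+0308 by their Unicode general categories.
public class DiaeresisKinds {
    public static void main(String[] args) {
        // U+00A8: category Sk (Modifier_Symbol), stands on its own
        System.out.println(Character.getType(0x00A8) == Character.MODIFIER_SYMBOL);

        // U+0308: category Mn (Nonspacing_Mark), combines with the previous char
        System.out.println(Character.getType(0x0308) == Character.NON_SPACING_MARK);
    }
}
```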
Answer 2: Java's normalization method
Basically, you should always take combining sequences into account - unless you are sure that your data source cannot deliver them. It is a good idea to sanitize your strings first.
Look for Unicode normalization methods in your language; they spare you from fiddling with individual replace()
calls, and they encode a lot of accumulated experience.
Java has a Normalizer
class that deals with the different representations of combined characters:
https://docs.oracle.com/javase/7/docs/api/java/text/Normalizer.html
and the tutorial for it: https://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html
So after invoking this code line:
String normalized = Normalizer.normalize(someFileName, Normalizer.Form.NFC);
the log print from the question above looks like this:
...
19: , 32
20: R, 82
21: ö, 246 <<< here were two combined chars before normalize()
22: l, 108
23: l, 108
24: i, 105
...
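Put together, a minimal round trip looks like this: NFC composes o + U+0308 into the single code point ö (246), and NFD decomposes it again. The example string is the decomposed "Rölli" from the question:

```java
import java.text.Normalizer;

// NFC composes combining sequences into precomposed code points;
// NFD performs the opposite decomposition.
public class NormalizeDemo {
    public static void main(String[] args) {
        String decomposed = "Ro\u0308lli";      // 6 code points: o + U+0308

        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(nfc.length());       // 5 - one code point fewer
        System.out.println(nfc.codePointAt(1)); // 246 - precomposed ö (U+00F6)

        String nfd = Normalizer.normalize(nfc, Normalizer.Form.NFD);
        System.out.println(nfd.codePointAt(2)); // 776 - combining diaeresis again
    }
}
```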