unicode internationalization bidi cldr

Strange list pattern format in the CLDR Arabic locale


In the CLDR 25 data I found the following list pattern entries for the Arabic locale (the Hebrew locale has similar ones):

<listPatterns>
  <listPattern>
    <listPatternPart type="start" draft="contributed">{0}، {1}</listPatternPart>
    <listPatternPart type="middle" draft="contributed">{0}، {1}</listPatternPart>
    <listPatternPart type="end" draft="contributed">{0}، و {1}</listPatternPart>
    <listPatternPart type="2" draft="contributed">{0} و {1}</listPatternPart>
  </listPattern>
</listPatterns>

Note that the LDML specification only mentions placeholders of the form "{0}" or "{1}", which is not how the placeholders in the list pattern parts of type "end" and "2" are displayed. See also:

http://cldr.unicode.org/development/development-process/design-proposals/list-formatting

or

http://cldr.unicode.org/translation/lists

I suspect this has something to do with right-to-left rendering, but how exactly?
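For context, the four parts are meant to be combined pairwise. Below is a minimal Java sketch of that composition, following my reading of the LDML description rather than any official implementation. The class and method names are hypothetical, the patterns are the Arabic ones quoted above written with \u escapes (\u060C is the Arabic comma, \u0648 the letter waw) so the source is not affected by bidi display, and the sketch assumes "{0}" precedes "{1}" in every pattern.

```java
import java.util.Arrays;
import java.util.List;

class ListPatternSketch {
    // Patterns copied from the Arabic CLDR 25 entry, in logical (stored) order.
    static final String START = "{0}\u060C {1}";
    static final String MIDDLE = "{0}\u060C {1}";
    static final String END = "{0}\u060C \u0648 {1}";
    static final String TWO = "{0} \u0648 {1}";

    // Fills both placeholders; assumes "{0}" occurs before "{1}" in the pattern.
    static String apply(String pattern, String first, String second) {
        int i0 = pattern.indexOf("{0}");
        int i1 = pattern.indexOf("{1}");
        return pattern.substring(0, i0) + first
                + pattern.substring(i0 + 3, i1) + second
                + pattern.substring(i1 + 3);
    }

    // Hypothetical helper: "2" for pairs, otherwise "end" on the last two
    // items, "middle" working towards the front, and "start" for the first item.
    static String format(List<String> items) {
        int n = items.size();
        if (n == 0) return "";
        if (n == 1) return items.get(0);
        if (n == 2) return apply(TWO, items.get(0), items.get(1));
        String result = apply(END, items.get(n - 2), items.get(n - 1));
        for (int i = n - 3; i >= 1; i--) {
            result = apply(MIDDLE, items.get(i), result);
        }
        return apply(START, items.get(0), result);
    }

    public static void main(String[] args) {
        System.out.println(format(Arrays.asList("a", "b", "c")));
    }
}
```

The result is a logical char sequence; how a browser or IDE displays it is again subject to bidirectional reordering.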


UPDATE:

I have now written a small Java program to see the actual sequence of chars.

String s = "{0} و {1}"; // as displayed in browser or IDE-window
for (char c : s.toCharArray()) {
    System.out.println(c);
}

The output is:

{
0
}

و

{
1
}

So this seems to be a display problem rather than a problem with the char sequence itself. I use Internet Explorer 9 and Eclipse 4.3.
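The loop above shows only the characters. A variant that also prints each numeric code point and marks Arabic right-to-left letters makes the hidden directionality visible. This is only a sketch: the class and method names are hypothetical, and the string is written with a \u escape so that the source itself is not reordered by the editor.

```java
class CodePointDump {
    // Describes one code point: decimal value, the character itself, and a
    // marker if Java classifies it as an Arabic right-to-left letter.
    static String describe(int cp) {
        boolean arabicRtl = Character.getDirectionality(cp)
                == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
        return cp + "=>" + new String(Character.toChars(cp))
                + (arabicRtl ? "   // DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC" : "");
    }

    public static void main(String[] args) {
        String s = "{0} \u0648 {1}"; // \u0648 is the Arabic letter waw
        s.codePoints().forEach(cp -> System.out.println(describe(cp)));
    }
}
```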


Solution

  • Here is the char sequence as code points:

    123=>{
    48=>0
    125=>}
    32=> 
    1608=>و   // DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC=true
    32=> 
    123=>{
    49=>1
    125=>}
    

    Unicode derives the display order not only from the stored char sequence but also from the bidirectional context. Here the Unicode bidirectional algorithm seems to apply the standard LTR context to the leading chars first, which preserves the char sequence "{0} " as it is.

    When the algorithm reaches the Arabic char, it takes note of that char's right-to-left directionality and applies it to the chars that follow. According to the official W3C paper this means:

    In an RTL (right-to-left) context the glyph of the opening bracket "{" is mirrored to "}". So, seen from the Arabic char, the sequence displayed to its left is "1} ", which is equivalent to the usual LTR form " {1". After the ASCII digit "1" the Unicode algorithm treats the context as LTR again and therefore displays the closing bracket in its normal form "}". The final visual result (though not the sequence of code points) then looks as if there were one extra closing bracket and one opening bracket too few.

    I hope SO readers find this explanation useful when they run into similar strange visual effects in a bidirectional context.
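The direction assigned to each char can be inspected with java.text.Bidi. The following is a minimal sketch that assumes a left-to-right paragraph; the class and method names are hypothetical. Odd levels mean right-to-left, even levels left-to-right. The levels reported for the brackets around the "1" may differ between Java versions, because later Unicode versions added a rule for paired brackets.

```java
import java.text.Bidi;

class BidiLevelsSketch {
    // Resolves the embedding level of every char, assuming an LTR paragraph.
    static int[] levels(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_LEFT_TO_RIGHT);
        int[] result = new int[text.length()];
        for (int i = 0; i < text.length(); i++) {
            result[i] = bidi.getLevelAt(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "{0} \u0648 {1}"; // \u0648 is the Arabic letter waw
        int[] levels = levels(s);
        for (int i = 0; i < s.length(); i++) {
            // Print the code unit value instead of the glyph to avoid reordering.
            System.out.println((int) s.charAt(i) + " level=" + levels[i]
                    + (levels[i] % 2 == 1 ? " (RTL)" : " (LTR)"));
        }
    }
}
```

The leading "{0} " stays at an even level, while the Arabic letter is resolved to an odd level, which matches the observation that only the second half of the string is displayed in an unexpected way.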