unicode jvm java java-7

How can I programmatically identify which Unicode version is supported in Java?


Since Java code can run on any Java VM, I'd like to know how to identify programmatically which Unicode version the running VM supports.


Solution

  • The easiest but worst way I can think of would be to pick a code point that’s new to each Unicode release and check its Character properties. Or you could check its General Category with a regex. Here are some selected code points:

    I've included the general category and the script property, although you can only inspect the script in JDK7 or later, since JDK7 is the first Java release that supports it.

    I found those code points by running commands like this from the command line:

    % unichars -gs '\p{Age=5.1}'
    % unichars -gs '\p{Lu}' '\p{Age=5.0}'
    

    That’s the unichars program. It will only find properties supported in the Unicode Character Database for whichever UCD version your version of Perl supports.

    I also like my output sorted, so I tend to run

     % unichars -gs '\p{Alphabetic}' '\p{Age=6.0}' | ucsort | less -r
    

    where ucsort is the program that sorts text according to the Unicode Collation Algorithm.

    In Perl, unlike in Java, it is easy to find out which Unicode version is supported. For example, if you run this from the command line (yes, there’s a programmatic API, too), you find:

    $ corelist -a Unicode
        v5.6.2     3.0.1     
        v5.8.0     3.2.0     
        v5.8.1     4.0.0 
        v5.8.8     4.1.0
        v5.10.0    5.0.0     
        v5.10.1    5.1.0 
        v5.12.0    5.2.0 
        v5.14.0    6.0.0
    

    That shows that Perl version 5.14.0 was the first one to support Unicode 6.0.0. For Java, I believe there is no API that gives you this information directly, so you’ll have to hardcode a table mapping Java versions to Unicode versions, or else use the empirical method of testing code points for properties. By empirical, I mean the equivalent of this sort of thing:

    % ruby -le 'print "\u2C75" =~ /\p{Lu}/ ? "pass 5.2" : "fail 5.2"'
    pass 5.2
    % ruby -le 'print "\uA7A0" =~ /\p{Lu}/ ? "pass 6.0" : "fail 6.0"'
    fail 6.0
    % ruby -v
    ruby 1.9.2p0 (2010-08-18 revision 29036) [i386-darwin9.8.0]
    
    % perl -le 'print "\x{2C75}" =~ /\p{Lu}/ ? "pass 5.2" : "fail 5.2"'
    pass 5.2
    % perl -le 'print "\x{A7A0}" =~ /\p{Lu}/ ? "pass 6.0" : "fail 6.0"'
    pass 6.0
    % perl -v
    This is perl 5, version 14, subversion 0 (v5.14.0) built for darwin-2level
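
    A rough Java equivalent of the probes above might look like the following. This is only a sketch: the class name UnicodeProbe and its helper methods are made up for illustration, and the code points are the ones already shown in this answer (U+A7A0 for 6.0, U+10424 for 3.1). It checks the General Category both through Character.getType and through a \p{Lu} regex.

```java
import java.util.regex.Pattern;

// Hypothetical helper: empirically probes the running JVM's Unicode data.
class UnicodeProbe {

    // True if this JVM's character database classifies cp as an uppercase letter (Lu).
    // An unassigned code point reports Character.UNASSIGNED instead, so this returns false.
    static boolean isUppercaseLetter(int cp) {
        return Character.getType(cp) == Character.UPPERCASE_LETTER;
    }

    // The same check expressed as a General Category regex.
    static boolean matchesLu(int cp) {
        String s = new String(Character.toChars(cp)); // handles supplementary code points
        return Pattern.matches("\\p{Lu}", s);
    }

    public static void main(String[] args) {
        // U+A7A0 is the Unicode 6.0 probe used in the Ruby and Perl one-liners.
        System.out.println(isUppercaseLetter(0xA7A0) ? "pass 6.0" : "fail 6.0");
        // U+10424 DESERET CAPITAL LETTER EN has Age=3.1 according to uniprops.
        System.out.println(matchesLu(0x10424) ? "pass 3.1" : "fail 3.1");
    }
}
```

    A "fail" only tells you the JVM's data predates that release; a "pass" tells you it is at least that recent, so you would probe from the newest release downward.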
    

    To find out the age of a particular code point, run uniprops -a on it like this:

    % uniprops -a 10424
    U+10424 ‹𐐤› \N{DESERET CAPITAL LETTER EN}
     \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
     All Any Alnum Alpha Alphabetic Assigned InDeseret Cased Cased_Letter LC Changes_When_Casefolded CWCF Changes_When_Casemapped CWCM Changes_When_Lowercased CWL Changes_When_NFKC_Casefolded CWKCF Deseret Dsrt Lu L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter L_ Uppercase_Letter Print Upper Uppercase Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Upper X_POSIX_Word
     Age=3.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Deseret Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None Script=Deseret East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Dsrt Script=Dsrt Sentence_Break=UP Sentence_Break=Upper SB=UP Word_Break=ALetter WB=LE Word_Break=LE _X_Begin
    

    All my Unicode tools are available in the Unicode::Tussle bundle, including unichars, uninames, uniquote, ucsort, and many more.

    Java 1.7 Improvements

    JDK7 goes a long way toward making a few Unicode things easier. I talk about that a bit at the end of my OSCON Unicode Support Shootout talk. I had thought of putting together a table of which languages support which versions of Unicode in which versions of those languages, but ended up scrapping that to tell people to just get the latest version of each language. For example, I know that Unicode 6.0.0 is supported by Java 1.7, Perl 5.14, and Python 3.2.
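
    The hardcoded-table approach mentioned earlier could be sketched in Java roughly like this. The class UnicodeVersionTable and its methods are hypothetical names. Only the Java 7 entry (Unicode 6.0) comes from this answer; the other entries are my recollection of the Character class documentation for those releases and should be verified before you rely on them. The lookup reads the standard java.specification.version system property.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical lookup table: Java feature version -> Unicode version.
class UnicodeVersionTable {

    private static final TreeMap<Integer, String> TABLE = new TreeMap<Integer, String>();
    static {
        TABLE.put(5, "4.0"); // assumption: taken from the Java 5 Character docs, verify
        TABLE.put(6, "4.0"); // assumption: taken from the Java 6 Character docs, verify
        TABLE.put(7, "6.0"); // stated in this answer
        TABLE.put(8, "6.2"); // assumption: taken from the Java 8 Character docs, verify
    }

    // Turns "1.7" into 7 and "17" into 17.
    static int featureVersion(String specVersion) {
        String v = specVersion.startsWith("1.") ? specVersion.substring(2) : specVersion;
        int dot = v.indexOf('.');
        if (dot >= 0) {
            v = v.substring(0, dot);
        }
        return Integer.parseInt(v);
    }

    // Returns the Unicode version of the newest table entry at or below the given
    // Java version, i.e. a lower bound for releases newer than the table knows about.
    static String minimumUnicodeVersionFor(String specVersion) {
        Map.Entry<Integer, String> e = TABLE.floorEntry(featureVersion(specVersion));
        return e == null ? "unknown" : e.getValue();
    }

    public static void main(String[] args) {
        String spec = System.getProperty("java.specification.version");
        System.out.println("Java " + spec + " supports at least Unicode "
                + minimumUnicodeVersionFor(spec));
    }
}
```

    The obvious drawback is that the table goes stale with every new Java release, which is why it only ever reports a lower bound.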

    JDK7 contains updates for classes Character, String, and Pattern in support of Unicode 6.0.0. This includes support for Unicode script properties, and several enhancements to Pattern to allow it to meet Level 1 support requirements for Unicode UTS#18 Regular Expressions. These include

    I can certainly see why you want to make sure you’re running a Java with Unicode 6.0.0 support, since that comes with all those other benefits, too.
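
    As a small illustration of the script support mentioned above, here is a sketch that inspects a code point's script on JDK7 or later, both through Character.UnicodeScript and through a script=... regex property. The class name ScriptCheck and its methods are invented for this example; the code point U+10424 and its Script=Deseret value come from the uniprops output earlier.

```java
import java.util.regex.Pattern;

// Hypothetical helper: inspects the Unicode script of a code point (JDK7+ only).
class ScriptCheck {

    // Looks up the script via the Character.UnicodeScript enum added in JDK7.
    static Character.UnicodeScript scriptOf(int cp) {
        return Character.UnicodeScript.of(cp);
    }

    // Checks the script via the regex script property supported by Pattern since JDK7.
    static boolean inScript(int cp, String scriptName) {
        String s = new String(Character.toChars(cp)); // handles supplementary code points
        return Pattern.matches("\\p{script=" + scriptName + "}", s);
    }

    public static void main(String[] args) {
        int deseretEn = 0x10424; // DESERET CAPITAL LETTER EN
        System.out.println(scriptOf(deseretEn));
        System.out.println(inScript(deseretEn, "Deseret"));
    }
}
```

    On a pre-JDK7 runtime this will not even link, because Character.UnicodeScript does not exist there, so the class is itself a crude test for JDK7-level Unicode support.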