java icu icu4j

icu4j BreakIterator returns incorrect word boundaries for Chinese on Linux


My application needs to count the words in a string. I am using the ICU4J library for this, specifically its BreakIterator. The code has to work for English, Chinese, Japanese, and German. Chinese seems to work correctly on Windows but not on Linux, where no word breaks are found. I am new to ICU4J, so the problem may be in my code.

    import com.ibm.icu.text.BreakIterator;
    import java.util.Locale;

    public static int getWordBoundaryCount(String term, Locale locale) {
        if (term == null) {
            throw new IllegalArgumentException("term is null");
        }
        int wordBoundaryCount = 0;
        // The iterator is local to this call, so no synchronization is needed.
        BreakIterator wb = BreakIterator.getWordInstance(locale);
        wb.setText(term);
        int start = wb.first();
        // Counts every segment between two boundaries, including
        // whitespace and punctuation segments.
        for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
            String tt = term.substring(start, end);
            System.out.println(tt);
            wordBoundaryCount++;
        }
        return wordBoundaryCount;
    }

Given the string "丙酮" and a locale created from zh_CN, the above code returns 2 on Windows but 1 on Linux. Indeed, no matter how many Chinese characters are in the string, it returns 1 on Linux. English works fine on both Windows and Linux; only Chinese word boundaries go undetected on Linux. I posted this as a Jira in the ICU project: according to them, Linux is correct, because my test cases were all single Chinese words made up of multiple characters. They didn't address the different behavior on Windows and Linux.


Solution

  • According to the ICU4J site, my test cases were all single words made up of multiple characters, so Linux was behaving correctly. They didn't comment on why the same code behaved differently on Windows. I only need it to work on Linux. If I knew more Chinese speakers, I would have figured this out a long time ago.
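
A side note on the counting loop in the question: it counts every segment between two boundaries, so spaces and punctuation are counted as well as words. Below is a minimal sketch of counting only word-like segments. To keep it runnable without extra dependencies it uses the JDK's java.text.BreakIterator as a stand-in, not ICU4J. The class and method names (WordSegmentCounter, countWordSegments) are made up for illustration. The first()/next()/DONE loop has the same shape as in ICU4J, but the JDK iterator should not be expected to segment Chinese the way ICU4J does, so treat the result for Chinese text as unverified.

```java
import java.text.BreakIterator; // JDK stand-in, NOT com.ibm.icu.text.BreakIterator
import java.util.Locale;

// Hypothetical helper for illustration; not part of ICU4J or the JDK.
class WordSegmentCounter {

    // Counts segments that contain at least one letter or digit,
    // skipping segments made up only of whitespace or punctuation.
    static int countWordSegments(String text, Locale locale) {
        if (text == null) {
            throw new IllegalArgumentException("text is null");
        }
        BreakIterator wb = BreakIterator.getWordInstance(locale);
        wb.setText(text);
        int count = 0;
        int start = wb.first();
        for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
            if (containsLetterOrDigit(text, start, end)) {
                count++;
            }
        }
        return count;
    }

    // Walks the segment by code point so supplementary characters are handled.
    private static boolean containsLetterOrDigit(String text, int start, int end) {
        for (int i = start; i < end; ) {
            int cp = text.codePointAt(i);
            if (Character.isLetterOrDigit(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    public static void main(String[] args) {
        // Segments are "Hello", ",", " ", "world", "!"; only two contain letters.
        System.out.println(countWordSegments("Hello, world!", Locale.ENGLISH)); // prints 2
    }
}
```

With ICU4J itself, the same filtering could probably be done by inspecting the iterator's rule status after each next() call, but I have not tested that here.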