java android word-wrap breakiterator

How does BreakIterator work in Android?


I'm making my own text processor in Android (a custom vertical script TextView for Mongolian). I thought I would have to find all the line breaking locations myself so that I could implement line wrapping, but then I discovered BreakIterator. This seems to find all the possible breaks between characters, words, lines, and sentences in various languages.

I'm trying to learn how to use it. The documentation was more helpful than average, but it was still difficult to understand from reading alone. I also found a few tutorials, but they lacked the full explanation with output that I was looking for.

I am adding this Q&A style answer to help myself learn how to use BreakIterator.

I'm making this an Android tag in addition to Java because there is apparently some difference between them. Also, Android now supports the ICU BreakIterator and future answers may deal with this.


Solution

  • BreakIterator can be used to find the possible breaks between characters, words, lines, and sentences. This is useful for things like moving the cursor through visible characters, double clicking to select words, triple clicking to select sentences, and line wrapping.

    Boilerplate code

    The following code is used in the examples below. Just adjust the first part to change the text and type of BreakIterator.

    // change these two lines for the following examples
    String text = "This is some text.";
    BreakIterator boundary = BreakIterator.getCharacterInstance();
    
    // boilerplate code
    boundary.setText(text);
    int start = boundary.first();
    for (int end = boundary.next(); end != BreakIterator.DONE; end = boundary.next()) {
        System.out.println(start + " " + text.substring(start, end));
        start = end;
    }
    

    If you just want to test this out, you can paste it directly into an Activity's onCreate in Android. I'm using System.out.println rather than Log so that it is also testable in a Java-only environment.

    I'm using the java.text.BreakIterator rather than the ICU one, which is only available from API 24. See the links at the bottom for more information.
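    For testing outside Android, the boilerplate can be wrapped in a small class. This is only a sketch: the class name BreakIteratorDemo and the helper segments are made up for illustration, and the helper collects the pieces into a list instead of printing them.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper class; not part of the original answer.
class BreakIteratorDemo {

    // Splits text at every boundary reported by the given iterator.
    static List<String> segments(BreakIterator boundary, String text) {
        List<String> result = new ArrayList<>();
        boundary.setText(text);
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE; end = boundary.next()) {
            result.add(text.substring(start, end));
            start = end;
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "This is some text.";
        // Swap in getCharacterInstance(), getLineInstance(), or
        // getSentenceInstance() to try the other iterator types.
        for (String segment : segments(BreakIterator.getWordInstance(), text)) {
            System.out.println("[" + segment + "]");
        }
    }
}
```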

    Characters

    Change the boilerplate code to include the following:

    String text = "Hi 中文éé\uD83D\uDE00\uD83C\uDDEE\uD83C\uDDF3.";
    BreakIterator boundary = BreakIterator.getCharacterInstance();
    

    Output

    0 H
    1 i
    2  
    3 中
    4 文
    5 é
    6 é
    8 😀
    10 🇮🇳
    14 .
    

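    The most interesting parts are at indexes 6, 8, and 10: an e followed by a combining accent (two UTF-16 code units), an emoji stored as a surrogate pair (two code units), and a flag built from two regional indicator symbols (four code units). Your browser may or may not display the characters correctly, but a user would interpret each of these as a single character even though it is made up of multiple UTF-16 code units.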
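    Character boundaries are what you need for moving a cursor through visible characters. The following is a minimal sketch of that idea; the class GraphemeCursor and its method names are hypothetical, and how emoji and flags are grouped can differ between Java runtimes and Android versions, so the demo sticks to a combining accent.

```java
import java.text.BreakIterator;

// Hypothetical helper class for illustration only.
class GraphemeCursor {

    // Counts user-perceived characters rather than UTF-16 code units.
    static int count(String text) {
        BreakIterator boundary = BreakIterator.getCharacterInstance();
        boundary.setText(text);
        int count = 0;
        while (boundary.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    // Returns the cursor position one visible character to the left of
    // offset, or 0 if the cursor is already at the start of the text.
    static int cursorLeft(String text, int offset) {
        BreakIterator boundary = BreakIterator.getCharacterInstance();
        boundary.setText(text);
        int previous = boundary.preceding(offset);
        return previous == BreakIterator.DONE ? 0 : previous;
    }

    public static void main(String[] args) {
        String text = "ae\u0301z"; // 'a', 'e' + combining acute accent, 'z'
        System.out.println("code units: " + text.length());
        System.out.println("visible characters: " + count(text));
        System.out.println("left of index 3: " + cursorLeft(text, 3));
    }
}
```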

    Words

    Change the boilerplate code to include the following:

    String text = "I like to eat apples. 我喜欢吃苹果。";
    BreakIterator boundary = BreakIterator.getWordInstance();
    

    Output

    0 I
    1  
    2 like
    6  
    7 to
    9  
    10 eat
    13  
    14 apples
    20 .
    21  
    22 我
    23 喜欢
    25 吃
    26 苹果
    28 。
    

    There are a few interesting things to note here. First, a word break is detected at both sides of a space. Second, even though there are different languages, multi-character Chinese words were still recognized. This was still true in my tests even when I set the locale to Locale.US.
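    Word boundaries are the basis for double-click selection. As a rough sketch (WordSelector and wordAt are names invented here, not an Android API), following() finds the boundary after the tapped index and previous() steps back to the boundary before it:

```java
import java.text.BreakIterator;

// Hypothetical helper class for illustration only.
class WordSelector {

    // Returns the segment between the two word boundaries that surround
    // offset. The segment may be a space or punctuation mark, because the
    // word iterator reports those as separate pieces.
    static String wordAt(String text, int offset) {
        if (offset < 0 || offset >= text.length()) {
            return "";
        }
        BreakIterator boundary = BreakIterator.getWordInstance();
        boundary.setText(text);
        int end = boundary.following(offset); // first boundary after offset
        int start = boundary.previous();      // boundary just before end
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "I like to eat apples.";
        System.out.println(wordAt(text, 3));
        System.out.println(wordAt(text, 14));
    }
}
```

    A real selection handler would probably ignore taps that land on whitespace, which this sketch does not do.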

    Lines

    Keep the text the same as in the Words example, but use a line instance:

    String text = "I like to eat apples. 我喜欢吃苹果。";
    BreakIterator boundary = BreakIterator.getLineInstance();
    

    Output

    0 I 
    2 like 
    7 to 
    10 eat 
    14 apples. 
    22 我
    23 喜
    24 欢
    25 吃
    26 苹
    27 果。
    

    Note that the break locations are not whole lines of text. They are just convenient places to line wrap text.

    The output is similar to the Words example. However, now white space and punctuation are included with the word before them. This makes sense because you wouldn't want a new line to start with white space or punctuation. Also note that Chinese characters get a line break opportunity after every character. This is consistent with the fact that it is OK to break multi-character words across lines in Chinese.
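    To turn these break opportunities into actual wrapped lines, one simple approach is greedy filling: keep adding segments until the next one would overflow, then start a new line at the last break. The sketch below assumes a fixed limit in UTF-16 code units as a stand-in for measured width (a real view would measure the text instead), and the names LineWrapper and wrap are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration only.
class LineWrapper {

    // Greedily wraps text so that each line has at most maxChars code
    // units, breaking only where the line iterator allows it. A single
    // segment longer than maxChars is left unbroken on its own line.
    // Trailing spaces stay on the line and count toward its length.
    static List<String> wrap(String text, int maxChars) {
        BreakIterator boundary = BreakIterator.getLineInstance();
        boundary.setText(text);
        List<String> lines = new ArrayList<>();
        int lineStart = boundary.first();
        int lastBreak = lineStart;
        for (int end = boundary.next(); end != BreakIterator.DONE; end = boundary.next()) {
            // If extending the line to 'end' would overflow, cut at the
            // previous break opportunity instead.
            if (end - lineStart > maxChars && lastBreak > lineStart) {
                lines.add(text.substring(lineStart, lastBreak));
                lineStart = lastBreak;
            }
            lastBreak = end;
        }
        if (lineStart < text.length()) {
            lines.add(text.substring(lineStart));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : wrap("I like to eat apples.", 12)) {
            System.out.println("[" + line + "]");
        }
    }
}
```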

    Sentences

    Change the boilerplate code to include the following:

    String text = "I like to eat apples. My email is me@example.com.\n" +
            "This is a new paragraph. 我喜欢吃苹果。我不爱吃臭豆腐。";
    BreakIterator boundary = BreakIterator.getSentenceInstance();
    

    Output

    0 I like to eat apples. 
    22 My email is me@example.com.
    50 This is a new paragraph. 
    75 我喜欢吃苹果。
    82 我不爱吃臭豆腐。
    

    Correct sentence breaks were recognized in multiple languages. Also, there was no false positive for the dot in the email domain.
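    If you want the sentences themselves rather than printed output, the same loop can collect them into a list. This is a sketch with invented names (SentenceSplitter, split); it trims the trailing white space that the sentence iterator leaves attached to each sentence.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration only.
class SentenceSplitter {

    // Returns each sentence with surrounding white space removed.
    static List<String> split(String text) {
        BreakIterator boundary = BreakIterator.getSentenceInstance();
        boundary.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE; end = boundary.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
            start = end;
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String sentence : split("Hello there. How are you? Fine.")) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```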

    Notes

    You can set the Locale when you create a BreakIterator by passing it to the factory method, for example BreakIterator.getWordInstance(Locale.US). If you don't, it uses the default locale.
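    As a small illustration of passing an explicit Locale, the sketch below extracts only the real words from a string and skips the space and punctuation segments. The class LocaleWords and the method words are hypothetical names, and checking only the first code point of each segment is a simplifying assumption.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class for illustration only.
class LocaleWords {

    // Returns the word segments for the given locale, keeping only
    // segments that start with a letter or digit.
    static List<String> words(String text, Locale locale) {
        BreakIterator boundary = BreakIterator.getWordInstance(locale);
        boundary.setText(text);
        List<String> result = new ArrayList<>();
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE; end = boundary.next()) {
            if (Character.isLetterOrDigit(text.codePointAt(start))) {
                result.add(text.substring(start, end));
            }
            start = end;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("I like to eat apples.", Locale.US));
    }
}
```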

    Further reading