c++ boost nlp icu boost-locale

Is it possible to get boost locale boundary analysis to split on apostrophes?


For example, consider the following code:

#include <boost/locale.hpp>
#include <iostream>
#include <string>

int main() {
    using namespace boost::locale::boundary;
    boost::locale::generator gen;
    std::string text = "L'homme qu'on aimait trop.";
    // word-level boundary analysis under the French locale
    ssegment_index map(word, text.begin(), text.end(), gen("fr_FR.UTF-8"));
    for (ssegment_index::iterator it = map.begin(), e = map.end(); it != e; ++it)
        std::cout << "\"" << *it << "\", ";
    std::cout << std::endl;
}

This outputs:

"L'homme", " ", "qu'on", " ", "aimait", " ", "trop", ".",

Is it possible to customize boundary analysis so it instead outputs:

"L", "'", "homme", " ", "qu", "'", "on", " ", "aimait", " ", "trop", ".",

I've read http://www.boost.org/doc/libs/1_56_0/libs/locale/doc/html/boundary_analysys.html and searched Stack Overflow and Google, but so far haven't found anything.


Solution

  • I haven't found a way to do this with boost::locale::boundary, but it is possible to do it with ICU directly by creating a customized RuleBasedBreakIterator, rather than using one provided by createWordInstance.

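    #include <unicode/brkiter.h>
    #include <unicode/locid.h>
    #include <unicode/rbbi.h>
    #include <unicode/unistr.h>
    #include <iostream>
    #include <memory>
    #include <string>

    using namespace icu;  // recent ICU releases no longer import this namespace by default

    int main() {
        Locale locale("fr_FR");
        UErrorCode statusError = U_ZERO_ERROR;
        UParseError parseError = { 0 };

        // get rules from a default rbbi (these should be in a word.txt file somewhere)
        std::unique_ptr<BreakIterator> default_bi(BreakIterator::createWordInstance(locale, statusError));
        RuleBasedBreakIterator *default_rbbi = dynamic_cast<RuleBasedBreakIterator *>(default_bi.get());
        if (U_FAILURE(statusError) || default_rbbi == nullptr) {
            std::cerr << "could not create word break iterator: " << u_errorName(statusError) << std::endl;
            return 1;
        }
        UnicodeString rules = default_rbbi->getRules();

        // create custom rbbi with updated rules: take the apostrophe characters out of MidNumLet.
        // The search string must match the rule text of the installed ICU version exactly; where
        // U+0027 is classified as Single_Quote instead, that set probably needs the same change.
        rules.findAndReplace("[\\p{Word_Break = MidNumLet}]", "[[\\p{Word_Break = MidNumLet}] - [\\u0027 \\u2018 \\u2019 \\uff07]]");
        RuleBasedBreakIterator custom_rbbi(rules, parseError, statusError);
        if (U_FAILURE(statusError)) {
            std::cerr << "could not compile custom rules: " << u_errorName(statusError) << std::endl;
            return 1;
        }

        // tokenize text
        UnicodeString text = "L'homme qu'on aimait trop.";
        custom_rbbi.setText(text);
        int32_t e, p = custom_rbbi.first();
        while ((e = custom_rbbi.next()) != BreakIterator::DONE) {
            std::string substring;
            text.tempSubStringBetween(p, e).toUTF8String(substring);
            std::cout << "\"" << substring << "\", ";
            p = e;
        }
        std::cout << std::endl;
    }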
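
If changing the ICU rules is not an option, a cruder workaround is to keep the boost::locale word segments from the question and split each one again around apostrophes. The following is only a minimal sketch under stated assumptions: `split_on_apostrophes` is a hypothetical helper (it is not part of Boost.Locale or ICU), the input is assumed to be UTF-8, and only U+0027 and U+2019 are treated as apostrophes. It also splits every apostrophe unconditionally, so it does not reproduce the context-sensitive behaviour of real break rules.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits one UTF-8 token around apostrophes and keeps
// each apostrophe as its own token. The set of apostrophes handled here
// (U+0027 and U+2019) is an assumption; extend it as needed.
std::vector<std::string> split_on_apostrophes(const std::string& token) {
    static const std::string kAscii = "'";             // U+0027
    static const std::string kCurly = "\xE2\x80\x99";  // U+2019 encoded as UTF-8
    std::vector<std::string> parts;
    std::string current;
    std::size_t i = 0;
    while (i < token.size()) {
        // Length in bytes of the apostrophe starting at i, or 0 if there is none.
        std::size_t len = 0;
        if (token.compare(i, kAscii.size(), kAscii) == 0) {
            len = kAscii.size();
        } else if (token.compare(i, kCurly.size(), kCurly) == 0) {
            len = kCurly.size();
        }
        if (len == 0) {
            current += token[i];  // ordinary byte: extend the current piece
            ++i;
            continue;
        }
        if (!current.empty()) {   // flush the text before the apostrophe
            parts.push_back(current);
            current.clear();
        }
        parts.push_back(token.substr(i, len));  // the apostrophe itself
        i += len;
    }
    if (!current.empty()) {
        parts.push_back(current);
    }
    return parts;
}
```

In the loop from the question, each segment would be converted to a `std::string` (for example with `std::string(it->begin(), it->end())`) and passed to this helper, printing every returned piece instead of the segment itself.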