c++ c++11 boost boost-locale

Boost.Locale normalize strips whole characters instead of just the accents


I am trying to strip accents from a string using the Boost.Locale library.

The normalize function removes the entire accented character; I only want to remove the accent.

è -> e for example

Here is my code

std::string hello(u8"élève");
boost::locale::generator gen;
std::string str = boost::locale::normalize(hello,boost::locale::norm_nfd,gen(""));

Desired output: eleve

My output: lve

Help please


Solution

  • That's not what normalize does. With norm_nfd it performs "canonical decomposition": each accented character is split into a base letter followed by combining code points. You THEN need to remove the combining code points yourself.

    UPDATE: Adding a loose implementation, based on the observation from the UTF-8 tables that most combining characters appear to start with the lead byte 0xcc or 0xcd:

    // Caveat: the lead byte 0xcd also covers U+0370..U+037F, so some Greek
    // characters are liable to be stripped as well.
    template <typename Str>
    static Str try_strip_diacritics(
        Str const& input,
        std::locale const& loc = std::locale())
    {
        using Ch = typename Str::value_type;
        using T = boost::locale::utf::utf_traits<Ch>;
    
        auto tmp = boost::locale::normalize(
                    input, boost::locale::norm_nfd, loc);
    
        auto f = tmp.begin(), l = tmp.end(), out = f;
    
        while (f!=l) {
            switch(*f) {
                case '\xcc':
                case '\xcd': // TODO find more
                    T::decode(f, l);
                    break; // skip
                default:
                    out = T::encode(T::decode(f, l), out);
                    break;
            }
        }
        tmp.erase(out, l);
        return tmp;
    }
    

    Prints (on my box!):

    Before: "élève"  0xc3 0xa9 0x6c 0xc3 0xa8 0x76 0x65
    all-in-one: "eleve"  0x65 0x6c 0x65 0x76 0x65
    

    Older answer text/analysis:

    #include <boost/locale.hpp>
    #include <cstdint>   // uint8_t
    #include <iomanip>
    #include <iostream>
    #include <string>
    
    static void dump(std::string const& s) {
        std::cout << std::hex << std::showbase << std::setfill('0');
        for (uint8_t ch : s)
            std::cout << " " << std::setw(4) << int(ch);
        std::cout << std::endl;
    }
    
    int main() {
        boost::locale::generator gen;
    
        std::string const pupil(u8"élève");
    
        std::string const str =
            boost::locale::normalize(
                pupil,
                boost::locale::norm_nfd,
                gen(""));
    
        std::cout << "Before: "; dump(pupil);
        std::cout << "After:  "; dump(str);
    }
    

    Prints, on my box:

    Before:  0xc3 0xa9 0x6c 0xc3 0xa8 0x76 0x65
    After:   0x65 0xcc 0x81 0x6c 0x65 0xcc 0x80 0x76 0x65
    

    However, on Coliru it makes no difference. This indicates that it depends on the available/system locales.

    The docs say: https://www.boost.org/doc/libs/1_72_0/libs/locale/doc/html/conversions.html#conversions_normalization

    Unicode normalization is the process of converting strings to a standard form, suitable for text processing and comparison. For example, character "ü" can be represented by a single code point or a combination of the character "u" and the diaeresis "¨". Normalization is an important part of Unicode text processing.

    Unicode defines four normalization forms. Each specific form is selected by a flag passed to normalize function:

    • NFD - Canonical decomposition - boost::locale::norm_nfd
    • NFC - Canonical decomposition followed by canonical composition - boost::locale::norm_nfc or boost::locale::norm_default
    • NFKD - Compatibility decomposition - boost::locale::norm_nfkd
    • NFKC - Compatibility decomposition followed by canonical composition - boost::locale::norm_nfkc

    For more details on normalization forms, read [this article][1].

    What you could do

    It seems that you MIGHT get some of the way by doing only the decomposition (NFD) and then removing any code points that aren't alphabetic.

    This is cheating, because it assumes all code points are single-unit, which is not true in general, but for the sample it does work.

    See the improved version above, which iterates over code points instead of bytes.