c++ boost boost-locale

Confused about the expected behavior of boost::locale regarding the capitalization of "ß"


I am trying to use the boost::locale library (version 1.71) to convert strings to uppercase and lowercase in my code.

I have an issue with the capitalization of "ß". To stay compatible with existing unit tests in my codebase, I want "ß" to be capitalized to "SS". This should not be a problem, since as far as I understand it is the documented behavior (https://www.boost.org/doc/libs/1_71_0/libs/locale/doc/html/conversions.html).

Here is a copy of the example provided on this page for reference:

    Upper   GRÜSSEN
    Lower   grüßen
    Title   Grüßen
    Fold    grüssen

However, this is not what happens in my code: "ß" stays "ß" when I apply the uppercase conversion.

I was confused and found the following example in the boost::locale library source:

//
//  Copyright (c) 2009-2011 Artyom Beilis (Tonkikh)
//
//  Distributed under the Boost Software License, Version 1.0. (See
//  accompanying file LICENSE_1_0.txt or copy at
//  http://www.boost.org/LICENSE_1_0.txt)
//
#include <boost/locale.hpp>
#include <boost/algorithm/string/case_conv.hpp>
#include <iostream>

#include <ctime>



int main()
{
    using namespace boost::locale;
    using namespace std;
    // Create system default locale
    generator gen;
    locale loc=gen(""); 
    locale::global(loc); 
    cout.imbue(loc);

    
    cout<<"Correct case conversion can't be done by simple, character by character conversion"<<endl;
    cout<<"because case conversion is context sensitive and not 1-to-1 conversion"<<endl;
    cout<<"For example:"<<endl;
    cout<<"   German grüßen correctly converted to "<<to_upper("grüßen")<<", instead of incorrect "
                    <<boost::to_upper_copy(std::string("grüßen"))<<endl;
    cout<<"     where ß is replaced with SS"<<endl;
    cout<<"   Greek ὈΔΥΣΣΕΎΣ is correctly converted to "<<to_lower("ὈΔΥΣΣΕΎΣ")<<", instead of incorrect "
                    <<boost::to_lower_copy(std::string("ὈΔΥΣΣΕΎΣ"))<<endl;
    cout<<"     where Σ is converted to σ or to ς, according to position in the word"<<endl;
    cout<<"Such type of conversion just can't be done using std::toupper that work on character base, also std::toupper is "<<endl;
    cout<<"not even applicable when working with variable character length like in UTF-8 or UTF-16 limiting the correct "<<endl;
    cout<<"behavior to unicode subset BMP or ASCII only"<<endl;
   
}

// vim: tabstop=4 expandtab shiftwidth=4 softtabstop=4

// boostinspect:noascii

I compiled and ran it, and this is the output I get:

Correct case conversion can't be done by simple, character by character conversion
because case conversion is context sensitive and not 1-to-1 conversion
For example:
   German grüßen correctly converted to GRÜßEN, instead of incorrect GRüßEN
     where ß is replaced with SS
   Greek ὈΔΥΣΣΕΎΣ is correctly converted to ὀδυσσεύσ, instead of incorrect ὈΔΥΣΣΕΎΣ
     where Σ is converted to σ or to ς, according to position in the word
Such type of conversion just can't be done using std::toupper that work on character base, also std::toupper is 
not even applicable when working with variable character length like in UTF-8 or UTF-16 limiting the correct 
behavior to unicode subset BMP or ASCII only

Emphasis on the part:

   German grüßen correctly converted to GRÜßEN, instead of incorrect GRüßEN
     where ß is replaced with SS

I really don't get what is going on in this sentence. What is the actual expected behavior?
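For reference, the "incorrect" half of that sentence (GRüßEN) is what a byte-by-byte conversion produces. Here is a minimal standard-library sketch of that behavior, assuming UTF-8 input and the default "C" locale; the function name `upper_per_byte` is mine and not part of Boost:

```cpp
#include <cctype>
#include <string>

// Uppercase each byte on its own, the way a character-by-character
// conversion does. In the default "C" locale std::toupper only maps
// a-z, so the bytes of multi-byte UTF-8 sequences such as "ü" and "ß"
// pass through unchanged.
std::string upper_per_byte(const std::string& utf8)
{
    std::string out = utf8;
    for (char& c : out) {
        // Cast to unsigned char first: passing a negative char value
        // to std::toupper is undefined behavior.
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    return out;
}
```

Calling `upper_per_byte` on the UTF-8 bytes of "grüßen" gives "GRüßEN". A one-to-one mapping like this can never produce "SS" from "ß", because that requires replacing one character with two.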


Solution

  • The documentation has been updated in the following commit: https://github.com/Flamefire/locale/commit/bae1f380ad0719121dfe048c56119bf72e074144

    It now reads:

    German grüßen would be incorrectly converted to GRÜßEN, while Boost.Locale converts it to GRÜSSEN where ß is replaced with SS.

    So the expected behavior is indeed to capitalize "ß" as "SS".

    My guess is that my code behaved differently because my Boost.Locale build did not include the ICU backend. I no longer have access to the original code, so I can't confirm this theory.