
How to capitalize Polish special letters in C++?


I've got a string I want to capitalize, but it might contain Polish special letters (ą, ć, ę, ł, ń, ó, ś, ż, ź). The call transform(string.begin(), string.end(), string.begin(), ::toupper); only capitalizes the basic Latin alphabet, so I wrote a function like this:


    string to_upper(string nazwa)
    {
        transform(nazwa.begin(), nazwa.end(), nazwa.begin(), ::toupper);

        for (int i = 0; i < (int)nazwa.size(); i++)
        {
            switch(nazwa[i])
            {
                case u'ą':
                {
                    nazwa[i] = u'Ą';
                    break;
                }
                case u'ć':
                {
                    nazwa[i] = u'Ć';
                    break;
                }
                case u'ę':
                {
                    nazwa[i] = u'Ę';
                    break;
                }
                case u'ó':
                {
                    nazwa[i] = u'Ó';
                    break;
                }
                case u'ł':
                {
                    nazwa[i] = u'Ł';
                    break;
                }
                case u'ń':
                {
                    nazwa[i] = u'Ń';
                    break;
                }
                case u'ś':
                {
                    nazwa[i] = u'Ś';
                    break;
                }
                case u'ż':
                {
                    nazwa[i] = u'Ż';
                    break;
                }
                case u'ź':
                {
                    nazwa[i] = u'Ź';
                    break;
                }
            }
        }

        return nazwa;
    }

I also tried using if instead of switch, but it doesn't change anything. For every capital letter being assigned except u'Ó', Qt Creator shows a warning similar to this one: Implicit conversion from 'char16_t' to 'std::basic_string<char>::value_type' (aka 'char') changes value from 260 to 4 (this one is for u'Ą'). When I run the program, the characters in the string aren't swapped.


Solution

  • The source of your issue

    std::string stores characters as chars, which are one byte long, so each one can only hold 256 distinct values (0 to 255 when read as unsigned).

    This makes it impossible to store u'ą' in a single char, for example, because the Unicode code point of ą is 0x105 (= 261 in decimal, which is higher than 255).

    To get around this problem, humans invented UTF-8, a character encoding standard that lets you encode any Unicode character as a sequence of bytes. Characters with a higher code point will of course take more bytes to encode.

    It is very likely that your std::string has its characters encoded in UTF-8. (I say very likely because your code doesn't directly indicate it, but it is almost certainly the case, since UTF-8 is by far the most common way to encode accented letters in char-based strings. To be completely sure, you'd need to check how the string gets its content, for example in Qt's code, since that seems to be what you are using.)

    The result is that you can't just use a for loop to iterate through the chars of your std::string the way you do, because that assumes one char equals one letter, which is simply not the case.

    For example, ą is encoded as the bytes C4 85, so you will have one char with the value 0xC4 (= 196) followed by another char with the value 0x85 (= 133).
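
    To make the byte-versus-letter difference concrete, here is a minimal sketch. The names count_code_points and a_ogonek are made up for this illustration, and the helper assumes the string holds valid UTF-8:

    ```cpp
    #include <cstddef>
    #include <string>

    // Counts Unicode code points in a UTF-8 string by skipping continuation
    // bytes (those of the form 10xxxxxx). Assumes the input is valid UTF-8.
    std::size_t count_code_points(const std::string& utf8)
    {
        std::size_t count = 0;
        for (unsigned char byte : utf8)
            if ((byte & 0xC0) != 0x80)
                ++count;
        return count;
    }

    // "ą" written out as its two UTF-8 bytes, so that the example does not
    // depend on how this source file itself is encoded.
    const std::string a_ogonek = "\xC4\x85";
    ```

    With these definitions, a_ogonek.size() is 2 (bytes) while count_code_points(a_ogonek) is 1 (letter), which is exactly why indexing the string char by char cannot find ą.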


    The specific case for the characters you want to capitalize

    The Latin Extended-A part of the Unicode table (archive) fortunately shows us that these special capital letters come right before their lowercase counterparts.

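    More than that, we can see that:

      • from 0x100 to 0x137 and from 0x14A to 0x177, lowercase code points are odd numbers;
      • from 0x139 to 0x148 and from 0x179 to 0x17E, lowercase code points are even numbers.

    This makes it easy to convert lowercase code points to uppercase ones: all we have to do is check whether a character's code point corresponds to a lowercase letter and, if so, subtract one from it to make it uppercase.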
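
    As a rough sketch of that rule on its own, working on already-decoded code points: the function name to_upper_latin_ext_a is hypothetical, and the range boundaries are the ones used in the to_upper function further down in this answer.

    ```cpp
    #include <cstdint>

    // Maps a lowercase Latin Extended-A code point to the uppercase code point
    // right before it. Any other value is returned unchanged.
    std::uint32_t to_upper_latin_ext_a(std::uint32_t cp)
    {
        // Ranges where the lowercase letter of each pair has an odd code point
        bool lowercase_is_odd  = (cp >= 0x100 && cp <= 0x137) || (cp >= 0x14A && cp <= 0x177);
        // Ranges where the lowercase letter of each pair has an even code point
        bool lowercase_is_even = (cp >= 0x139 && cp <= 0x148) || (cp >= 0x179 && cp <= 0x17E);
        bool is_odd = (cp % 2) == 1;

        if ((lowercase_is_odd && is_odd) || (lowercase_is_even && !is_odd))
            return cp - 1;
        return cp;
    }
    ```

    For example, ą (0x105) becomes Ą (0x104) and ł (0x142) becomes Ł (0x141), while Ą itself is left alone. Note that ó (0xF3) is not in this block, so this sketch does not touch it.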


    Encoding one of those characters in UTF-8

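    To encode these in UTF-8 (source), the code point is written as 11 bits and split across two bytes:

      • the first byte is 110xxxxx, where xxxxx are the 5 highest bits;
      • the second byte is 10xxxxxx, where xxxxxx are the 6 lowest bits.

    So for ą, the value is 0x105 in hex, which is 00100000101 in binary (padded to 11 bits).

    The first byte is then 11000100 (= 0xC4).

    The second byte is then 10000101 (= 0x85).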

    Note that this encoding 'technique' works because the characters you want to capitalize have their value (code point) between 0x80 and 0x7FF. The scheme changes depending on how high the value is; see the documentation here.
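
    The two-byte case can be sketched in a few lines. The helper names encode_two_byte_utf8 and decode_two_byte_utf8 are made up for this example, and they only handle code points from 0x80 to 0x7FF:

    ```cpp
    #include <cstdint>
    #include <string>

    // Encodes a code point in the range 0x80..0x7FF as two UTF-8 bytes.
    // Returns an empty string for anything outside that range.
    std::string encode_two_byte_utf8(std::uint32_t cp)
    {
        if (cp < 0x80 || cp > 0x7FF)
            return std::string();

        std::string out;
        out += static_cast<char>(0xC0 | (cp >> 6));   // 110xxxxx: the 5 highest bits
        out += static_cast<char>(0x80 | (cp & 0x3F)); // 10xxxxxx: the 6 lowest bits
        return out;
    }

    // Decodes a two-byte UTF-8 sequence (lead byte 110xxxxx, continuation
    // byte 10xxxxxx) back into its code point. Does not validate its input.
    std::uint32_t decode_two_byte_utf8(unsigned char lead, unsigned char cont)
    {
        return (static_cast<std::uint32_t>(lead & 0x1F) << 6) | (cont & 0x3F);
    }
    ```

    Encoding 0x105 this way gives the bytes C4 85, and decoding C4 85 gives 0x105 back, matching the manual calculation for ą.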


    Fixing your code

    I have rewritten your to_upper function according to what I have written so far:

    string to_upper(string nazwa)
    {
        for (size_t i = 0; i < nazwa.size(); i++)
        {
            // Getting the current byte we are working with. It is read as unsigned char
            // because plain char may be signed, which breaks toupper and bit shifts for bytes >= 0x80
            unsigned char chr1 = nazwa[i];

            // Plain ASCII: do the std toupper here for regular characters
            if (chr1 < 0x80) {
                nazwa[i] = toupper(chr1);
                continue;
            }

            // We want to find UTF-8-encoded Polish letters here.
            // They are all two bytes long: the first byte has the binary form 110xxxxx
            // and the second one 10xxxxxx. Anything else (longer sequences, truncated
            // or malformed input) is left untouched
            unsigned char chr2 = (i + 1 < nazwa.size()) ? nazwa[i + 1] : 0;
            if ((chr1 >> 5) != 0b110 || (chr2 >> 6) != 0b10)
                continue;

            // If we are here, then the character we are dealing with is two bytes long.
            // We won't need to check that second byte during the next iteration, so we increment i
            i++;

            // Get the Unicode value of the encoded character
            uint16_t fullChr = ((chr1 & 0b11111) << 6) | (chr2 & 0b111111);

            // Get the various conditions to check for lowercase code points
            bool lowercaseIsOdd =  (fullChr >= 0x100 && fullChr <= 0x137) || (fullChr >= 0x14A && fullChr <= 0x177);
            bool lowercaseIsEven = (fullChr >= 0x139 && fullChr <= 0x148) || (fullChr >= 0x179 && fullChr <= 0x17E);
            bool chrIndexIsOdd =   (fullChr % 2) == 1;

            // Depending on whether the code point needs to be odd or even to be lowercase, and on whether
            // it actually is odd or even, decrease it by one to make it uppercase
            if((lowercaseIsOdd && chrIndexIsOdd)
            || (lowercaseIsEven && !chrIndexIsOdd))
                fullChr--;

            // Support for some additional, more commonly used accented letters
            if(fullChr >= 0xE0 && fullChr <= 0xF6)
                fullChr -= 0x20;

            // Re-encode the code point in UTF-8
            // We incremented i earlier, so subtract one to edit the first byte of the letter we're encoding
            nazwa[i-1] = static_cast<char>((0b110 << 5) | ((fullChr >> 6) & 0b11111));
            nazwa[i] = static_cast<char>((0b10 << 6) | (fullChr & 0b111111));
        }

        return nazwa;
    }
    

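    Note: don't forget to #include <cstdint> for uint16_t to work (plus <string> and <cctype> for std::string and toupper). The 0b binary literals require C++14 or later.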

    Note 2: I have added support for some Latin 1 Supplement (archive) letters because you asked for it in comments. Although we subtract 0x20 from lowercase code points to get the uppercase ones, it is pretty much the same principle as for other letters I have covered in this answer.

    I have included lots of comments in my code, please consider reading them for a better understanding.

    I have tested it with the string "ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž" and it converted it to "ĀĀĂĂĄĄĆĆĈĈĊĊČČĎĎĐĐĒĒĔĔĖĖĘĘĚĚĜĜĞĞĠĠĢĢĤĤĦĦĨĨĪĪĬĬĮĮİİIJIJĴĴĶĶĸĹĹĻĻĽĽĿĿŁŁŃŃŅŅŇŇŊŊŌŌŎŎŐŐŒŒŔŔŖŖŘŘŚŚŜŜŞŞŠŠŢŢŤŤŦŦŨŨŪŪŬŬŮŮŰŰŲŲŴŴŶŶŸŹŹŻŻŽŽ", so it works perfectly:

    int main() {
        string str1 = "ĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž";
        string str2 = to_upper(str1);
    
        printf("str1: %s\n", str1.c_str());
        printf("str2: %s\n", str2.c_str());
    }
    

    [Image: a CMD window printing the results of the above code]

    Note: Most terminals use UTF-8 by default, Qt labels handle it as well; basically everything uses UTF-8 except the Windows console. So if you are testing the above code in Windows CMD or PowerShell, you need to switch them to UTF-8 with the command chcp 65001, or by adding a Windows API call (SetConsoleOutputCP) that changes the console encoding when your program starts.
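
    For the API route, a minimal sketch could look like this. The wrapper name enable_utf8_console_output is made up for the example; SetConsoleOutputCP and CP_UTF8 come from <windows.h>, and the #ifdef is only there so the same file still compiles on other systems:

    ```cpp
    #ifdef _WIN32
    #include <windows.h>
    #endif

    // Switches the Windows console output code page to UTF-8 (65001),
    // like running "chcp 65001" before starting the program.
    // On other platforms this does nothing and reports success.
    bool enable_utf8_console_output()
    {
    #ifdef _WIN32
        return SetConsoleOutputCP(CP_UTF8) != 0;
    #else
        return true;
    #endif
    }
    ```

    You would call it once at the start of main, before printing anything.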

    Note 2: When you write string literals directly in your code, GCC and Clang encode them in UTF-8 by default (with MSVC you may need the /utf-8 compiler option). This is why my version of the to_upper function works with Polish letters written directly in code without further modifications. When I say that pretty much everything uses UTF-8, I mean it.

    Note 3: I kept it to avoid causing problems with your current code, but you use string instead of std::string, implying that you have a using namespace std; somewhere in your code. In which case, please see Why is "using namespace std;" considered bad practice?


    Note about the other answers

    Please keep in mind that my answer is very specific to your case. It aims to, as you asked for, capitalize Polish letters.

    Other answers rely on std features which are apparently more universal and work with all languages, so I'd invite you to give them a look.

    It's always better to rely on existing features rather than reinventing the wheel, but I think it's also good to have a self-made alternative that might be easier to understand and sometimes is more efficient.