I've got a string I want to capitalize, but it might contain Polish special letters (ą, ć, ę, ł, ń, ó, ś, ż, ź). The call transform(string.begin(), string.end(), string.begin(), ::toupper);
only capitalizes the Latin alphabet, so I wrote a function like this:
string to_upper(string nazwa)
{
transform(nazwa.begin(), nazwa.end(), nazwa.begin(), ::toupper);
for (int i = 0; i < (int)nazwa.size(); i++)
{
switch(nazwa[i])
{
case u'ą':
{
nazwa[i] = u'Ą';
break;
}
case u'ć':
{
nazwa[i] = u'Ć';
break;
}
case u'ę':
{
nazwa[i] = u'Ę';
break;
}
case u'ó':
{
nazwa[i] = u'Ó';
break;
}
case u'ł':
{
nazwa[i] = u'Ł';
break;
}
case u'ń':
{
nazwa[i] = u'Ń';
break;
}
case u'ś':
{
nazwa[i] = u'Ś';
break;
}
case u'ż':
{
nazwa[i] = u'Ż';
break;
}
case u'ź':
{
nazwa[i] = u'Ź';
break;
}
}
}
return nazwa;
}
I also tried using if
instead of switch
but it doesn't change anything.
Qt Creator shows a similar error next to every capital letter to be inserted, apart from u'Ó': Implicit conversion from 'char16_t' to 'std::basic_string<char>::value_type' (aka 'char') changes value from 260 to 4
(this one is from u'Ą'). After running the program, the chars in the string aren't swapped.
std::string
stores characters as char
s, which are one byte long, so each one can only hold 256 distinct values (0 to 255 when treated as unsigned).
This makes it impossible to store u'ą'
in one char
for example, as the Unicode value for ą
is 0x105
(= 261 in decimal, which is higher than 255).
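As a rough illustration of what the Qt Creator message describes, here is a minimal sketch, assuming an 8-bit char: squeezing the code point into one byte keeps only its low 8 bits. The helper name low_byte is made up for this example.

```cpp
#include <cstdint>

// Hypothetical helper: keeps only the low 8 bits of a code point,
// which is effectively what storing it in a one-byte char does.
constexpr unsigned char low_byte(std::uint16_t code_point) {
    return static_cast<unsigned char>(code_point & 0xFF);
}

// Compile-time checks matching the numbers quoted in the question.
static_assert(low_byte(0x104) == 4, "U+0104 (260) is truncated to 4");
static_assert(low_byte(0x105) == 5, "U+0105 (261) is truncated to 5");
```

The first check reproduces the "changes value from 260 to 4" part of the message.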
To avoid this problem, humans have invented UTF-8
, which is a character encoding standard that lets you encode any Unicode character as a sequence of bytes. Characters with a higher code point take more bytes to encode.
It is very likely that your std::string
has its characters encoded in UTF-8. (I say very likely because your code doesn't directly indicate it, but it is almost certainly the case, because UTF-8 is the only universal way to encode accented letters in char
-based strings. To be absolutely sure, you'd need to check Qt's code, since that seems to be what you are using.)
The result of this is that you can't just use a for
to iterate through the char
s of your std::string
the way that you are because you basically assume that one char
equals one letter, which is simply not the case.
In the case of ą
for example, it'll be encoded as bytes C4 85
, so you will have one char
that will have the value 0xC4
(= 196) followed by another char
of value 0x85
(= 133).
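To see the "one char is not one letter" point in practice, here is a small sketch, assuming C++14 or later and valid UTF-8 input. The helper names is_continuation and count_code_points are made up for illustration; the letters are written as byte escapes so the result does not depend on how the source file is saved.

```cpp
#include <cstddef>

// A UTF-8 continuation byte has the bit pattern 10xxxxxx.
constexpr bool is_continuation(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Counts letters (code points) in a NUL-terminated UTF-8 string by
// counting every byte that is not a continuation byte.
constexpr std::size_t count_code_points(const char* s) {
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (!is_continuation(static_cast<unsigned char>(*s))) {
            ++count;
        }
    }
    return count;
}

// "\xC4\x85" is ą spelled out as its two UTF-8 bytes.
static_assert(sizeof("\xC4\x85") - 1 == 2, "ą takes two chars");
static_assert(count_code_points("\xC4\x85") == 1, "but it is one letter");
static_assert(count_code_points("k\xC4\x85t") == 3, "k, ą, t");
```

A plain for loop over the chars would visit four positions in the last string, even though it only holds three letters.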
The Latin Extended-A part of the Unicode table (archive) fortunately shows us that these special capital letters come right before their lowercase counterparts.
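More than that, we can see that:
From 0x100 to 0x137 and from 0x14A to 0x177, lowercase letters have odd code points.
From 0x139 to 0x148 and from 0x179 to 0x17E, lowercase letters have even code points.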
This makes it easier to convert lowercase code points to uppercase ones: all we have to do is check whether a character's code point corresponds to a lowercase letter, and if so, subtract one from it to make it uppercase.
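The subtract-one rule can be sketched on code points alone, before any UTF-8 handling. This is only an illustration, assuming C++14 or later; the function name upper_latin_extended_a is made up, and the ranges are the ones checked by the rewritten to_upper in this answer. One known exception: the rule maps the dotless ı (0x131) to İ (0x130), whereas its standard uppercase form is a plain I.

```cpp
#include <cstdint>

// Hypothetical helper: uppercases a Latin Extended-A code point using
// the parity rule. Code points outside the listed ranges are returned as-is.
constexpr std::uint16_t upper_latin_extended_a(std::uint16_t cp) {
    const bool lower_is_odd =
        (cp >= 0x100 && cp <= 0x137) || (cp >= 0x14A && cp <= 0x177);
    const bool lower_is_even =
        (cp >= 0x139 && cp <= 0x148) || (cp >= 0x179 && cp <= 0x17E);
    const bool is_odd = (cp % 2) == 1;
    if ((lower_is_odd && is_odd) || (lower_is_even && !is_odd)) {
        return static_cast<std::uint16_t>(cp - 1);
    }
    return cp;
}

static_assert(upper_latin_extended_a(0x105) == 0x104, "ą -> Ą");
static_assert(upper_latin_extended_a(0x142) == 0x141, "ł -> Ł");
static_assert(upper_latin_extended_a(0x15B) == 0x15A, "ś -> Ś");
static_assert(upper_latin_extended_a(0x17C) == 0x17B, "ż -> Ż");
static_assert(upper_latin_extended_a(0x104) == 0x104, "Ą stays Ą");
```

Note that ó is not covered here: its code point is 0xF3, which lies in Latin-1 Supplement rather than Latin Extended-A.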
To encode these in UTF-8 (source), two bytes are needed:
First byte: 110xxxxx, where xxxxx is replaced with the upper five bits of the character's binary code point.
Second byte: 10xxxxxx, where xxxxxx is replaced with the lower six bits of the character's binary code point.
So for ą, the value is 0x105 in hex, so 00100 000101 in binary (padded to 11 bits).
The first byte value is then 110 00100 (= 0xC4).
The second byte value is then 10 000101 (= 0x85).
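The bit manipulation above can be checked with a short sketch. It only handles the two-byte case (code points from 0x80 to 0x7FF), and the names TwoBytes, encode_two_bytes and decode_two_bytes are invented for this example.

```cpp
#include <cstdint>

// Holds the two bytes of a two-byte UTF-8 sequence.
struct TwoBytes {
    unsigned char first;
    unsigned char second;
};

// Encodes a code point in the range 0x80..0x7FF as two UTF-8 bytes.
constexpr TwoBytes encode_two_bytes(std::uint16_t cp) {
    return TwoBytes{
        static_cast<unsigned char>(0xC0 | ((cp >> 6) & 0x1F)),  // 110xxxxx: upper five bits
        static_cast<unsigned char>(0x80 | (cp & 0x3F))          // 10xxxxxx: lower six bits
    };
}

// Reverses the encoding: strips the 110 and 10 markers and rejoins the bits.
constexpr std::uint16_t decode_two_bytes(unsigned char first, unsigned char second) {
    return static_cast<std::uint16_t>(((first & 0x1F) << 6) | (second & 0x3F));
}

static_assert(encode_two_bytes(0x105).first == 0xC4, "first byte of ą");
static_assert(encode_two_bytes(0x105).second == 0x85, "second byte of ą");
static_assert(decode_two_bytes(0xC4, 0x85) == 0x105, "round trip for ą");
```

The same functions give C5 81 for Ł (0x141), which matches its UTF-8 encoding.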
Note that this encoding 'technique' works because the characters you want to capitalize have their value (code point) between 0x80 and 0x7FF. The byte layout changes depending on how high the value is, see documentation here.
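For reference, the number of bytes UTF-8 needs for a given code point can be sketched as follows. The function name utf8_length is made up, and the sketch deliberately ignores the surrogate range, which is not valid in UTF-8.

```cpp
#include <cstdint>

// Hypothetical helper: number of bytes UTF-8 uses for a code point,
// or 0 if the value is above the Unicode maximum (0x10FFFF).
constexpr int utf8_length(std::uint32_t cp) {
    return cp < 0x80      ? 1   // 0xxxxxxx
         : cp < 0x800     ? 2   // 110xxxxx 10xxxxxx
         : cp < 0x10000   ? 3   // 1110xxxx 10xxxxxx 10xxxxxx
         : cp <= 0x10FFFF ? 4   // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                          : 0;
}

static_assert(utf8_length(0x41) == 1, "A is plain ASCII");
static_assert(utf8_length(0xF3) == 2, "ó takes two bytes");
static_assert(utf8_length(0x17C) == 2, "ż takes two bytes");
static_assert(utf8_length(0x20AC) == 3, "the euro sign takes three bytes");
```

All nine Polish letters fall in the two-byte row, which is why the two-byte scheme is enough here.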
I have rewritten your to_upper
function according to what I have written so far:
string to_upper(string nazwa)
{
for (int i = 0; i < (int)nazwa.size(); i++)
{
// Getting the current character we are working with
unsigned char chr1 = nazwa[i]; // unsigned, so toupper never receives a negative value
// We want to find UTF-8-encoded Polish letters here
// So we are looking for a byte that has its first three bits set to 110,
// as all Polish letters encoded in UTF-8 are in the two-byte range (0x80 to 0x7FF),
// the first byte being of binary value 110xxxxx
// (a lead byte with nothing after it is left alone, so we never read past the end)
if(((chr1 >> 5) & 0b111) != 0b110 || i + 1 >= (int)nazwa.size()) {
nazwa[i] = toupper(chr1); // Do the std toupper here for regular characters
continue;
}
// If we are here, then the character we are dealing with is two bytes long, so get its value.
// We won't need to check for that second byte during next iteration, so we increment i
i++;
char chr2 = nazwa[i];
// Get the unicode value of the encoded character
uint16_t fullChr = ((chr1 & 0b11111) << 6) | (chr2 & 0b111111);
// Get the various conditions to check for lowercase code points
bool lowercaseIsOdd = (fullChr >= 0x100 && fullChr <= 0x137) || (fullChr >= 0x14A && fullChr <= 0x177);
bool lowercaseIsEven = (fullChr >= 0x139 && fullChr <= 0x148) || (fullChr >= 0x179 && fullChr <= 0x17E);
bool chrIndexIsOdd = (fullChr % 2) == 1;
// Depending on whether lowercase code points are odd or even in this range, and on whether this code point
// is odd or even, decrease it by one to make it uppercase
if((lowercaseIsOdd && chrIndexIsOdd)
|| (lowercaseIsEven && !chrIndexIsOdd))
fullChr--;
// Support for some additional, more commonly used accented letters from Latin-1 Supplement (à to ö, which includes ó)
if(fullChr >= 0xE0 && fullChr <= 0xF6)
fullChr -= 0x20;
// Re-encode the code point in UTF-8
// We incremented i earlier, so i-1 is the first byte of the letter we're re-encoding
nazwa[i-1] = (0b110 << 5) | ((fullChr >> 6) & 0b11111);
nazwa[i] = (0b10 << 6) | (fullChr & 0b111111);
}
return nazwa;
}
Note: don't forget to #include <cstdint>
for uint16_t
to work.
Note 2: I have added support for some Latin 1 Supplement (archive) letters because you asked for it in comments. Although we subtract 0x20
from lowercase code points to get the uppercase ones, it is pretty much the same principle as for other letters I have covered in this answer.
I have included lots of comments in my code, please consider reading them for a better understanding.
I have tested it with the string "ĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž"
and it converted it to "ĀĀĂĂĄĄĆĆĈĈĊĊČČĎĎĐĐĒĒĔĔĖĖĘĘĚĚĜĜĞĞĠĠĢĢĤĤĦĦĨĨĪĪĬĬĮĮİİIJIJĴĴĶĶĸĹĹĻĻĽĽĿĿŁŁŃŃŅŅŇŇŊŊŌŌŎŎŐŐŒŒŔŔŖŖŘŘŚŚŜŜŞŞŠŠŢŢŤŤŦŦŨŨŪŪŬŬŮŮŰŰŲŲŴŴŶŶŸŹŹŻŻŽŽ"
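, so it works as expected for this block (one caveat: the dotless ı becomes İ here, whereas its standard uppercase form is a plain I):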
int main() {
string str1 = "ĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŊŋŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽž";
string str2 = to_upper(str1);
printf("str1: %s\n", str1.c_str());
printf("str2: %s\n", str2.c_str());
}
Note: Most terminals use UTF-8 by default and Qt labels accept it as well; pretty much everything uses UTF-8, except the Windows CMD. So if you are testing the above code in Windows CMD or PowerShell, you need to switch them to UTF-8 using the command chcp 65001
, or by adding a Windows API call that changes the console encoding when your program starts.
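A minimal sketch of the API route, assuming a console program: SetConsoleOutputCP and CP_UTF8 come from the Windows API, while the wrapper name enable_utf8_console is made up for this example. On other platforms the function does nothing and simply reports success.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: switches the console output code page to UTF-8 on
// Windows (the programmatic counterpart of "chcp 65001"). Returns true on
// success. On non-Windows platforms there is nothing to do.
bool enable_utf8_console() {
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true;
#endif
}
```

Calling enable_utf8_console() once at the start of main, before the first printf, should be enough for the test program above.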
Note 2: When you write string literals directly in your code, your compiler will usually encode them in UTF-8 (GCC and Clang do so by default; MSVC needs the /utf-8 option). This is why my version of the to_upper
function works with Polish letters written directly in code without further modifications. When I say that nearly everything uses UTF-8, I mean it.
Note 3: I kept it to avoid causing problems with your current code, but you use string
instead of std::string
, implying that you have a using namespace std;
somewhere in your code. In which case, please see Why is "using namespace std;" considered bad practice?
Please keep in mind that my answer is very specific to your case. It aims, as you asked, to capitalize Polish letters.
Other answers rely on std
features which are apparently more universal and work with all languages, so I'd invite you to give them a look.
It's always better to rely on existing features rather than reinventing the wheel, but I think it's also good to have a self-made alternative that might be easier to understand and sometimes is more efficient.