c++ string unicode qstring qchar

QChar::isLetterOrNumber() fails


I want to convert QStrings into filenames. Since I'd like the filename to look clean, I want to replace every character that is not a letter or a number with an underscore. The following code should do that.

#include <iostream>
#include <QString>

QString makeFilename(const QString& title)
{
    QString result;
    for(QString::const_iterator itr = title.begin(); itr != title.end(); ++itr)
        result.push_back(itr->isLetterOrNumber() ? itr->toLower() : '_');
    return result;
}

int main()
{
    QString str = "§";
    std::cout << makeFilename(str).toAscii().data() << std::endl;
}

However, on my computer this does not work. I get the following output:

�_

Looking for an explanation, I found while debugging that QString("§").size() returns 2, whereas QString("a").size() returns 1.

My questions:


Solution

  • In addition to what others have said, keep in mind that a QString is a UTF-16 encoded string. A Unicode character that is outside of the BMP requires 2 QChar values working together, called a surrogate pair, in order to encode that character. The QString documentation says as much:

    Unicode characters with code values above 65535 are stored using surrogate pairs, i.e., two consecutive QChars.

    You are not taking that into account when looping through the QString. You are looking at each QChar individually without checking if it belongs to a surrogate pair or not.

    Try this instead:

    QString makeFilename(const QString& title) 
    { 
        QString result; 
    
        QString::const_iterator itr = title.begin();
        QString::const_iterator end = title.end();
    
        while (itr != end)
        {
            if (!itr->isHighSurrogate())
            {
                if (itr->isLetterOrNumber())
                {
                    result.push_back(itr->toLower()); 
                    ++itr;
                    continue;
                }
            }
            else
            {
                ++itr;
                if (itr == end)
                    break; // error - missing low surrogate
    
                if (!itr->isLowSurrogate())
                    break; // error - not a low surrogate
    
                /*
                Letters and numbers do exist outside the BMP (Deseret
                letters, CJK Extension B ideographs, mathematical digits).
                To keep them instead of replacing them, combine the pair
                with QChar::surrogateToUcs4() and use QChar::category() to
                check whether it is a Unicode letter/number codepoint...
    
                uint ch = QChar::surrogateToUcs4(*(itr-1), *itr);
                QChar::Category cat = QChar::category(ch);
                if (
                    ((cat >= QChar::Number_DecimalDigit) && (cat <= QChar::Number_Other)) ||
                    ((cat >= QChar::Letter_Uppercase) && (cat <= QChar::Letter_Other))
                    )
                {
                    // QChar(ch) would truncate to 16 bits, so re-encode as a pair
                    uint lower = QChar::toLower(ch);
                    result.push_back(QChar(QChar::highSurrogate(lower)));
                    result.push_back(QChar(QChar::lowSurrogate(lower)));
                    ++itr;
                    continue;
                }
                */
            }
    
            result.push_back('_');
            ++itr; 
        }
    
        return result; 
    }
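
    If it helps to see what "two QChars working together" means at the bit level, here is a minimal sketch that uses only the standard library rather than Qt. std::u16string stands in for QString, and the names isHighSurrogate, isLowSurrogate, decodeUtf16 and appendCodePoint are my own hypothetical helpers, not Qt API. Replacing an unpaired surrogate with U+FFFD is just one possible policy that I assumed for the example.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <string>
    #include <vector>

    // True if the UTF-16 code unit is the first half of a surrogate pair.
    inline bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

    // True if the UTF-16 code unit is the second half of a surrogate pair.
    inline bool isLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

    // Decode UTF-16 code units into code points. An unpaired surrogate is
    // replaced by U+FFFD (assumed policy for this sketch).
    std::vector<char32_t> decodeUtf16(const std::u16string& text)
    {
        std::vector<char32_t> codePoints;
        for (std::size_t i = 0; i < text.size(); ++i)
        {
            const char16_t unit = text[i];
            if (isHighSurrogate(unit) && i + 1 < text.size() && isLowSurrogate(text[i + 1]))
            {
                const char32_t high = unit - 0xD800;
                const char32_t low = text[i + 1] - 0xDC00;
                codePoints.push_back(0x10000 + (high << 10) + low);
                ++i; // the low surrogate has been consumed as well
            }
            else if (isHighSurrogate(unit) || isLowSurrogate(unit))
            {
                codePoints.push_back(0xFFFD); // unpaired surrogate
            }
            else
            {
                codePoints.push_back(unit); // BMP character: one unit, one code point
            }
        }
        return codePoints;
    }

    // Append one code point to a UTF-16 string, splitting it into a surrogate
    // pair when it does not fit into a single 16-bit unit.
    void appendCodePoint(std::u16string& out, char32_t cp)
    {
        assert(cp <= 0x10FFFF); // largest valid Unicode code point
        if (cp < 0x10000)
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            const char32_t offset = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF)));
        }
    }
    ```

    With this, decodeUtf16(u"\U00010400") yields a single code point even though the string holds two 16-bit units, and appendCodePoint writes that code point back as a pair. This is the same bookkeeping that the surrogate branch of the loop above has to do.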