
Unicode characters, C++ and libcurl


I use libcurl to download data into a stringstream, and I have a function that parses it:

bool parse()
{
    istringstream temp(buff.str());
    buff.str("");
    string line;
    QString line_QStr, lyrics_QStr;
    while (temp.good())
    {
        getline(temp, line);
        if (QString::fromStdString(line).contains(startMarker)) break;
    }
    if (!temp.good()) return false; // something went wrong

    while (temp.good())
    {
        getline(temp, line);
        if ((line_QStr = QString::fromStdString(line)).contains(endMarker))
        {
            lyrics_QStr += line_QStr.remove(endMarker); // remove the </div>
            break;
        }
        else
        {
            lyrics_QStr += line_QStr;
        }
    }

    if (!temp.good()) return false;

    QTextDocument lyricsHtml;
    lyricsHtml.setHtml(lyrics_QStr);
    lyrics_qstr = lyricsHtml.toPlainText();
    return true;
}

When the text is ASCII-only, everything works fine. But if it contains Unicode characters, they get lost somewhere in this function, and the output looks something like this:

[Screenshot: Unicode chars are messed up]

I use string and getline instead of QTextStream and QString because I couldn't find a counterpart to the good() function, so I couldn't do any decent error handling.

What am I doing wrong in this function that causes each Unicode character to be lost and displayed as two other characters? How can I fix it? Thanks in advance!

EDIT: I changed the parse function to this:

bool LyricsManiaDownloader::parse()
{
    wistringstream temp(string2wstring(buff.str()));
    buff.str("");
    wstring line;
    QString line_QStr, lyrics_QStr;
    while (temp.good())
    {
        getline(temp, line);
        if (QString::fromStdWString(line).contains(startMarker)) break;
    }
    if (!temp.good()) return false; // something went wrong

    while (temp.good())
    {
        getline(temp, line);
        if ((line_QStr = QString::fromStdWString(line)).contains(endMarker))
        {
            lyrics_QStr += line_QStr.remove(endMarker); // remove the </div>
            break;
        }
        else
        {
            lyrics_QStr += line_QStr;
        }
    }

    if (!temp.good()) return false;

    QTextDocument lyricsHtml;
    lyricsHtml.setHtml(lyrics_QStr);
    lyrics_qstr = lyricsHtml.toPlainText();
    return true;
}

And the string2wstring function is

wstring string2wstring(const string &str)
{
    wstring wstr(str.length(), L' ');
    copy(str.begin(), str.end(), wstr.begin());
    return wstr;
}

But there is still a problem with the encoding.

EDIT2: I use this function to save the data into a stringstream:

size_t write_data_to_var(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    ostringstream * stream = (ostringstream*) userdata;
    size_t count = size * nmemb;
    stream->write(ptr, count);
    return count;
}

I pass the std::ostringstream buff to curl, and the web page data is saved there. Then I convert buff.str() to a wstring and use it as the source for a wistringstream. The conversion from std::string to std::wstring is the decoding, isn't it?


Solution

  • The Web server returns a stream of bytes alongside a header that indicates what encoding those bytes should be understood as.
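
To make the difference between copying bytes and decoding them concrete, here is a minimal sketch in standard C++. It assumes the server declares UTF-8, which the question does not confirm. The names widen_bytes and decode_utf8 are made up for this illustration: widen_bytes mirrors what string2wstring does (one output character per byte), while decode_utf8 combines multi-byte sequences into single code points. It is a simplified decoder that does not reject overlong forms or surrogates, so treat it as an illustration rather than production code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: widens each byte on its own, like string2wstring.
// No decoding happens, so a 2-byte UTF-8 sequence becomes 2 characters.
std::u32string widen_bytes(const std::string& bytes) {
    std::u32string out;
    for (unsigned char b : bytes) out.push_back(static_cast<char32_t>(b));
    return out;
}

// Hypothetical helper: decodes UTF-8 bytes into Unicode code points.
// Malformed or truncated input is replaced with U+FFFD.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray byte

        if (i + extra >= bytes.size()) {  // sequence cut off at end of input
            out.push_back(0xFFFD);
            break;
        }
        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);  // append 6 payload bits
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

For the two UTF-8 bytes 0xC3 0xA9 (the letter é), widen_bytes yields two characters, which is the "displayed as two other chars" symptom, while decode_utf8 yields the single code point U+00E9. In the Qt code from the question, QString::fromUtf8 would be the analogous decoding step, again assuming the page really is UTF-8.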