c++ unicode com type-conversion bstr

How to convert CComVariant bstr to CString


I'm a newbie with C++ and I've taken over a COM project to fix some issues. The current issue I'm working on is handling UTF8 strings. I have this piece of code:

// CString strValue;
CStringW strValue; 
CComVariant* val = &(*result)[i].minValue;
switch (val->vt)
{
case VT_BSTR:   
    //strValue = OLE2CA(val->bstrVal);
    strValue = OLE2W(val->bstrVal); // Works
    (*result)[i].name = strValue; // Works
    (*result)[i].expression = "[" + fieldName + "] = \"" + strValue + "\""; // fails
    break;
case VT_R8:     
    //strValue.Format("%g", val->dblVal);
    strValue.Format(L"%g", val->dblVal); // Works
    (*result)[i].name = strValue; // Works
    (*result)[i].expression = "[" + fieldName + "] = " + strValue; // fails
    break;
case VT_I4:     
    //strValue.Format("%i", val->lVal);
    strValue.Format(L"%i", val->lVal); // Works
    (*result)[i].name = strValue; // Works
    (*result)[i].expression = "[" + fieldName + "] = " + strValue; // fails
    break;
}

struct CategoriesData
{
    public:
    CComVariant minValue;
    CComVariant maxValue;
    //CString expression;
    CStringW expression;
    //CString name;
    CStringW name;
    tkCategoryValue valueType;
    int classificationField;
    bool skip;
};

The problem is with this line: strValue = OLE2CA(val->bstrVal); When val->bstrVal is a Unicode string, such as the Russian text Воздух, strValue ends up as ?????.

I tried several approaches and searched the internet, but I can't get strValue to be Воздух. Can a CString contain this kind of text, or should I change to another type? If so, which one?

minValue can be a VT_BSTR, a VT_R8 or a VT_I4.

These are the options I tried so far:

strValue = val->bstrVal;
strValue = Utility::ConvertFromUtf8(val->bstrVal);
strValue = Utility::ConvertToUtf8(val->bstrVal);
temp = Utility::ConvertBSTRToLPSTR(val->bstrVal);
strValue = W2BSTR(Utility::ConvertFromUtf8(temp));
strValue = W2BSTR(val->bstrVal);                
strValue = CW2A(val->bstrVal);
strValue = (CString)val->bstrVal;
strValue = Utility::ConvertToUtf8(OLE2W(val->bstrVal));

Edit: The code for the helper functions:

CStringA ConvertToUtf8(CStringW unicode) {
    USES_CONVERSION;
    CStringA utf8 = CW2A(unicode, CP_UTF8);
    return utf8;
}

CStringW ConvertFromUtf8(CStringA utf8) {
    USES_CONVERSION;
    CStringW unicode = CA2W(utf8, CP_UTF8);
    return unicode;
}

char* ConvertBSTRToLPSTR (BSTR bstrIn)
{
  LPSTR pszOut = NULL;
  if (bstrIn != NULL)
  {
    int nInputStrLen = SysStringLen (bstrIn);

    // Double NULL Termination
    int nOutputStrLen = WideCharToMultiByte(CP_ACP, 0, bstrIn, nInputStrLen, NULL, 0, 0, 0) + 2; 
    pszOut = new char [nOutputStrLen];

    if (pszOut)
    {
      memset (pszOut, 0x00, sizeof (char)*nOutputStrLen);
      WideCharToMultiByte (CP_ACP, 0, bstrIn, nInputStrLen, pszOut, nOutputStrLen, 0, 0);
    }
  }
  return pszOut;
}

Edit2: I added my complete switch statement. When I change strValue from CString to CStringW, I get errors for the other cases, like strValue.Format("%g", val->dblVal); How do I solve this?

Edit3: I already fixed a similar issue, but that was converting to a VARIANT, not from one:

    val->vt = VT_BSTR;
    const char* v = DBFReadStringAttribute(_dbfHandle, _rows[RowIndex].oldIndex, _fields[i]->oldIndex);
    // Old code, not unicode ready:
    //WCHAR *buffer = Utility::StringToWideChar(v);
    //val->bstrVal = W2BSTR(buffer);
    //delete[] buffer;              
    // New code, unicode friendly:
    val->bstrVal = W2BSTR(Utility::ConvertFromUtf8(v)); 

Edit4: Thanks to all the help so far, I managed to make some changes. I've updated my initial code in this post and added all the code of the function. I'm now stuck on this line:

 (*result)[i].expression = "[" + fieldName + "] = \"" + strValue + "\"";    

I can't concatenate CStringW values.

Some more background info: The function is part of MapWinGIS, an open-source GIS application that displays maps (shapefiles). These maps have attribute data, which is stored in DBase IV format and can hold Unicode/UTF-8 text. I already made a fix (see Edit3) to show this text properly in a grid view. The function I'm struggling with now categorizes (groups) the data, for example to give similar values the same color. Each category has a name and an expression, and the expression is parsed later on to do the actual grouping. For example, I have a map with states and I want to give each state a different color. As mentioned before, I'm new to C++ and really outside my comfort zone. I appreciate all the help you have given me, and I hope you will help me once more.


Solution

  • BSTRs "naturally" store length-prefixed Unicode UTF-16 strings, although you could "stretch" a BSTR and store in it a more generic length-prefixed sequence of bytes (but I don't like this usage).

    (For more details on BSTRs, you will find this blog post by Eric Lippert very interesting.)

    So, I'm considering the normal usage of BSTR, which is storing length-prefixed UTF-16 strings.

    If you want to convert a UTF-16 string stored in a BSTR to a UTF-8 string, you can use the WideCharToMultiByte Win32 API with the CP_UTF8 code page identifier (see e.g. this MSDN Magazine article for details, and this reusable code on GitHub).

    You can store the destination UTF-8 string in instances of the std::string class.
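
To make the conversion concrete, here is a minimal portable sketch of what a UTF-16 to UTF-8 conversion does, with the result stored in a std::string. It is hand-written rather than a call to WideCharToMultiByte, purely so that it compiles outside Windows; in the real project the Win32 call with CP_UTF8 would do this job. The name Utf8FromUtf16Portable and the use of std::u16string in place of a BSTR are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of ATL or Win32): encodes UTF-16 code units
// as UTF-8 bytes. Unpaired surrogates become U+FFFD, the replacement character.
std::string Utf8FromUtf16Portable(const std::u16string& utf16) {
    std::string utf8;
    for (std::size_t i = 0; i < utf16.size(); ++i) {
        char32_t cp = utf16[i];
        const bool isHigh = cp >= 0xD800 && cp <= 0xDBFF;
        const bool isLow = cp >= 0xDC00 && cp <= 0xDFFF;
        if (isHigh && i + 1 < utf16.size() &&
            utf16[i + 1] >= 0xDC00 && utf16[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point above U+FFFF.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (utf16[i + 1] - 0xDC00);
            ++i;
        } else if (isHigh || isLow) {
            cp = 0xFFFD;  // unpaired surrogate
        }
        // Invariant: cp is now a valid Unicode scalar value.
        assert(cp <= 0x10FFFF);
        if (cp < 0x80) {
            utf8 += static_cast<char>(cp);                          // 1 byte
        } else if (cp < 0x800) {
            utf8 += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            utf8 += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
            utf8 += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            utf8 += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes
            utf8 += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            utf8 += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return utf8;
}
```

For the Russian text in the question, Utf8FromUtf16Portable(u"Воздух") yields 12 bytes, two per Cyrillic letter, and none of them is a '?'.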

    P.S. If you want to use CStringW for UTF-16 and CStringA for UTF-8 strings, and the ATL CW2A helper for UTF-16/8 conversions, note that you don't need the USES_CONVERSION macro in your code; and you could just take input strings by const& (const reference) as good code hygiene:

    CStringA Utf8FromUtf16(const CStringW &utf16) {
        CStringA utf8 = CW2A(utf16, CP_UTF8);
        return utf8;
    }
    

    RE Edit 2

    Try strValue.Format(L"%g",... with CStringW. The L prefix generates a Unicode UTF-16 string literal for CStringW::Format.
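
The same rule (wide format string for a wide target) can be sketched with the standard library, where std::swprintf plays the role of CStringW::Format. This is only an analogue, not the ATL API; FormatDoubleWide and FormatLongWide are hypothetical names, and std::wstring stands in for CStringW. The sketch uses L"%ld" for a long, which is the portable specifier; the L"%i" in the question works on Windows because LONG and int are both 32 bits there.

```cpp
#include <cassert>
#include <cwchar>
#include <string>

// Hypothetical helper: formats a double with a wide "%g" format string.
std::wstring FormatDoubleWide(double value) {
    wchar_t buffer[64];
    const int written = std::swprintf(buffer, sizeof(buffer) / sizeof(buffer[0]),
                                      L"%g", value);
    // "%g" output always fits in 64 characters, so a negative result would be a bug.
    assert(written >= 0);
    return written >= 0
        ? std::wstring(buffer, static_cast<std::wstring::size_type>(written))
        : std::wstring();
}

// Hypothetical helper: formats a long with a wide "%ld" format string.
std::wstring FormatLongWide(long value) {
    wchar_t buffer[32];
    const int written = std::swprintf(buffer, sizeof(buffer) / sizeof(buffer[0]),
                                      L"%ld", value);
    assert(written >= 0);
    return written >= 0
        ? std::wstring(buffer, static_cast<std::wstring::size_type>(written))
        : std::wstring();
}
```

Passing a narrow "%g" to either call would not compile, which mirrors the error seen with strValue.Format("%g", ...) once strValue is a CStringW.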

    RE Edit 4

    I replied to that in the comments, but for the sake of completeness, to concatenate string literals with CStringW instances, consider decorating these literals with L"...": this defines a Unicode UTF-16 string literal, which is wchar_t-based, and works fine with CStringW objects.

    (*result)[i].expression = L"[" + fieldName + L"] = \"" + strValue + L"\"";
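
As a rough, portable illustration of the concatenation fix, the sketch below builds the same two expression shapes from the question with every literal written as L"...". std::wstring stands in for CStringW, the function names are hypothetical, and the sketch assumes that fieldName is itself a wide string; if it is a narrow CString in the real code, it would have to be converted first.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds  [field] = "value"  for string categories.
// Every literal is wide, so all operands of + share one character type.
std::wstring BuildQuotedExpression(const std::wstring& fieldName,
                                   const std::wstring& value) {
    assert(!fieldName.empty());  // an empty field name gives a meaningless "[]"
    return L"[" + fieldName + L"] = \"" + value + L"\"";
}

// Hypothetical helper: builds  [field] = value  for numeric categories.
std::wstring BuildNumericExpression(const std::wstring& fieldName,
                                    const std::wstring& value) {
    assert(!fieldName.empty());
    return L"[" + fieldName + L"] = " + value;
}
```

With these, BuildQuotedExpression(L"Name", L"Воздух") produces [Name] = "Воздух" with the Cyrillic text intact, since nothing in the chain is narrowed to an ANSI code page.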