Tags: c++, winapi, visual-c++, character-encoding, shift-jis

Converting a Shift-JIS encoded file to UTF-8 in C++


I am trying to convert a Shift-JIS file to UTF-8 with the code below, but the output file contains corrupted characters when I open it. It looks like something is missing here. Any thoughts?

// From file
FILE* shiftJisFile = _tfopen(lpszShiftJs, _T("rb"));
int nLen = _filelength(fileno(shiftJisFile));
LPSTR lpszBuf = new char[nLen];
fread(lpszBuf, 1, nLen, shiftJisFile);

// convert multibyte to  wide char
int utf16size = ::MultiByteToWideChar(CP_ACP, 0, lpszBuf, -1, 0, 0);
LPWSTR pUTF16 = new WCHAR[utf16size];
::MultiByteToWideChar(CP_ACP, 0, lpszBuf, -1, pUTF16, utf16size);

wstring str(pUTF16);

// convert wide char to multi byte utf-8 before writing to a file
fstream File("filepath", std::ios::out);
string result = string();
result.resize(WideCharToMultiByte(CP_UTF8, 0, str.c_str(), -1, NULL, 0, 0, 0));
char* ptr = &result[0];
WideCharToMultiByte(CP_UTF8, 0, str.c_str(), -1, ptr, result.size(), 0, 0);
File << result;

File.close();

Solution

  • There are multiple problems.

    The first problem is that the output file must be opened in binary mode, for the same reason the input file is. On Windows, a text-mode stream translates line endings, so the bytes that reach the disk are not exactly the bytes you produced.

    fstream File("filepath", std::ios::out | std::ios::binary);
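
    To illustrate the point about binary mode, here is a minimal, portable sketch of a byte-exact round trip. The helper names writeBytes and readBytes and the file name used in the note below are made up for this example and do not come from the question.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Write the bytes exactly as given. std::ios::binary disables the newline
// translation that a text-mode stream performs on Windows.
bool writeBytes(const std::string& path, const std::string& bytes) {
    std::ofstream out(path, std::ios::out | std::ios::binary);
    if (!out) return false;
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(out);
}

// Read the whole file back, again without any translation.
std::string readBytes(const std::string& path) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

    With both streams in binary mode, readBytes("demo.tmp") returns exactly what writeBytes("demo.tmp", data) was given, including any '\n' and '\r' bytes.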
    

    The second problem is that you read the raw bytes of the input file and then treat them as a C string, but those bytes have no terminating null character. If you call MultiByteToWideChar with a length of -1, it infers the input length from the terminating null character, which is missing in your case, so it reads past the end of the buffer. That means both utf16size and the contents of pUTF16 are already wrong. Add the terminator manually after reading the file:

    int nLen = _filelength(fileno(shiftJisFile));
    LPSTR lpszBuf = new char[nLen+1];
    fread(lpszBuf, 1, nLen, shiftJisFile);
    lpszBuf[nLen] = 0;
    

    The last problem is that you are using CP_ACP. That means "the system's default ANSI code page", which is only Shift-JIS on a system configured for Japanese. In your question, you were specifically asking how to convert Shift-JIS. The Windows code page closest to what is commonly called "Shift-JIS" is 932 (you can look that up on Wikipedia, for example). So use 932 instead of CP_ACP:

    int utf16size = ::MultiByteToWideChar(932, 0, lpszBuf, -1, 0, 0);
    LPWSTR pUTF16 = new WCHAR[utf16size];
    ::MultiByteToWideChar(932, 0, lpszBuf, -1, pUTF16, utf16size);
    

    Additionally, there is no reason to create wstring str(pUTF16). Just use pUTF16 directly in the WideCharToMultiByte calls.

    Also, char *ptr = &result[0] is legal since C++11, which guarantees that std::string storage is contiguous, but I personally would not create a string specifically as a buffer for this.
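
    If you do keep a std::string as the output buffer, the usual pattern is: query the size, resize, fill, then drop the terminator. The sketch below shows that pattern under stated assumptions. copyWithTerminator is a hypothetical stand-in, written only so the example runs on any platform. It mimics the one relevant property of WideCharToMultiByte called with a source length of -1: the size it reports and writes includes the terminating '\0'.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a Win32-style conversion call (not a real API).
// With dst == nullptr it only reports the required size, and that size
// INCLUDES the terminating '\0'.
int copyWithTerminator(const char* src, char* dst, int dstSize) {
    const int needed = static_cast<int>(std::strlen(src)) + 1;
    if (dst == nullptr || dstSize == 0) return needed;  // size query only
    if (dstSize < needed) return 0;                     // buffer too small
    std::memcpy(dst, src, static_cast<std::size_t>(needed));
    return needed;
}

// Size query, resize, fill, then remove the terminator the API counted.
std::string toString(const char* src) {
    std::string result;
    const int size = copyWithTerminator(src, nullptr, 0);
    if (size <= 0) return result;
    result.resize(static_cast<std::size_t>(size));
    // Writing through &result[0] is fine since C++11 (contiguous storage).
    const int written = copyWithTerminator(src, &result[0], size);
    result.resize(written > 0 ? static_cast<std::size_t>(written - 1) : 0);
    return result;
}
```

    The last resize matters for the code in the question: there result.size() still counts the terminator, so File << result writes a stray '\0' byte at the end of the output file, whereas streaming a char* stops at the terminator.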

    Here is the corrected code. I would personally not write it this way, but I don't want to impose my coding ideology on you, so I made only the changes necessary to fix it:

    // From file
    FILE* shiftJisFile = _tfopen(lpszShiftJs, _T("rb"));
    int nLen = _filelength(fileno(shiftJisFile));
    LPSTR lpszBuf = new char[nLen+1];
    fread(lpszBuf, 1, nLen, shiftJisFile);
    lpszBuf[nLen] = 0;
    
    // convert multibyte to  wide char
    int utf16size = ::MultiByteToWideChar(932, 0, lpszBuf, -1, 0, 0);
    LPWSTR pUTF16 = new WCHAR[utf16size];
    ::MultiByteToWideChar(932, 0, lpszBuf, -1, pUTF16, utf16size);
    
    // convert wide char to multi byte utf-8 before writing to a file
    fstream File("filepath", std::ios::out | std::ios::binary);
    string result;
    result.resize(WideCharToMultiByte(CP_UTF8, 0, pUTF16, -1, NULL, 0, 0, 0));
    char *ptr = &result[0];
    WideCharToMultiByte(CP_UTF8, 0, pUTF16, -1, ptr, result.size(), 0, 0);
    File << ptr;
    
    File.close();
    

    Also, you have a memory leak: lpszBuf and pUTF16 are never freed. Call delete[] on both once you are done with them. Similarly, shiftJisFile is never closed with fclose.