c++ winapi unicode desktop-application mbcs

How do I write MBCS files from a UNICODE application?


My question seems to have confused folks. Here's something concrete:

Our code does the following:

FILE * fout = _tfsopen(_T("丸穴種類.txt"), _T("w"), _SH_DENYNO);
_fputts(W2T(L"刃物種類\n"), fout);
fclose(fout);

Under MBCS build target, the above produces a properly encoded file for code page 932 (assuming that 932 was the system default code page when this was run).

Under UNICODE build target, the above produces a garbage file full of ????.

I want to define a symbol, or use a compiler switch, or include a special header, or link to a given library, to make the above continue to work when the build target is UNICODE without changing the source code.

Here's the question as it used to exist:

FILE* streams can be opened in t(ranslated) or b(inary) modes. Desktop applications can be compiled for UNICODE or MBCS (under Windows).

If my application is compiled for MBCS, then writing MBCS strings to a "wt" stream results in a well-formed text file containing MBCS text for the system code page (i.e. the code page "for non Unicode software").

Because our software generally uses the _t versions of most string & stream functions, in MBCS builds output is handled primarily by puts(pszMBString) or something similar (putc, etc.). Since pszMBString is already in the system code page (e.g. 932 when running on a Japanese machine), the string is written out verbatim (although line terminators are massaged by puts and gets automatically).

However, if my application is compiled for UNICODE, then writing MBCS strings to a "wt" stream results in garbage (lots of "?????" characters). That is, I convert the UNICODE text to the system's default code page and then write the result to the stream using, for example, fwrite(pszNarrow, 1, length, stream).


I can open my streams in binary mode, in which case I'll get the correct MBCS text... but the line terminators will no longer be PC-style CR+LF; they will be UNIX-style LF only. This is because in binary (non-translated) mode, the file stream doesn't perform the LF->CR+LF translation.


But what I really need, is to be able to produce the exact same files I used to be able to produce when compiling for MBCS: correct line terminators and MBCS text files using the system's code page.

Obviously I can manually adjust the line terminators myself and use binary streams. However, this is a very invasive approach, as I now have to find every bit of code throughout the system that writes text files, and alter it so that it does all of this correctly. What blows my mind is that the UNICODE target is stupider / less capable than the MBCS target we used to use! Surely there is a way to toggle the C library to say "output narrow strings as-is but handle line terminators properly, exactly as you'd do in MBCS builds"?!


Solution

  • Sadly, this is a huge topic that deserves a small book devoted to it. And that book would basically need a specialized chapter for every target platform one wished to build for (Linux, Windows [flavor], Mac, etc.).

    My answer only covers Windows desktop applications, compiled as C++ with or without MFC. Please note: this is about reading and writing MBCS (narrow) files from a UNICODE build using the system default code page (i.e. the code page for non-Unicode software). If you want to read and write Unicode files from a UNICODE build, you must open the files in binary mode and handle the BOM and line-ending conversions manually. On input, skip the BOM (if any), convert the external encoding to Windows Unicode (i.e. UTF-16LE), and convert any CR+LF sequences to LF only. On output, write the BOM (if any), convert from UTF-16LE to whatever target encoding you want, and convert LF to CR+LF so that the result is a properly formatted PC text file.
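    To make the output half of that concrete, here is a minimal sketch, assuming the target encoding is UTF-16LE and the text is already held as UTF-16 code units. The function name encode_utf16le_text is made up for illustration; it is not part of any library.

```cpp
#include <string>

// Hypothetical helper: turn UTF-16 code units into the bytes of a UTF-16LE
// text file. The result is meant for a stream opened in BINARY mode ("wb"),
// so nothing translates the bytes a second time.
std::string encode_utf16le_text(const std::u16string& text) {
    std::string bytes;
    auto put = [&bytes](char16_t unit) {
        bytes.push_back(static_cast<char>(unit & 0xFF));         // low byte first
        bytes.push_back(static_cast<char>((unit >> 8) & 0xFF));  // then high byte
    };

    put(0xFEFF);  // byte order mark, stored as the bytes FF FE

    char16_t prev = 0;
    for (char16_t unit : text) {
        // Expand a bare LF to CR+LF, but leave an existing CR+LF pair alone.
        if (unit == u'\n' && prev != u'\r') {
            put(u'\r');
        }
        put(unit);
        prev = unit;
    }
    return bytes;
}
```

    The returned bytes can then be written with fwrite(bytes.data(), 1, bytes.size(), stream) to a stream opened with "wb". Reading is the mirror image: skip the BOM, decode, and collapse CR+LF back to LF.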

    BEWARE of puts, gets, fwrite and friends in MS's standard C library: if the stream was opened in text/translated mode, they convert every 0x0A to a 0x0D 0x0A sequence on write, and vice versa on read, regardless of whether you're reading or writing a single byte, a wide character, or a stream of random binary data. The library doesn't care; all of these functions boil down to blind byte conversions in text/translated mode!!!
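    To see why that matters for anything other than narrow text, here is a small portable model of the write-side translation, in which every LF byte (0x0A) becomes CR+LF (0x0D 0x0A). It is a sketch of the behavior, not the CRT's actual code, and simulate_text_mode_write is a made-up name.

```cpp
#include <string>

// Hypothetical model of what a text-mode stream does to the bytes handed to
// fwrite/puts: each 0x0A byte gets a 0x0D inserted in front of it, with no
// regard for what the byte means in the caller's encoding.
std::string simulate_text_mode_write(const std::string& bytes) {
    std::string out;
    out.reserve(bytes.size());
    for (char b : bytes) {
        if (b == '\n') {
            out.push_back('\r');
        }
        out.push_back(b);
    }
    return out;
}
```

    For example, the UTF-16LE encoding of U+010A is the two bytes 0A 01. Passed through this translation it becomes 0D 0A 01: three bytes, so every code unit after it is misaligned. This is the reason Unicode output has to go through binary streams.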

    Also be aware that many of the Windows API functions use CP_ACP internally, without any external control over their behavior (e.g. WritePrivateProfileString()). That is why one might want to ensure that all libraries operate with the same code page, CP_ACP and not some other one: since you can't control the behavior of some of these functions, you're forced either to conform to their choice or not to use them at all.

    If using MFC, one needs to:

    // force CP_ACP *not* CP_THREAD_ACP for MFC CString auto-converters!!!
    // this makes MFC's CString and CStdioFile and other interfaces use the
    // system default code page, instead of the thread default code page (which is normally "c")
    #define _CONVERSION_DONT_USE_THREAD_LOCALE  
    

    For C++ and C libraries, one must tell the libraries to use the system code page:

    // force C++ and C libraries based on setlocale() to use system locale for narrow strings
    // (this automatically calls setlocale() which makes the C library do the same thing as C++ std lib)
    // we only change the LC_CTYPE, not collation or date/time formatting
    std::locale::global(std::locale(str(boost::format(".%||") % GetACP()).c_str(), std::locale::ctype));
    

    I do the #define in all of my precompiled headers, before including any other headers. I set the global locale in main (or its moral equivalent), once for the entire program (you may need to call this for every thread that is going to do I/O or string conversions).
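    The locale call above uses Boost.Format only to build the ".<code page>" name. If Boost is not available, something along these lines should do the same job. This is a sketch under assumptions: both function names are made up, it uses the standard three-argument std::locale constructor rather than the two-argument form shown above, and whether the C library's setlocale() state follows along is implementation-dependent, so check it with setlocale(LC_CTYPE, nullptr) on your toolchain.

```cpp
#include <locale>
#include <stdexcept>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: build the MSVC-style locale name for a code page,
// e.g. 932 -> ".932".
std::string code_page_locale_name(unsigned int code_page) {
    return "." + std::to_string(code_page);
}

// Hypothetical helper: make the global locale use the system code page for
// LC_CTYPE only. Returns false if the locale name is rejected, or when built
// for a platform other than Windows.
bool use_system_code_page_for_ctype() {
#ifdef _WIN32
    try {
        // Copy the current global locale, replacing only its ctype category.
        const std::locale acp(std::locale(),
                              code_page_locale_name(GetACP()).c_str(),
                              std::locale::ctype);
        std::locale::global(acp);
        return true;
    } catch (const std::runtime_error&) {
        return false;  // the runtime did not accept the locale name
    }
#else
    return false;
#endif
}
```

    Call use_system_code_page_for_ctype() once near the top of main, before any narrow I/O or string conversion takes place.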

    The build target is UNICODE, and for most of our I/O we convert strings explicitly via CStringA(my_wide_string) before writing them out.
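    For code that doesn't use MFC/ATL, roughly the same conversion can be done with WideCharToMultiByte and CP_ACP. The sketch below is not what CStringA does internally; the names to_system_code_page and put_line are made up, and the non-Windows branch is only a stand-in (using the current C locale) so that the example builds elsewhere.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: convert wide text to a narrow string in the system
// default code page. Returns an empty string if the conversion fails.
std::string to_system_code_page(const std::wstring& wide) {
    if (wide.empty()) return std::string();
#ifdef _WIN32
    const int length = static_cast<int>(wide.size());
    // First call: ask how many bytes the narrow string needs.
    const int needed = WideCharToMultiByte(CP_ACP, 0, wide.data(), length,
                                           nullptr, 0, nullptr, nullptr);
    if (needed <= 0) return std::string();
    std::string narrow(static_cast<std::size_t>(needed), '\0');
    // Second call: perform the conversion into the buffer.
    const int written = WideCharToMultiByte(CP_ACP, 0, wide.data(), length,
                                            &narrow[0], needed, nullptr, nullptr);
    if (written != needed) return std::string();
    return narrow;
#else
    // Stand-in for non-Windows builds: convert using the current C locale.
    std::vector<char> buffer(wide.size() * MB_CUR_MAX + 1);
    const std::size_t n = std::wcstombs(buffer.data(), wide.c_str(), buffer.size());
    if (n == static_cast<std::size_t>(-1)) return std::string();
    return std::string(buffer.data(), n);
#endif
}

// Hypothetical helper: write one line to a NARROW stream opened in text mode
// ("w" or "wt"), so the C library still expands LF to CR+LF on Windows.
bool put_line(std::FILE* stream, const std::wstring& line) {
    const std::string narrow = to_system_code_page(line);
    return std::fputs(narrow.c_str(), stream) >= 0 &&
           std::fputc('\n', stream) != EOF;
}
```

    The point of the design is that the stream itself stays a plain narrow text-mode stream, exactly as in the MBCS build; only the strings are converted before they reach it.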

    One other thing to be aware of: there are two different sets of multibyte functions in the C standard library under VS C++. One set uses the thread's locale for its operations; the other uses the multibyte code page set by _setmbcp() (which you can query via _getmbcp()). This is the actual code page (not a locale) that is used for all narrow string interpretation (NOTE: it is always initialized to CP_ACP, i.e. GetACP(), by the VS C++ startup code).

    Useful reference materials:
    - The secret family split in Windows code page functions
    - Sorting it all out (explains that there are four different locales in effect in Windows)
    - MS offers some functions that allow you to set the encoding to use directly, but I didn't explore them
    - An important note about a change to MFC that caused it to no longer respect CP_ACP, but rather CP_THREAD_ACP by default starting in MFC 7.0
    - Exploration of why console apps in Windows are extreme FAIL when it comes to Unicode I/O
    - MFC/ATL narrow/wide string conversion macros (which I don't use, but you may find useful)
    - Byte order marker, which you need to write out for Unicode files of any encoding to be understood by other Windows software