c++ boost unicode codecvt

Stumped with Unicode, Boost, C++, codecvts


I want to work with Unicode in C++. After falling down the rabbit hole of Unicode, I've ended up in a train wreck of confusion, headaches and locales.

Specifically, I've been struggling with two things in Boost: using Unicode file paths, and feeding Unicode input to the Boost program options library. I've read whatever I could find on the subjects of locales, codecvts, Unicode encodings and Boost.

My current attempt is to have a codecvt that takes a UTF-8 string and converts it to the platform's encoding (UTF-8 on POSIX, UTF-16 on Windows). I've been trying to avoid wchar_t.

The closest I've actually gotten is trying to do this with Boost.Locale, to convert from a UTF-8 string to a UTF-32 string on output.

#include <iostream>
#include <string>
#include <boost/locale.hpp>
#include <locale>

int main(void)
{
  std::string data("Testing, 㤹");

  std::locale fromLoc = boost::locale::generator().generate("en_US.UTF-8");
  std::locale toLoc   = boost::locale::generator().generate("en_US.UTF-32");

  typedef std::codecvt<wchar_t, char, mbstate_t> cvtType;
  cvtType const* toCvt = &std::use_facet<cvtType>(toLoc);

  std::locale convLoc = std::locale(fromLoc, toCvt);

  std::cout.imbue(convLoc);
  std::cout << data << std::endl;

  // Output is unconverted -- what?

  return 0;
}

I think I had some other kind of conversion working using wide characters, but I really don't know what I'm even doing. I don't know what the right tool for the job is at this point. Help?


Solution

  • Okay, after a long few months I've figured it out, and I'd like to help people in the future.

    First of all, the codecvt approach was the wrong way of doing it. Boost.Locale provides a simple way of converting between character sets in its boost::locale::conv namespace. Here's one example (there are others that aren't based on locales).

    #include <iostream>
    #include <string>
    #include <boost/locale.hpp>
    namespace loc = boost::locale;
    
    int main(void)
    {
      loc::generator gen;
      std::locale blah = gen.generate("en_US.utf-32");
    
      std::string UTF8String = "Tésting!";
      // from_utf will also work with wide strings as it uses the character size
      // to detect the encoding.
      std::string converted = loc::conv::from_utf(UTF8String, blah);
    
      // Outputs a UTF-32 string.
      std::cout << converted << std::endl;
    
      return 0;
    }
    

    If you replace the "en_US.utf-32" in the example above with "" (an empty string), the generator produces the user's default locale, so the string is converted to the user's encoding instead.

    I still don't know how to make std::cout do this all the time, but the translate() function of Boost.Locale outputs in the user's locale.

    As for using UTF-8 strings with the filesystem library across platforms, that seems to be possible; here's a link to how to do it.