c++ c++11 boost boost-locale

Is there a way to detect Chinese characters in C++? (using Boost)


In a data processing project, I need to detect and split words in Chinese (Chinese words are not separated by spaces). Is there a way to detect Chinese characters using a native C++ feature or the Boost.Locale library?


Solution

  • Here is my attempt using only Boost and the standard library:

    #include <iostream>
    #include <string>
    #include <boost/regex/pending/unicode_iterator.hpp>
    #include <functional>
    #include <algorithm>
    
    using Iter = boost::u8_to_u32_iterator<std::string::const_iterator>;
    
    template <::boost::uint32_t a, ::boost::uint32_t b>
    class UnicodeRange
    {
        static_assert(a <= b, "Proper range");
    public:
        constexpr bool operator()(::boost::uint32_t x) const noexcept
        {
            return x >= a && x <= b;
        }
    };
    
    // Unicode blocks of CJK ideographs (shared by Chinese, Japanese and Korean).
    using UnifiedIdeographs = UnicodeRange<0x4E00, 0x9FFF>;
    using UnifiedIdeographsA = UnicodeRange<0x3400, 0x4DBF>;
    using UnifiedIdeographsB = UnicodeRange<0x20000, 0x2A6DF>;
    using UnifiedIdeographsC = UnicodeRange<0x2A700, 0x2B73F>;
    using UnifiedIdeographsD = UnicodeRange<0x2B740, 0x2B81F>;
    using UnifiedIdeographsE = UnicodeRange<0x2B820, 0x2CEAF>;
    using CompatibilityIdeographs = UnicodeRange<0xF900, 0xFAFF>;
    using CompatibilityIdeographsSupplement = UnicodeRange<0x2F800, 0x2FA1F>;
    
    constexpr bool isChineese(::boost::uint32_t x) noexcept
    {
        return UnifiedIdeographs{}(x) 
        || UnifiedIdeographsA{}(x) || UnifiedIdeographsB{}(x) || UnifiedIdeographsC{}(x) 
        || UnifiedIdeographsD{}(x) || UnifiedIdeographsE{}(x)
        || CompatibilityIdeographs{}(x) || CompatibilityIdeographsSupplement{}(x);
    }
    
    int main()
    {
        std::string s;
        while (std::getline(std::cin, s))
        {
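            // Locate the first run of consecutive ideographs in the line.
            auto start = std::find_if(Iter{s.cbegin()}, Iter{s.cend()}, isChineese);
            auto stop = std::find_if_not(start, Iter{s.cend()}, isChineese);
            // base() returns the underlying UTF-8 iterators, so this prints the run's bytes.
            std::cout << std::string{start.base(), stop.base()} << '\n';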
        }
        
        return 0;
    }
    

    https://wandbox.org/permlink/FtxKa8D2LtR3ko9t

    You should be able to polish this approach into something fully functional. I do not know how to cover it properly with tests, and I am not sure which characters should be included in the check.
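
One possible way to polish the approach above is to collect every run of ideographs in a line instead of only the first one. The sketch below is an assumption-laden illustration, not the answer's original code: it uses only the standard library and a simplified hand-written UTF-8 decoder in place of `boost::u8_to_u32_iterator`, and the names `Range`, `is_cjk_ideograph`, `decode_utf8` and `chinese_runs` are hypothetical. The ranges are the same ones listed in the answer.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper type: an inclusive range of code points.
struct Range { std::uint32_t first, last; };

// True if cp falls in one of the ideograph blocks used in the answer above.
bool is_cjk_ideograph(std::uint32_t cp) noexcept
{
    static const Range ranges[] = {
        {0x4E00, 0x9FFF},   // CJK Unified Ideographs
        {0x3400, 0x4DBF},   // Extension A
        {0x20000, 0x2A6DF}, // Extension B
        {0x2A700, 0x2B73F}, // Extension C
        {0x2B740, 0x2B81F}, // Extension D
        {0x2B820, 0x2CEAF}, // Extension E
        {0xF900, 0xFAFF},   // CJK Compatibility Ideographs
        {0x2F800, 0x2FA1F}  // CJK Compatibility Ideographs Supplement
    };
    for (const Range& r : ranges)
        if (cp >= r.first && cp <= r.last) return true;
    return false;
}

// Decodes one code point starting at s[i] and advances i past it.
// Simplified on purpose: malformed input yields U+FFFD, and overlong
// encodings or surrogates are not rejected.
std::uint32_t decode_utf8(const std::string& s, std::size_t& i)
{
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    std::uint32_t cp = 0;
    int extra = 0;
    if (lead < 0x80)                { cp = lead;        extra = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
    else return 0xFFFD; // stray continuation byte or invalid lead byte
    for (; extra > 0; --extra) {
        if (i >= s.size()) return 0xFFFD; // truncated sequence
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) return 0xFFFD; // not a continuation byte
        cp = (cp << 6) | (c & 0x3F);
        ++i;
    }
    return cp;
}

// Returns every maximal run of consecutive ideographs as a UTF-8 substring.
std::vector<std::string> chinese_runs(const std::string& utf8)
{
    const std::size_t npos = std::string::npos; // npos means "not inside a run"
    std::vector<std::string> runs;
    std::size_t run_start = npos;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const std::size_t char_start = i;
        const bool cjk = is_cjk_ideograph(decode_utf8(utf8, i));
        if (cjk && run_start == npos) {
            run_start = char_start;            // a run begins here
        } else if (!cjk && run_start != npos) {
            runs.push_back(utf8.substr(run_start, char_start - run_start));
            run_start = npos;                  // the run ended before this character
        }
    }
    if (run_start != npos) runs.push_back(utf8.substr(run_start));
    return runs;
}
```

Calling `chinese_runs` on a UTF-8 line that mixes Latin text and ideographs should return the ideograph runs in order. Note that this only separates ideographs from other characters: the blocks are shared with Japanese and Korean text, and a run can contain several words, so real word splitting would likely need a dictionary-based segmenter on top of this.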