To illustrate the problem, here is a minimal reproduction:
#include <iostream>
#include <regex>
#include <string>
#include <Windows.h>

// Convert GBK to UTF-8
std::string GBKToUTF8(const std::string& gbkStr) {
    // 1. Convert GBK to wide characters (UTF-16) first
    int len = MultiByteToWideChar(CP_ACP, 0, gbkStr.c_str(), -1, nullptr, 0);
    std::wstring wstr(len, 0);
    MultiByteToWideChar(CP_ACP, 0, gbkStr.c_str(), -1, &wstr[0], len);
    // 2. Convert the wide characters (UTF-16) to UTF-8
    len = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, nullptr, 0, nullptr, nullptr);
    std::string utf8Str(len, 0);
    WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, &utf8Str[0], len, nullptr, nullptr);
    return utf8Str;
}

int main() {
    // Example ID number, length 18
    std::string id_number = GBKToUTF8("610702199404261983");
    // Check the string length
    std::cout << "Length before: " << id_number.length() << "\n"
              << id_number << std::endl;
    // Regular expression
    const std::regex id_number_pattern18("^([1-6][1-9]|50)\\d{4}(18|19|20)\\d{2}((0[1-9])|10|11|12)(([0-2][1-9])|10|20|30|31)\\d{3}[0-9Xx]$");
    // Try to match
    if (std::regex_match(id_number, id_number_pattern18)) {
        std::cout << "Match successful!" << std::endl;
    } else {
        std::cout << "Match failed!" << std::endl;
    }
    return 0;
}
The problem is that after the id_number string is transcoded to UTF-8, its length changes from 18 to 19. The regex also no longer matches the string (it matches correctly if the string is not transcoded).
I suspect that the transcoding added an invisible character, but I don't know how to fix this.
Here are some screenshots of VS2022 (ISO C++17) debugging for reference (of course, the screenshots are not from the minimal reproduction code, but they should be well understood):
I don't know how to fix this at the moment, so I would appreciate a solution and an explanation of how the problem arises.
The problem is that you are asking MultiByteToWideChar() and WideCharToMultiByte() to include space for an explicit NUL terminator in the length that they return:
[in] cbMultiByte
Size, in bytes, of the string indicated by the lpMultiByteStr parameter. Alternatively, this parameter can be set to -1 if the string is null-terminated. Note that, if cbMultiByte is 0, the function fails.
If this parameter is -1, the function processes the entire input string, including the terminating null character. Therefore, the resulting Unicode string has a terminating null character, and the length returned by the function includes this character.
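You are including that extra space when allocating memory for the std::wstring and std::string. But, unlike C strings, C++ strings do not use a NUL terminator to mark where they end. They can contain embedded NUL characters, which ARE included in their size, and they have an implicit NUL terminator, which is NOT included in their size. As a result, the returned string holds the 18 digits plus one explicit NUL character (length 19), and std::regex_match() fails because it must match the entire string, including that NUL.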
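To see the effect without the Windows API, here is a minimal sketch. The helper names with_trailing_nul and is_18_digits are made up for illustration, and the pattern is a simplified stand-in for the one in the question:

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: appends one explicit NUL character, mimicking what
// the -1 version of GBKToUTF8() leaves at the end of its result.
std::string with_trailing_nul(const std::string& s) {
    std::string out = s;
    out.push_back('\0');                 // embedded NUL, stored like any other char
    assert(out.size() == s.size() + 1);  // size() counts the embedded NUL
    return out;
}

// Simplified stand-in for the ID pattern: exactly 18 digits.
bool is_18_digits(const std::string& s) {
    static const std::regex pattern("^\\d{18}$");
    // regex_match() must consume the whole sequence, embedded NULs included.
    return std::regex_match(s, pattern);
}
```

Calling is_18_digits("610702199404261983") returns true, while is_18_digits(with_trailing_nul("610702199404261983")) returns false, because the 19th character is a NUL that the pattern cannot consume.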
So, you should not treat the C++ strings as being null-terminated. Do not ask the API to count a NUL terminator. Pass the actual string sizes instead, e.g.:
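std::string GBKToUTF8(const std::string& gbkStr) {
    // Both API calls fail when given a length of 0, so handle the empty string up front.
    if (gbkStr.empty()) return std::string();
    // 1. Convert GBK to wide characters (UTF-16) first.
    //    Pass gbkStr.size() instead of -1 so that no NUL terminator is counted.
    int len = MultiByteToWideChar(CP_ACP, 0, gbkStr.c_str(), static_cast<int>(gbkStr.size()), nullptr, 0);
    std::wstring wstr(len, 0);
    MultiByteToWideChar(CP_ACP, 0, gbkStr.c_str(), static_cast<int>(gbkStr.size()), &wstr[0], len);
    // 2. Convert the wide characters (UTF-16) to UTF-8.
    //    Pass wstr.size() instead of -1 for the same reason.
    len = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), static_cast<int>(wstr.size()), nullptr, 0, nullptr, nullptr);
    std::string utf8Str(len, 0);
    WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), static_cast<int>(wstr.size()), &utf8Str[0], len, nullptr, nullptr);
    return utf8Str;
}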
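If you cannot change the conversion function (for example, strings produced by the -1 version are already stored somewhere), a defensive workaround is to trim the extra terminator afterwards. This is only a sketch: strip_trailing_nuls is a hypothetical helper, not part of the Windows API or the standard library, and it assumes that a NUL at the end of the string is never meaningful data:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: removes NUL characters from the end of a string,
// such as the one left behind by a conversion that was called with -1.
std::string strip_trailing_nuls(std::string s) {
    while (!s.empty() && s.back() == '\0') {
        s.pop_back();
    }
    assert(s.empty() || s.back() != '\0');  // no explicit NUL remains at the end
    return s;
}
```

Fixing the length arguments as shown above is still the cleaner solution, since it avoids creating the extra character in the first place.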