String is corrupted after being transcoded

#include <bits/stdc++.h>
#include <iostream>
#include <regex>
#include <string>
#include <Windows.h>

// GBK to UTF-8
std::string GBKToUTF8(const std::string& gbkStr) {
    // 1. Convert GBK to wide characters (UTF-16) first
    int len = MultiByteToWideChar(CP_ACP, 0, gbkStr.c_str(), -1, nullptr, 0);
    std::wstring wstr(len, 0);
    MultiByteToWideChar(CP_ACP, 0, gbkStr.c_str(), -1, &wstr[0], len);

    // 2. Convert wide characters (UTF-16) to UTF-8
    len = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, nullptr, 0, nullptr, nullptr);
    std::string utf8Str(len, 0);
    WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, &utf8Str[0], len, nullptr, nullptr);

    return utf8Str;
}

int main() {
    // Example ID number, length 18
    std::string id_number = GBKToUTF8("610702199404261983");

    // Check string length
    std::cout << "Length before: " << id_number.length() << "\n" << id_number << std::endl;

    // Regular expression
    const std::regex id_number_pattern18("^([1-6][1-9]|50)\\d{4}(18|19|20)\\d{2}((0[1-9])|10|11|12)(([0-2][1-9])|10|20|30|31)\\d{3}[0-9Xx]$");

    // Make a match
    if (std::regex_match(id_number, id_number_pattern18)) {
        std::cout << "Match successful!" << std::endl;
    } else {
        std::cout << "Match failed!" << std::endl;
    }

    return 0;
}

The problem is that you are passing -1 as the input length, which asks MultiByteToWideChar() and WideCharToMultiByte() to process the NUL terminator as well and to count it in the length that they return:

[in] cbMultiByte

Size, in bytes, of the string indicated by the lpMultiByteStr parameter. Alternatively, this parameter can be set to -1 if the string is null-terminated. Note that, if cbMultiByte is 0, the function fails.

If this parameter is -1, the function processes the entire input string, including the terminating null character. Therefore, the resulting Unicode string has a terminating null character, and the length returned by the function includes this character.

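You are including that extra element when allocating the std::wstring and std::string, so the terminator is copied into each string as real content. Unlike C strings, C++ strings track their length explicitly instead of relying on a terminator: they can contain embedded NUL characters, which ARE included in their size, and they have an implicit NUL terminator, which is NOT included in their size. As a result, the string you get back holds 19 characters (the 18 digits plus a trailing '\0'), and std::regex_match() fails because the pattern has to match the entire string.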

So, you should not treat the C++ strings as being null-terminated. Do not ask the API to account for a NUL terminator. Pass the actual string sizes instead, e.g.:

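std::string GBKToUTF8(const std::string& gbkStr) {
    // MultiByteToWideChar() fails when cbMultiByte is 0, so handle empty input up front.
    if (gbkStr.empty()) return std::string();

    // 1. Convert GBK to wide characters (UTF-16) first.
    //    Note: CP_ACP is the system's ANSI code page; it is GBK (936) only on
    //    systems configured for Simplified Chinese.
    int len = MultiByteToWideChar(CP_ACP, 0, gbkStr.c_str(), static_cast<int>(gbkStr.size()), nullptr, 0);
                                                          // ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    std::wstring wstr(len, 0);
    MultiByteToWideChar(CP_ACP, 0, gbkStr.c_str(), static_cast<int>(gbkStr.size()), &wstr[0], len);
                                                // ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    // 2. Convert wide characters (UTF-16) to UTF-8.
    len = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), static_cast<int>(wstr.size()), nullptr, 0, nullptr, nullptr);
                                                     // ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    std::string utf8Str(len, 0);
    WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), static_cast<int>(wstr.size()), &utf8Str[0], len, nullptr, nullptr);
                                               // ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    return utf8Str;
}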
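
If you want to see the effect without touching the Windows API, here is a minimal, portable sketch. The helper names (with_copied_terminator, strip_trailing_nuls, is_18_chars) are hypothetical and not part of your code or of any library, and the pattern is deliberately simplified to "17 digits plus a digit or X". It only imitates what the -1 variant leaves in the std::string: the text followed by a NUL that counts toward size().

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: imitates the result of the -1 variant, where the
// terminator is copied into the string and becomes part of its contents.
std::string with_copied_terminator(const std::string& s) {
    std::string out = s;
    out.push_back('\0');   // embedded NUL, counted by size()
    return out;
}

// Hypothetical helper: removes NUL characters from the end of a string.
std::string strip_trailing_nuls(std::string s) {
    while (!s.empty() && s.back() == '\0') {
        s.pop_back();
    }
    return s;
}

// Simplified stand-in for the ID pattern: 17 digits, then a digit or X/x.
// std::regex_match() succeeds only if the whole string matches.
bool is_18_chars(const std::string& s) {
    static const std::regex pattern("\\d{17}[0-9Xx]");
    return std::regex_match(s, pattern);
}
```

Calling is_18_chars(with_copied_terminator("610702199404261983")) returns false, because the string has 19 elements and the last one is '\0'. After strip_trailing_nuls() the size is 18 again and the match succeeds. Stripping is only a workaround for strings you have already received, though; passing the real sizes to the API, as shown above, avoids the extra NUL in the first place.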
