String is corrupted after being transcoded

To illustrate the problem, here is a minimal reproduction:

#include <bits/stdc++.h>
#include <iostream>
#include <regex>
#include <string>
#include <Windows.h>

// Convert GBK to UTF-8
std::string GBKToUTF8(const std::string& gbkStr) {
    // 1. Convert GBK to wide characters (UTF-16) first
    int len = MultiByteToWideChar(CP_ACP, 0, gbkStr.c_str(), -1, nullptr, 0);
    std::wstring wstr(len, 0);
    MultiByteToWideChar(CP_ACP, 0, gbkStr.c_str(), -1, &wstr[0], len);

    // 2. Convert wide characters (UTF-16) to UTF-8
    len = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, nullptr, 0, nullptr, nullptr);
    std::string utf8Str(len, 0);
    WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, &utf8Str[0], len, nullptr, nullptr);

    return utf8Str;
}

int main() {
    // Example ID number, length 18
    std::string id_number = GBKToUTF8("610702199404261983");
    // Check string length
    std::cout << "Length before: " << id_number.length() << "\n"
        << id_number << std::endl;

    // Regular expression
    const std::regex id_number_pattern18("^([1-6][1-9]|50)\\d{4}(18|19|20)\\d{2}((0[1-9])|10|11|12)(([0-2][1-9])|10|20|30|31)\\d{3}[0-9Xx]$");

    // Attempt the match
    if (std::regex_match(id_number, id_number_pattern18)) {
        std::cout << "Match successful!" << std::endl;
    } else {
        std::cout << "Match failed!" << std::endl;
    }

    return 0;
}

The problem is that after the id_number string is transcoded to UTF-8, its length changes from 18 to 19. The regex also no longer matches the string (it matches correctly when the string is not transcoded).

I suspect that transcoding added some invisible characters to the string, but I don't know how to fix this.

Here are some screenshots from a VS2022 (ISO C++17) debugging session for reference (they are not taken from the minimal reproduction above, but they should be easy to follow):

[Screenshot: before transcoding]

[Screenshot: after transcoding]

I don't know how to fix this at the moment, so I would appreciate a solution and an explanation of how the problem arises.

Bronnie answered 15/8 at 14:33 Comment(11)
Should probably be std::string utf8Str(len - 1, 0); as len seems to count the final '\0' of the C-string.Thalweg
You're requesting the APIs to convert the final NUL character in your strings, hence your output strings now contain two instead of just one NUL terminators, one provided by the std::[w]string and one in the controlled sequence. The solution is simple: Pass size() instead of -1 as the length. That's also less costly.Affrica
From the documentation: "If this parameter is -1, the function processes the entire input string, including the terminating null character. Therefore, the resulting character string has a terminating null character, and the length returned by the function includes this character."Eyestrain
I am unsure of the logic here. Why are you using the ANSI Code Page?Airline
@Dúthomhas The source string is in CP 936, which for this app is the default 8-bit code page.Eyestrain
@RaymondChen Yes, but the function name says that it takes a Simplified Chinese GBK string.Airline
@Dúthomhas I think we're agreeing. Simplified Chinese GBK is an 8-bit encoding, which Windows calls CP 936.Eyestrain
Unrelated note: Seeing #include <bits/stdc++.h> mixed in with includes of other Standard library headers suggests you may not know what #include <bits/stdc++.h> does. Here is a bit of reading on the subject along with reasons why you should avoid using that header.Gerdes
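I'd suggest not casually posting ID numbers online. For an example, any arbitrary string would do and reproduces the problem just the same.Whitebait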
It's not a "minimal" piece of code. Actually you only need cout << GBKToUTF8("").size()Whitebait
@Whitebait Thanks for the kind reminder, but I generated this ID number randomly with code, so it should have no real-world significance.Bronnie

The problem is that, by passing -1 as the input length, you are asking MultiByteToWideChar() and WideCharToMultiByte() to process the NUL terminator and to count it in the length that they return:

[in] cbMultiByte

Size, in bytes, of the string indicated by the lpMultiByteStr parameter. Alternatively, this parameter can be set to -1 if the string is null-terminated. Note that, if cbMultiByte is 0, the function fails.

If this parameter is -1, the function processes the entire input string, including the terminating null character. Therefore, the resulting Unicode string has a terminating null character, and the length returned by the function includes this character.

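You are including that extra character when sizing the std::wstring and std::string. But, unlike C strings, C++ strings do not rely on a NUL terminator to mark their end: they track their size explicitly. They can contain embedded NUL characters, which ARE included in their size, and they have an implicit NUL terminator, which is NOT included in their size. The terminator written by the API therefore becomes the last character of each string, which is why the length is 19 instead of 18 and why the regex no longer matches.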
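
The effect described above can be reproduced without the Windows API. The following is a minimal portable sketch, not the asker's code: makeWithEmbeddedTerminator and isAllDigits are hypothetical helpers introduced here, and the simplified digits-only regex stands in for the ID pattern from the question.

```cpp
#include <cassert>
#include <cstring>
#include <regex>
#include <string>

// Mimics what the question's code ends up with: the buffer is sized to
// include the terminator, so the '\0' becomes part of the string contents.
// (Hypothetical helper for illustration only.)
std::string makeWithEmbeddedTerminator(const char* text) {
    const std::size_t lenWithNul = std::strlen(text) + 1; // counts the '\0'
    std::string s(lenWithNul, '\0');
    std::memcpy(&s[0], text, lenWithNul);
    assert(s.size() == lenWithNul);
    return s;
}

// True only if every character of s, including any embedded '\0', is a digit.
bool isAllDigits(const std::string& s) {
    static const std::regex digits("^\\d+$");
    return std::regex_match(s, digits);
}
```

For the ID number in the question, makeWithEmbeddedTerminator returns a string whose size() is 19 while std::strlen on its c_str() still reports 18, and isAllDigits rejects it because std::regex_match has to match the trailing '\0' as well. The same text stored in a plain std::string is accepted.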

So, you should not treat the C++ strings as being null-terminated. Do not ask the API for space for a NUL terminator. Use the actual string sizes instead, e.g.:

std::string GBKToUTF8(const std::string& gbkStr) {
    // 1. Convert GBK to wide characters (UTF-16) first
    int len = MultiByteToWideChar(CP_ACP, 0, gbkStr.c_str(), gbkStr.size(), nullptr, 0);
                                                          // ^^^^^^^^^^^^^
    std::wstring wstr(len, 0);
    MultiByteToWideChar(CP_ACP, 0, gbkStr.c_str(), gbkStr.size(), &wstr[0], len);
                                                // ^^^^^^^^^^^^^

    // 2. Convert wide characters (UTF-16) to UTF-8
    len = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), wstr.size(), nullptr, 0, nullptr, nullptr);
                                                     // ^^^^^^^^^^^
    std::string utf8Str(len, 0);
    WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), wstr.size(), &utf8Str[0], len, nullptr, nullptr);
                                               // ^^^^^^^^^^^

    return utf8Str;
}
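
To check whether a converted string still carries a stray terminator, it can help to look at its raw bytes, since a '\0' is invisible when printed. This is a minimal portable sketch and not part of the fix itself; hexDump is a hypothetical helper name introduced here for illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Renders every byte of s, including any '\0', as two uppercase hex digits
// separated by single spaces. (Hypothetical helper, not a Windows API.)
std::string hexDump(const std::string& s) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < s.size(); ++i) {
        const unsigned byte = static_cast<unsigned char>(s[i]);
        std::snprintf(buf, sizeof buf, "%02X", byte);
        if (i != 0) out += ' ';
        out += buf;
    }
    // Sanity check: two hex digits per byte plus the separators between them.
    assert(out.size() == (s.empty() ? 0 : s.size() * 3 - 1));
    return out;
}
```

If the dump of a converted string ends in 00, a terminator was copied into the string contents. With the size-based calls, the dump of an ASCII-only input should end with the code of its last character.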
Fluke answered 15/8 at 14:43 Comment(3)
The "alternative" is the only correct solution.Affrica
@Affrica agreed. I have removed the other exampleFluke
That's better, thank you. Since C++ strings can contain embedded NUL characters there's no reason to avoid supporting this. While doing that is generally not a good idea, NUL characters can wind up in the controlled sequence by accident (as illustrated in the question). By supporting embedded NUL characters the implementation no longer masks bugs in other parts of the code.Affrica
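
The difference between terminator-based and size-based handling can be sketched in portable C++. This is only an illustration under the assumption that the data already sits in a std::string; copyViaCStr and copyViaSize are hypothetical names introduced here.

```cpp
#include <cassert>
#include <string>

// Copies through a C-string view: the copy stops at the first '\0', so
// anything after an embedded NUL is silently dropped.
std::string copyViaCStr(const std::string& s) {
    return std::string(s.c_str());
}

// Copies using the explicit size: embedded NULs are kept as ordinary bytes.
std::string copyViaSize(const std::string& s) {
    std::string copy(s.data(), s.size());
    assert(copy.size() == s.size());
    return copy;
}
```

For a five-character string holding "ab", a NUL, and "cd", copyViaCStr yields only "ab", whereas copyViaSize returns all five characters. The first variant hides the unexpected NUL; the second keeps it visible to later code, which is the behavior this comment argues for.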
