Get SHA1 of Unicode string in Crypto++
Asked Answered
S

3

6

I study C++ independently and I have one problem, which I can't solve more than week. I hope you can help me.

I need to get a SHA1 digest of a Unicode string (like Привет), but I don't know how to do that.

I tried to do it like this, but it returns a wrong digest!

For wstring('Ы') It returns - A469A61DF29A7568A6CC63318EA8741FA1CF2A7
I need - 8dbe718ab1e0c4d75f7ab50fc9a53ec4f0528373

Regards and sorry for my English :).

CryptoPP 5.6.2 MVC++ 2013

#include <iostream>
#include "cryptopp562\cryptlib.h"
#include "cryptopp562\sha.h"
#include "cryptopp562\hex.h"

int main() {

    std::wstring string(L"Ы");
    int bs_size = (int)string.length() * sizeof(wchar_t);

    byte* bytes_string = new byte[bs_size];

    int n = 0; //real bytes count
    for (int i = 0; i < string.length(); i++) {
        wchar_t wcharacter = string[i];

        int high_byte = wcharacter & 0xFF00;

        high_byte = high_byte >> 8;

        int low_byte = wcharacter & 0xFF;

        if (high_byte != 0) {
            bytes_string[n++] = (byte)high_byte;
        }

        bytes_string[n++] = (byte)low_byte;
    }

    CryptoPP::SHA1 sha1;
    std::string hash;

    CryptoPP::StringSource ss(bytes_string, n, true,
        new CryptoPP::HashFilter(sha1,
            new CryptoPP::HexEncoder(
                new CryptoPP::StringSink(hash)
            ) 
        ) 
    );

    std::cout << hash << std::endl;

    return 0;
}
Stilbestrol answered 20/4, 2015 at 18:8 Comment(1)
"I tried to do it like this, but it returns a wrong digest!" - The Crypto++ code looks good, so the problem probably lies elsewhere. What digest is it producing, and what digest do you expect? I suspect you need a digest of the wide string converted to UTF-8. UTF-8 is most interoperable. Add the expected and actual digest to your question by clicking Edit (and don't post it as a comment).Slavic
S
3

I need to get a SHA1 digest of a Unicode string (like Привет), but I don't know how to do that.

The trick here is you need to know how to encode the Unicode string. On Windows, a wchar_t is 2 octets; while on Linux a wchar_t is 4 otects. There's a Crypto++ wiki page on it at Character Set Considerations, but its not that good.

To interoperate most effectively, always use UTF-8. That means you convert UTF-16 or UTF-32 to UTF-8. Because you are on Windows, you will want to call WideCharToMultiByte function to convert it using CP_UTF8. If you were on Linux, then you would use libiconv.

Crypto++ has a built-in function called StringNarrow that uses C++. Its in the file misc.h. Be sure to call setlocale before using it.

Stack Overflow has a few question on using the Windows function . See, for example, How do you properly use WideCharToMultiByte.


I need - 8dbe718ab1e0c4d75f7ab50fc9a53ec4f0528373

What is the hash (SHA-1, SHA-256, ...)? Is it a HMAC (keyed hash)? Is the information salted (like a password in storage)? How is it encoded? I have to ask because I cannot reproduce your desired results:

SHA-1:   2805AE8E7E12F182135F92FB90843BB1080D3BE8
SHA-224: 891CFB544EB6F3C212190705F7229D91DB6CECD4718EA65E0FA1B112
SHA-256: DD679C0B9FD408A04148AA7D30C9DF393F67B7227F65693FFFE0ED6D0F0ADE59
SHA-384: 0D83489095F455E4EF5186F2B071AB28E0D06132ABC9050B683DA28A463697AD
         1195FF77F050F20AFBD3D5101DF18C0D
SHA-512: 0F9F88EE4FA40D2135F98B839F601F227B4710F00C8BC48FDE78FF3333BD17E4
         1D80AF9FE6FD68515A5F5F91E83E87DE3C33F899661066B638DB505C9CC0153D

Here's the program I used. Be sure to specify the length of the wide string. If you don't (and use -1 for the length), then WideCharToMultiByte will include the terminating ASCII-Z in its calculations. Since we are using a std::string, we don't need the function to include the ASCII-Z terminator.

int main(int argc, char* argv[])
{
    wstring m1 = L"Привет"; string m2;

    int req = WideCharToMultiByte(CP_UTF8, 0, m1.c_str(), (int)m1.length(), NULL, 0, NULL, NULL);
    if(req < 0 || req == 0)
        throw runtime_error("Failed to convert string");

    m2.resize((size_t)req);

    int cch = WideCharToMultiByte(CP_UTF8, 0, m1.c_str(), (int)m1.length(), &m2[0], (int)m2.length(), NULL, NULL);
    if(cch < 0 || cch == 0)
        throw runtime_error("Failed to convert string");

    // Should not be required
    m2.resize((size_t)cch);

    string s1, s2, s3, s4, s5;
    SHA1 sha1; SHA224 sha224; SHA256 sha256; SHA384 sha384; SHA512 sha512;

    HashFilter f1(sha1, new HexEncoder(new StringSink(s1)));
    HashFilter f2(sha224, new HexEncoder(new StringSink(s2)));
    HashFilter f3(sha256, new HexEncoder(new StringSink(s3)));
    HashFilter f4(sha384, new HexEncoder(new StringSink(s4)));
    HashFilter f5(sha512, new HexEncoder(new StringSink(s5)));

    ChannelSwitch cs;
    cs.AddDefaultRoute(f1);
    cs.AddDefaultRoute(f2);
    cs.AddDefaultRoute(f3);
    cs.AddDefaultRoute(f4);
    cs.AddDefaultRoute(f5);

    StringSource ss(m2, true /*pumpAll*/, new Redirector(cs));

    cout << "SHA-1:   " << s1 << endl;
    cout << "SHA-224: " << s2 << endl;
    cout << "SHA-256: " << s3 << endl;
    cout << "SHA-384: " << s4 << endl;
    cout << "SHA-512: " << s5 << endl;

    return 0;
}
Slavic answered 21/4, 2015 at 18:38 Comment(0)
E
3

You say ‘but it returns wrong digest’ – what are you comparing it with?

Key point: digests such as SHA-1 don't work with sequences of characters, but with sequences of bytes.

What you're doing in this snippet of code is generating an ad-hoc encoding of the unicode characters in the string "Ы". This encoding will (as it turns out) match the UTF-16 encoding if the characters in the string are all in the BMP (‘basic multilingual plane’, which is true in this case) and if the numbers that end up in wcharacter are integers representing unicode codepoints (which is sort-of probably correct, but not, I think, guaranteed).

If the digest you're comparing it with turns an input string into an sequence of bytes using the UTF-8 encoding (which is quite likely), then that will produce a different byte sequence from yours, so that the SHA-1 digest of that sequence will be different from the digest you calculate here.

So:

  • Check what encoding your test string is using.

  • You'd be best off using some library functions to specifically generate a UTF-16 or UTF-8 (as appropriate) encoding of the string you want to process, to ensure that the byte sequence you're working with is what you think it is.

There's an excellent introduction to unicode and encodings in the aptly-named document The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Exaltation answered 20/4, 2015 at 18:48 Comment(0)
S
3

I need to get a SHA1 digest of a Unicode string (like Привет), but I don't know how to do that.

The trick here is you need to know how to encode the Unicode string. On Windows, a wchar_t is 2 octets; while on Linux a wchar_t is 4 otects. There's a Crypto++ wiki page on it at Character Set Considerations, but its not that good.

To interoperate most effectively, always use UTF-8. That means you convert UTF-16 or UTF-32 to UTF-8. Because you are on Windows, you will want to call WideCharToMultiByte function to convert it using CP_UTF8. If you were on Linux, then you would use libiconv.

Crypto++ has a built-in function called StringNarrow that uses C++. Its in the file misc.h. Be sure to call setlocale before using it.

Stack Overflow has a few question on using the Windows function . See, for example, How do you properly use WideCharToMultiByte.


I need - 8dbe718ab1e0c4d75f7ab50fc9a53ec4f0528373

What is the hash (SHA-1, SHA-256, ...)? Is it a HMAC (keyed hash)? Is the information salted (like a password in storage)? How is it encoded? I have to ask because I cannot reproduce your desired results:

SHA-1:   2805AE8E7E12F182135F92FB90843BB1080D3BE8
SHA-224: 891CFB544EB6F3C212190705F7229D91DB6CECD4718EA65E0FA1B112
SHA-256: DD679C0B9FD408A04148AA7D30C9DF393F67B7227F65693FFFE0ED6D0F0ADE59
SHA-384: 0D83489095F455E4EF5186F2B071AB28E0D06132ABC9050B683DA28A463697AD
         1195FF77F050F20AFBD3D5101DF18C0D
SHA-512: 0F9F88EE4FA40D2135F98B839F601F227B4710F00C8BC48FDE78FF3333BD17E4
         1D80AF9FE6FD68515A5F5F91E83E87DE3C33F899661066B638DB505C9CC0153D

Here's the program I used. Be sure to specify the length of the wide string. If you don't (and use -1 for the length), then WideCharToMultiByte will include the terminating ASCII-Z in its calculations. Since we are using a std::string, we don't need the function to include the ASCII-Z terminator.

int main(int argc, char* argv[])
{
    wstring m1 = L"Привет"; string m2;

    int req = WideCharToMultiByte(CP_UTF8, 0, m1.c_str(), (int)m1.length(), NULL, 0, NULL, NULL);
    if(req < 0 || req == 0)
        throw runtime_error("Failed to convert string");

    m2.resize((size_t)req);

    int cch = WideCharToMultiByte(CP_UTF8, 0, m1.c_str(), (int)m1.length(), &m2[0], (int)m2.length(), NULL, NULL);
    if(cch < 0 || cch == 0)
        throw runtime_error("Failed to convert string");

    // Should not be required
    m2.resize((size_t)cch);

    string s1, s2, s3, s4, s5;
    SHA1 sha1; SHA224 sha224; SHA256 sha256; SHA384 sha384; SHA512 sha512;

    HashFilter f1(sha1, new HexEncoder(new StringSink(s1)));
    HashFilter f2(sha224, new HexEncoder(new StringSink(s2)));
    HashFilter f3(sha256, new HexEncoder(new StringSink(s3)));
    HashFilter f4(sha384, new HexEncoder(new StringSink(s4)));
    HashFilter f5(sha512, new HexEncoder(new StringSink(s5)));

    ChannelSwitch cs;
    cs.AddDefaultRoute(f1);
    cs.AddDefaultRoute(f2);
    cs.AddDefaultRoute(f3);
    cs.AddDefaultRoute(f4);
    cs.AddDefaultRoute(f5);

    StringSource ss(m2, true /*pumpAll*/, new Redirector(cs));

    cout << "SHA-1:   " << s1 << endl;
    cout << "SHA-224: " << s2 << endl;
    cout << "SHA-256: " << s3 << endl;
    cout << "SHA-384: " << s4 << endl;
    cout << "SHA-512: " << s5 << endl;

    return 0;
}
Slavic answered 21/4, 2015 at 18:38 Comment(0)
C
2

This seems to work fine for me.

Rather than fiddling about trying to extract the pieces I simply cast the wide character buffer to a const byte* and pass that (and the adjusted size) to the hash function.

int main() {

    std::wstring string(L"Привет");

    CryptoPP::SHA1 sha1;
    std::string hash;

    CryptoPP::StringSource ss(
        reinterpret_cast<const byte*>(string.c_str()), // cast to const byte*
        string.size() * sizeof(std::wstring::value_type), // adjust for size
        true,
        new CryptoPP::HashFilter(sha1,
            new CryptoPP::HexEncoder(
                new CryptoPP::StringSink(hash)
            )
        )
    );

    std::cout << hash << std::endl;

    return 0;
}

Output:

C6F8291E68E478DD5BD1BC2EC2A7B7FC0CEE1420

EDIT: To add.

The result is going to be encoding dependant. For example I ran this on Linux where wchar_t is 4 bytes. On Windows I believe wchar_t may be only 2 bytes.

For consistency it may be better to use UTF8 a store the text in a normal std::string. This also makes calling the API simpler:

int main() {

    std::string string("Привет"); // UTF-8 encoded

    CryptoPP::SHA1 sha1;
    std::string hash;

    CryptoPP::StringSource ss(
        string,
        true,
        new CryptoPP::HashFilter(sha1,
            new CryptoPP::HexEncoder(
                new CryptoPP::StringSink(hash)
            )
        )
    );

    std::cout << hash << std::endl;

    return 0;
}

Output:

2805AE8E7E12F182135F92FB90843BB1080D3BE8
Coe answered 20/4, 2015 at 18:37 Comment(4)
For me it produces digest AD5EF6AFD4BADE078F3E19FAE5E45A43635A18CB. May be problem with my IDE or project settings? Tell me please, what ide and project settings are you have?Stilbestrol
@DmitryAurokk I am using eclipse.com IDE on Linux. I think the poblem is likely to be the encoding. On Linux wchar_t is 4 bytes (UTF32 compatable). I think on Windows wchar_t is 2 bytes? (UTF16 ish). What I did was tested the program output against the command line program sha1sum and it gives the same result. If you want consistant results accross platforms then it may be better to use UTF8 and forget about wide strings?Coe
@DmitryAurokk I have added a new example using UTF-8 encoding (need a UTF-8 compatible text editor)Coe
@DmitryAurokk That's odd. Just to be sure can you explicitly assign the UTF-8 character codes to the variable to ruleout text file encoding? Replace the line that creates the string with this: std::string string("\u041F\u0440\u0438\u0432\u0435\u0442"); Values taken from this table here.Coe

© 2022 - 2024 — McMap. All rights reserved.