C++: how to convert ASCII or ANSI to UTF-8 and store it in a std::string

My company uses some code like this:

    std::string(CT2CA(some_CString)).c_str()

which I believe converts a Unicode string (of type CString) into ANSI encoding. The resulting string is used as an email's subject. However, the email's header (which includes the subject) tells the mail client to decode it as Unicode (this is how the original code does it). As a result, German characters like "ä ö ü" are not displayed properly in the subject.

Is there any way I can get this header back into UTF-8 and store it in a std::string or const char*?

I know there are a lot of smarter ways to do this, but I need to keep the code close to the original (i.e. send the header as a std::string or const char*).

Thanks in advance.

Upbringing answered 28/11, 2013 at 22:46 Comment(2)

you'll probably want std::wstring – Eudemon 28/11, 2013 at 22:47

There are no precooked macros that convert to utf8. Just create your own, call WideCharToMultiByte() with CP_UTF8. – Reims 28/11, 2013 at 22:58
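For readers who want to see what a wide-to-UTF-8 conversion actually produces, here is a minimal portable sketch. On Windows a Unicode CString holds UTF-16, and WideCharToMultiByte() with CP_UTF8 performs this step for you; the code below only illustrates the byte layout. The names append_utf8 and utf16_to_utf8 are hypothetical, std::u16string stands in for the CString contents, and replacing unpaired surrogates with U+FFFD is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Appends one Unicode code point to `out` as UTF-8 (1 to 4 bytes).
void append_utf8(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);  // largest valid Unicode code point
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: UTF-16 in, UTF-8 out.
// Assumption: unpaired surrogates are replaced with U+FFFD.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                // Combine the surrogate pair into one code point.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {  // stray low surrogate
            cp = 0xFFFD;
        }
        append_utf8(out, cp);
    }
    return out;
}
```

The result is an ordinary std::string, so its c_str() can be passed on as a const char* the same way the original code does.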

This sounds like a plain conversion from one encoding to another. You can use std::codecvt<char, char, mbstate_t> for this; whether your implementation ships with a suitable conversion, however, I don't know. From the sounds of it, you are just trying to convert ISO-Latin-1 into UTF-8. That should be pretty much trivial: the first 128 characters (0 to 127) map identically to UTF-8, and the second half conveniently maps to the Unicode code points with the same values, i.e., you just need to encode the corresponding value in UTF-8. Each character in the second half is replaced by two bytes. That is, I think the conversion is something like this:

#include <stdexcept>

// Takes the next position and the end of a buffer as first two arguments and the
// character to convert from ISO-Latin-1 as third argument.
// Returns a pointer to the end of the produced sequence.
char* iso_latin_1_to_utf8(char* buffer, char* end, unsigned char c) {
    if (c < 128) {
        // ASCII range: copied unchanged as a single byte.
        if (buffer == end) { throw std::runtime_error("out of space"); }
        *buffer++ = static_cast<char>(c);
    }
    else {
        // 128 to 255: lead byte 110xxxxx followed by continuation byte 10xxxxxx.
        if (end - buffer < 2) { throw std::runtime_error("out of space"); }
        *buffer++ = static_cast<char>(0xC0 | (c >> 6));
        *buffer++ = static_cast<char>(0x80 | (c & 0x3f));
    }
    return buffer;
}
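To show how the per-character function might be applied to a whole string, here is a minimal sketch. The helper name latin1_string_to_utf8 is hypothetical, the encoder is repeated only so that the example compiles on its own, and C++11 or later is assumed so that &out[0] is valid even for an empty string.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Same per-character encoder as above, repeated so this sketch is self-contained.
char* iso_latin_1_to_utf8(char* buffer, char* end, unsigned char c) {
    if (c < 128) {
        if (buffer == end) { throw std::runtime_error("out of space"); }
        *buffer++ = static_cast<char>(c);
    }
    else {
        if (end - buffer < 2) { throw std::runtime_error("out of space"); }
        *buffer++ = static_cast<char>(0xC0 | (c >> 6));
        *buffer++ = static_cast<char>(0x80 | (c & 0x3f));
    }
    return buffer;
}

// Hypothetical helper (not part of the answer): converts a whole
// ISO-Latin-1 string to UTF-8.
std::string latin1_string_to_utf8(const std::string& in) {
    // Worst case: every input byte becomes two output bytes.
    std::string out(in.size() * 2, '\0');
    char* pos = &out[0];
    char* const end = pos + out.size();
    for (std::string::size_type i = 0; i != in.size(); ++i) {
        pos = iso_latin_1_to_utf8(pos, end, static_cast<unsigned char>(in[i]));
    }
    assert(pos <= end);
    // Trim the unused tail of the worst-case buffer.
    out.resize(static_cast<std::string::size_type>(pos - &out[0]));
    return out;
}
```

For example, the Latin-1 byte 0xE4 ("ä") has top two bits 11 and low six bits 100100, so it becomes the two bytes 0xC3 0xA4.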

Limy answered 28/11, 2013 at 23:1 Comment(2)

Be careful, under Windows, ANSI means CP-1252, and this charset looks like iso-latin1, but it differs... – Genus 31/5, 2019 at 9:53

@GenericAccountName: I made the change. Thanks for commenting - I think I was unaware. – Riband 24/3, 2020 at 22:54
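To illustrate the difference raised in the comments: CP-1252 agrees with ISO-Latin-1 everywhere except the bytes 0x80 to 0x9F, where it places printable characters such as the euro sign. The following is a minimal sketch of a CP-1252 to UTF-8 conversion using a lookup table for that range. The name cp1252_to_utf8 is hypothetical, and mapping the five bytes that CP-1252 leaves undefined to the C1 control code with the same value is an assumption of this sketch.

```cpp
#include <cassert>
#include <string>

// Unicode code points for CP-1252 bytes 0x80..0x9F.
// Assumption: the undefined bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D
// map to the C1 control code with the same value.
static const unsigned short cp1252_high[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

// Hypothetical helper: converts a CP-1252 string to UTF-8.
std::string cp1252_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 3);  // at most three bytes per input byte
    for (std::string::size_type i = 0; i != in.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        unsigned int cp = c;
        if (c >= 0x80 && c <= 0x9F) { cp = cp1252_high[c - 0x80]; }
        assert(cp < 0x10000);  // every CP-1252 character is in the BMP
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With a plain ISO-Latin-1 conversion the byte 0x80 would come out as the control character U+0080; here it becomes the euro sign (bytes 0xE2 0x82 0xAC), while "ä ö ü" are encoded the same way in both.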

Be careful: it's '|' and not '&'!

*buffer++ = 0xC0 | (c >> 6);
*buffer++ = 0x80 | (c & 0x3F);

Impervious answered 25/3, 2015 at 14:40 Comment(0)
