C++ substring multi byte characters

Asked 1/6, 2012 at 8:34 Answered 22/1, 2016 at 6:55

I have a std::string which contains some characters that span multiple bytes.

When I take a substring of this string, the output is not valid because, of course, these characters are counted as 2 characters. In my opinion I should be using a wstring instead, because it will store each of these characters as one element instead of several.

So I decided to copy the string into a wstring, but of course this does not make sense, because the characters remain split over 2 elements. This only makes it worse.

Is there a good solution for converting a string to a wstring, merging the special characters into 1 element instead of 2?

Thanks

Sole answered 1/6, 2012 at 8:34 Comment(6)

Here's a related question: https://mcmap.net/q/55151/-how-to-convert-wstring-into-string/1158895 – Die 1/6, 2012 at 8:36

what's the encoding of your string ? I assume UTF-8 – Faceoff 1/6, 2012 at 8:37

@SirDarius, UTF-8 indeed. But I think this question could serve for any encoding taking multiple bytes for one character, no? – Sole 1/6, 2012 at 8:45

of course, but it is important to know, because some encodings require wide strings with characters as large as 32 bits. You might want to use a library such as libiconv. – Faceoff 1/6, 2012 at 9:0

@W.Goeman: One important issue: std::wstring is unfortunately implementation dependent (16 bits wide characters on Windows, 32 bits wide on Linux), therefore it is not sufficient. – Donaugh 1/6, 2012 at 11:29

"UTF-8 indeed" in NFC or in NFD? – Codify 14/8, 2012 at 6:50

There are really only two possible solutions. If you're doing this a lot, over large distances, you'd be better off converting your characters to a single-element encoding, using wchar_t (or int32_t, or whatever is most appropriate). This is not a simple copy, which would convert each individual char into the target type, but a true conversion function, which would recognize the multibyte characters and convert each of them into a single element.
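As an illustration of what such a "true conversion function" might look like, here is a minimal sketch that decodes UTF-8 into a std::u32string (one char32_t per code point). The name utf8ToUtf32 is made up for this example, and the sketch assumes C++11; it rejects truncated sequences and bad continuation bytes, but it does not reject overlong encodings or surrogate code points, so it is not a full validator.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into one char32_t per code point.
// Sketch only: overlong encodings and surrogates are not rejected.
std::u32string utf8ToUtf32(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that must follow
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Once the text is in this form, substr() on the std::u32string counts code points rather than bytes.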

For occasional use or shorter sequences, it's possible to write your own functions for advancing n bytes. For UTF-8, I use the following:

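#include <stddef.h>
#include <iterator>

// Byte and byteCountTable are not shown in this answer; presumably Byte is an
// unsigned 8-bit type and byteCountTable maps a lead byte to its sequence length.
inline size_t
size(
    Byte                ch )
{
    return byteCountTable[ ch ] ;
}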

template< typename InputIterator >
InputIterator
succ(
    InputIterator       begin,
    size_t              size,
    std::random_access_iterator_tag )
{
    return begin + size ;
}

template< typename InputIterator >
InputIterator
succ(
    InputIterator       begin,
    size_t              size,
    std::input_iterator_tag )
{
    while ( size != 0 ) {
        ++ begin ;
        -- size ;
    }
    return begin ;
}

template< typename InputIterator >
InputIterator
succ(
    InputIterator       begin,
    InputIterator       end )
{
    if ( begin != end ) {
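        // Dispatch on the iterator category. The helpers above do not check
        // against end, so the input must be well-formed UTF-8.
        begin = succ( begin, size( *begin ),
                      typename std::iterator_traits< InputIterator >::iterator_category() ) ;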
    }
    return begin ;
}

template< typename InputIterator >
size_t
characterCount(
    InputIterator       begin,
    InputIterator       end )
{
    size_t              result = 0 ;
    while ( begin != end ) {
        ++ result ;
        begin = succ( begin, end ) ;
    }
    return result ;
}
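The snippet above uses Byte and byteCountTable without showing them. One possible definition is sketched below; this is a guess at what they could look like, not the answer's actual code. The helper utf8SequenceLength and the ByteCountTable wrapper are names invented for this example, and continuation or invalid bytes are mapped to length 1 purely so that a scan over malformed input keeps moving.

```cpp
#include <cstddef>

// Assumption: Byte is an unsigned 8-bit type.
typedef unsigned char Byte;

// Hypothetical helper: length of the UTF-8 sequence introduced by a lead byte.
inline std::size_t utf8SequenceLength(Byte lead)
{
    if (lead < 0x80)           return 1;  // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                             // continuation or invalid byte
}

// 256-entry lookup table, filled once at start-up.
struct ByteCountTable {
    std::size_t lengths[256];
    ByteCountTable()
    {
        for (int i = 0; i < 256; ++i)
            lengths[i] = utf8SequenceLength(static_cast<Byte>(i));
    }
    std::size_t operator[](Byte b) const { return lengths[b]; }
};

static const ByteCountTable byteCountTable;
```

With these two definitions placed before the functions above, byteCountTable[ ch ] yields how many bytes to skip to reach the next character.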

Gallinule answered 1/6, 2012 at 9:53 Comment(1)

It should be noted that wchar_t is only 16-bit on many platforms, so it's incapable of representing many codepoints in only a single element. Instead, char32_t is a type provided by C++11 that is fixed-width and sufficient to represent all of Unicode in a single element. – Mistaken 2/8, 2017 at 18:0

A simpler version, based on the solution provided in "Getting the actual length of a UTF-8 encoded std::string?" by Marcelo Cantos:

std::string substr(std::string originalString, int maxLength)
{
    std::string resultString = originalString;

    int len = 0;
    int byteCount = 0;

    const char* aStr = originalString.c_str();

    while(*aStr)
    {
        if( (*aStr & 0xc0) != 0x80 )
            len += 1;

        if(len>maxLength)
        {
            resultString = resultString.substr(0, byteCount);
            break;
        }
        byteCount++;
        aStr++;
    }

    return resultString;
}

Leialeibman answered 14/8, 2012 at 6:37 Comment(1)

Unfortunately, this solution is only partially correct. If you look at my answer you will see that you are successfully taking care of 1. (not cutting in the middle of a codepoint) however you are failing at 2. (not separating a codepoint from its diacritics) and 3. (not cutting in the middle of a semantic character, such as LL in Spanish). The latter two are rarer cases, certainly, but... well, dealing correctly with edge cases is necessary. – Donaugh 12/9, 2016 at 10:43

A std::string object is not a string of characters, it's a string of bytes. It has no notion of what's called "encoding" at all. Same goes for std::wstring, except that it's a string of wchar_t values (16 bits wide on Windows, typically 32 bits wide on Linux).

In order to perform operations on your text which require addressing distinct characters (as is the case when you want to take the substring, for instance) you need to know what encoding is used for your std::string object.
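To make the byte-versus-character distinction concrete, here is a small sketch, assuming the string holds well-formed UTF-8. The function name countCodePoints is made up for this example; it counts every byte that is not a continuation byte (10xxxxxx).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of code points in a well-formed UTF-8 string.
std::size_t countCodePoints(const std::string& s)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Continuation bytes look like 10xxxxxx; every other byte starts a code point.
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

For the UTF-8 encoding of "naïve" (written as "na\xC3\xAFve"), size() reports 6 bytes while countCodePoints reports 5, and substr(0, 3) cuts the two-byte "ï" in half.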

UPDATE: Now that you clarified that your input string is UTF-8 encoded, you still need to decide on an encoding to use for your output std::wstring. UTF-16 comes to mind, but it really depends on what the API to which you will pass the std::wstring objects expects. Assuming that UTF-16 is acceptable you have various choices:

On Windows, you can use the MultiByteToWideChar function; no extra dependencies required.
The UTF8-CPP library claims to provide a lightweight solution for dealing with UTF-* encoded strings. Never tried it myself, but I keep hearing good things about it.
On Linux systems, using the libiconv library is quite common.
If you need to deal with all sorts of crazy encodings and want the full-blown alpha-and-omega word as far as encodings go, look at ICU.
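None of the libraries listed above is demonstrated here. As a dependency-free illustration of what choosing UTF-16 as the target implies, the following sketch encodes already-decoded code points into UTF-16. The function name utf32ToUtf16 is made up for this example and C++11 is assumed. Note that code points above U+FFFF take two 16-bit elements (a surrogate pair), so even UTF-16 is not strictly one element per character.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: encode Unicode code points as UTF-16 code units.
std::u16string utf32ToUtf16(const std::u32string& in)
{
    std::u16string out;
    for (char32_t cp : in) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");
        if (cp < 0x10000) {
            // Basic Multilingual Plane: one 16-bit unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Supplementary planes: split the 20 payload bits into a surrogate pair.
            const char32_t v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}
```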

Craftsman answered 1/6, 2012 at 8:38 Comment(3)

std::wstring is a string of wchar_t which may be 16 bits, or 32 bits. – Dowitcher 1/6, 2012 at 8:40

I am aware of that, and I do know my encoding. The question is how to do the transformation using that encoding. – Sole 1/6, 2012 at 8:47

@W.Goeman: I now updated my answer with some suggestions how to convert UTF-8 to some other encoding (even with std::wstring you still need to make up your mind what encoding to use). – Craftsman 5/6, 2012 at 6:46

Unicode is hard.

1) std::wstring is not a list of codepoints, it's a list of wchar_t, and their width is implementation-defined (commonly 16 bits with VC++ and 32 bits with gcc and clang). Yes, it means it's useless for portable code...
2) A single character may be encoded on several code points (because of diacritics)
3) In some languages, two different characters together form a "unit" that is not really separable (for example, LL is considered a letter on its own in Spanish).

So... it's a bit hard.

Solving 3) may be costly (it requires specific language/usage annotations); solving 1) and 2) is absolutely necessary... and requires Unicode aware libraries or coding your own (and probably getting it wrong).

1) is trivially solved: writing a routine transforming from UTF-8 to CodePoint is trivial (a CodePoint can be represented with a uint32_t)
2) is more difficult, it requires a list of diacritics and the subroutine must know never to cut prior to a diacritic (they follow the character they qualify)
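The "never cut prior to a diacritic" rule from 2) can be sketched on an already-decoded sequence of code points. The names isCombiningMark and prefixByCharacters are invented for this example, and the diacritic test is a loud simplification: it only knows the Combining Diacritical Marks block (U+0300 to U+036F), whereas a real implementation would need the full Unicode data (for instance through ICU).

```cpp
#include <cstddef>
#include <string>

// Assumption: only U+0300..U+036F is treated as a combining mark.
// Real text needs the complete Unicode property tables.
inline bool isCombiningMark(char32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Hypothetical helper: keep the first maxChars base characters together with
// any combining marks that follow them.
std::u32string prefixByCharacters(const std::u32string& s, std::size_t maxChars)
{
    std::size_t chars = 0;
    std::size_t i = 0;
    while (i < s.size()) {
        if (!isCombiningMark(s[i])) {
            if (chars == maxChars)
                break;      // next base character would exceed the limit
            ++chars;
        }
        ++i;                // marks stay attached to the preceding base
    }
    return s.substr(0, i);
}
```

For the sequence 'e', U+0301, 'a', 'b', asking for one character returns two code points ('e' plus its accent) instead of cutting between them.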

Otherwise, there is probably what you seek in ICU. I wish you good luck finding it.

Donaugh answered 1/6, 2012 at 11:38 Comment(7)

Unicode is not hard if your language comes with a decent standard library. C++ doesn't have such. Also, LL is not a letter in Spanish; it used to be considered such due to some Royal Academy stupidity but they finally admitted it's a digraph and for quite some time it hasn't been considered a letter in academia, textbooks, Spanish locales, and so on. The confusion came from Spanish letters being so close to phonemes, and LL and CH being used to represent different phonemes. – Regulation 6/6, 2012 at 9:3

@MiguelPérez: Ah glad to know, I learned Spanish a few years ago and the dictionary confused me a lot because of that weirdness. Unfortunately it is not the only language where this occurs ;) – Donaugh 6/6, 2012 at 9:41

"diacritic (they follow the character they qualify)" what is the proper name for this? Masochism? – Codify 17/8, 2012 at 1:39

@curiousguy: whether following or preceding, I do not think it would be any better. And it does avoid the combinatorial explosion of possibilities... and make our life really difficult :( – Donaugh 17/8, 2012 at 13:2

What I mean is: with prefixing, if you read a diacritic, you know it must be followed by another code point, and you have to consume one more code point. But with postfixing, if you read a letter, you do not know anything. It could be a letter alone, or it could combine with the following diacritic: you have to look at the next code point, just in case (how can you even do that on blocking streams?). – Codify 17/8, 2012 at 20:0

If the following code point is not a diacritic, you have to put it back in some buffer. With istream you can use unget() (sungetc() for streambuf, but it only works if an "input sequence putback position" is available, and I don't remember when, if ever, that is guaranteed); for a Unix file (tty, TCP, pipe...), you have ... nothing AFAIK. – Codify 17/8, 2012 at 20:0

@curiousguy: with istream you can peek at the next character without taking it out :) I do agree though that in hindsight prefixing would have been better. – Donaugh 18/8, 2012 at 9:52

Based on this I've written my utf8 substring function:

void utf8substr(const std::string& originalString, int SubStrLength, std::string& csSubstring)
{
    int len = 0;
    size_t byteIndex = 0;
    const size_t origSize = originalString.size();

    for (byteIndex = 0; byteIndex < origSize; byteIndex++)
    {
        // Every byte that is not a continuation byte (10xxxxxx) starts a new character.
        if ((originalString[byteIndex] & 0xc0) != 0x80)
            len += 1;

        // Stop at the first byte of the character that follows the requested ones.
        if (len > SubStrLength)
            break;
    }

    csSubstring = originalString.substr(0, byteIndex);
}

Lexie answered 22/1, 2016 at 6:55 Comment(0)

Let me assume for simplicity that your encoding is UTF-8. In this case some characters occupy more than one byte, as in your case. You have a std::string in which those UTF-8 encoded characters are stored, and you want to substr() in terms of characters, not bytes. I'd write a function that converts a character count to a byte count. For the UTF-8 case it would look like:

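// Length in bytes of the UTF-8 sequence that starts with the given lead byte.
#define UTF8_CHAR_LEN( byte ) ((( 0xE5000000 >> ((( byte ) >> 3 ) & 0x1e )) & 3 ) + 1 )

// Assumes well-formed UTF-8; returns the number of bytes taken by the first charCnt characters.
int GetByteCountForCharCount(const char* utf8Str, int charCnt)
{
    int ByteCount = 0;
    // Stop early at the terminating '\0' so a short string is not overrun.
    for (int i = 0; i < charCnt && *utf8Str != '\0'; i++)
    {
        int charlen = UTF8_CHAR_LEN(*utf8Str);
        ByteCount += charlen;
        utf8Str += charlen;
    }
    return ByteCount;
}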

So, say you want to substr() the string starting at character index 7 (that is, skipping the first 7 characters). No problem:

int pos = GetByteCountForCharCount(str.c_str(), 7);
std::string tail = str.substr(pos);
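The same idea can be wrapped so that it takes a std::string and mirrors the (position, count) shape of std::string::substr. The following is only a sketch under the assumption of well-formed UTF-8; byteOffsetOfChar and utf8Substr are names invented for this example, and it detects character starts by skipping continuation bytes rather than with the macro above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: byte offset of the code point with index charIndex,
// or s.size() if the string has fewer code points.
std::size_t byteOffsetOfChar(const std::string& s, std::size_t charIndex)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {  // start of a code point
            if (seen == charIndex)
                return i;
            ++seen;
        }
    }
    return s.size();
}

// Hypothetical helper: substring measured in code points instead of bytes.
std::string utf8Substr(const std::string& s, std::size_t charPos,
                       std::size_t charCount = std::string::npos)
{
    const std::size_t first = byteOffsetOfChar(s, charPos);
    // There are never more code points than bytes left, so a large count means "to the end".
    if (charCount >= s.size() - first)
        return s.substr(first);
    const std::size_t last = byteOffsetOfChar(s, charPos + charCount);
    return s.substr(first, last - first);
}
```

Unlike std::string::substr, this sketch returns an empty string instead of throwing when charPos is past the end.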

Depose answered 1/6, 2012 at 9:4 Comment(1)

Why oh why using a macro instead of an inline function ? Why oh why passing a char const* instead of a std::string const&. Please, use C++ idioms in C++ questions. – Donaugh 1/6, 2012 at 11:34
