How to detect unicode string width in terminal?

Asked 23/5, 2016 at 17:30 Answered 23/5, 2016 at 19:10

I'm working on a terminal-based program that has Unicode support. There are certain cases where I need to determine how many terminal columns a string will consume before I print it. Unfortunately, some characters are two columns wide (Chinese, etc.), but I found this answer that indicates a good way to detect fullwidth characters is to call u_getIntPropertyValue() from the ICU library.

Now I'm trying to parse the characters of my UTF-8 string and pass them to this function. The problem I'm having is that u_getIntPropertyValue() expects a UTF-32 code point.

What is the best way to obtain this from a UTF-8 string? I'm currently trying to do this with boost::locale (used elsewhere in my program), but I'm having trouble getting a clean conversion. The UTF-32 strings that come from boost::locale are prepended with a zero-width character (a byte order mark) to indicate byte order. Obviously I can just skip the first four bytes of the string, but is there a cleaner way to do this?

Here is my current ugly solution:

inline size_t utf8PrintableSize(const std::string &str, std::locale loc)
{
    namespace ba = boost::locale::boundary;
    ba::ssegment_index map(ba::character, str.begin(), str.end(), loc);
    size_t widthCount = 0;
    for (ba::ssegment_index::iterator it = map.begin(); it != map.end(); ++it)
    {
        ++widthCount;
        std::string utf32Char = boost::locale::conv::from_utf(it->str(), std::string("utf-32"));

        UChar32 utf32Codepoint = 0;
        memcpy(&utf32Codepoint, utf32Char.c_str()+4, sizeof(UChar32));

        int width = u_getIntPropertyValue(utf32Codepoint, UCHAR_EAST_ASIAN_WIDTH);
        if ((width == U_EA_FULLWIDTH) || (width == U_EA_WIDE))
        {
            ++widthCount;
        }

    }
    return widthCount;
}

Addendum answered 23/5, 2016 at 17:30 Comment(3)

If you already use ICU, why not use it for utf8-to-utf32 conversion too? – Bullhead 23/5, 2016 at 17:35

I'm not familiar with the ICU. I was trying to use boost::locale to insulate me from most of the complexity. Is there an easy way to get this utf32 code point from ICU directly? – Addendum 23/5, 2016 at 17:38

I'm not familiar with it either but I know it has everything anyone ever wanted from a unicode library. Spend some time with google and you will find it. – Bullhead 23/5, 2016 at 17:44

UTF-32 is the direct representation of the "code points" of the individual characters. So all you need to do is extract those from the UTF-8 characters and feed this to u_getIntPropertyValue.

I took your code and modified it to use u8_to_u32_iterator, which seems to be made just for this:

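#include <boost/regex/pending/unicode_iterator.hpp>
#include <unicode/uchar.h>
#include <locale>
#include <string>

// Note: loc is kept only so the signature matches the original; it is unused here.
inline size_t utf8PrintableSize(const std::string &str, std::locale loc)
{
    size_t widthCount = 0;
    // str is const, so iterate with const_iterator; each *it is one UTF-32 code point.
    for(boost::u8_to_u32_iterator<std::string::const_iterator> it(str.begin()), end(str.end()); it!=end; ++it)
    {
        ++widthCount;

        int width = u_getIntPropertyValue(*it, UCHAR_EAST_ASIAN_WIDTH);
        if ((width == U_EA_FULLWIDTH) || (width == U_EA_WIDE))
        {
            ++widthCount;
        }

    }
    return widthCount;
}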
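
For reference, here is roughly what that UTF-8 to UTF-32 step does, written out by hand with only the standard library. This is a minimal sketch, not how Boost or ICU implement it: decodeUtf8 is a hypothetical helper name, it replaces malformed sequences with U+FFFD, and it does not reject overlong encodings or surrogate values the way a production decoder should.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of Boost or ICU): decode UTF-8 into code points.
// Assumption: malformed input is replaced with U+FFFD rather than reported.
// Overlong encodings and surrogates are NOT rejected in this sketch.
inline std::vector<char32_t> decodeUtf8(const std::string &str)
{
    std::vector<char32_t> out;
    const char32_t replacement = 0xFFFD;
    std::size_t i = 0;
    while (i < str.size())
    {
        const unsigned char lead = static_cast<unsigned char>(str[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // continuation bytes expected after the lead byte

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else
        {
            // Stray continuation byte or invalid lead byte.
            out.push_back(replacement);
            ++i;
            continue;
        }

        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k)
        {
            // Each continuation byte must look like 10xxxxxx.
            if (i >= str.size() ||
                (static_cast<unsigned char>(str[i]) & 0xC0) != 0x80)
            {
                ok = false; // truncated sequence; resume at the offending byte
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(str[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : replacement);
    }
    return out;
}
```

Each element of the returned vector could then be passed to u_getIntPropertyValue in place of *it. In practice the Boost iterator (or ICU's own conversion) is the safer choice, since it handles the validation this sketch skips.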

Appoint answered 23/5, 2016 at 19:10 Comment(1)

Thank you for the boost implementation. Interesting that this is part of the regex library and not locale. – Addendum 23/5, 2016 at 19:38

@n.m was correct: there is an easy way to do this with ICU directly. Updated code is below. I suspect I can probably just use UnicodeString and bypass the whole boost::locale usage in this scenario.

inline size_t utf8PrintableSize(const std::string &str, std::locale loc)
{
    namespace ba = boost::locale::boundary;
    ba::ssegment_index map(ba::character, str.begin(), str.end(), loc);
    size_t widthCount = 0;
    for (ba::ssegment_index::iterator it = map.begin(); it != map.end(); ++it)
    {
        ++widthCount;

        //Note: Some unicode characters are 'full width' and consume more than one
        // column on output.  We will increment widthCount one extra time for
        // these characters to ensure that space is properly allocated
        UnicodeString ucs = UnicodeString::fromUTF8(StringPiece(it->str()));
        UChar32 codePoint = ucs.char32At(0);

        int width = u_getIntPropertyValue(codePoint, UCHAR_EAST_ASIAN_WIDTH);
        if ((width == U_EA_FULLWIDTH) || (width == U_EA_WIDE))
        {
            ++widthCount;
        }

    }
    return widthCount;
}

Addendum answered 23/5, 2016 at 18:51 Comment(3)

Don't forget to handle zero-width characters too! – Heirship 23/5, 2016 at 19:7

@Heirship do you know how to check for this? I'm turning up blanks with my probably-misguided google search. – Addendum 23/5, 2016 at 19:41

Something like General_Category in {"Mn", "Me"} or Default_Ignorable_Code_Point - the latter includes formatting characters, soft hyphen, etc. But then, you also have to do even more complex stuff for Hangul combining, which depends on what the preceding character was. – Heirship 23/5, 2016 at 20:21
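
To make the zero-width suggestion concrete, here is a rough sketch of how such a check could slot into the per-code-point width calculation. Everything in it is an assumption for illustration: isZeroWidthSketch and columnsFor are hypothetical names, and the hard-coded ranges are a small, deliberately incomplete sample of Mn, Me and Default_Ignorable_Code_Point characters. With ICU available, the table would more sensibly be replaced by u_charType() (checking for U_NON_SPACING_MARK and U_ENCLOSING_MARK) and u_hasBinaryProperty() with UCHAR_DEFAULT_IGNORABLE_CODE_POINT. Hangul combining is not handled at all.

```cpp
#include <cstddef>

// Hypothetical helper: true for a SMALL SAMPLE of zero-width code points.
// Assumption: this table is illustrative only and far from complete; a real
// program would query the Unicode properties through ICU instead.
inline bool isZeroWidthSketch(char32_t cp)
{
    return (cp >= 0x0300 && cp <= 0x036F)   // Combining Diacritical Marks (Mn)
        || (cp >= 0x20DD && cp <= 0x20E0)   // combining enclosing marks (Me)
        || cp == 0x00AD                     // soft hyphen (terminals differ on this one)
        || (cp >= 0x200B && cp <= 0x200F)   // zero width space/joiners, direction marks
        || (cp >= 0xFE00 && cp <= 0xFE0F);  // variation selectors
}

// Hypothetical helper: columns for one code point. isWide is expected to be
// the result of the East Asian Width test (U_EA_FULLWIDTH or U_EA_WIDE).
inline std::size_t columnsFor(char32_t cp, bool isWide)
{
    if (isZeroWidthSketch(cp))
    {
        return 0; // combining or ignorable: occupies no column of its own
    }
    return isWide ? 2 : 1;
}
```

Inside the loop, widthCount += columnsFor(codePoint, wide) would replace the unconditional ++widthCount, so that combining marks no longer add a column.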
