Determine whether a Unicode character is fullwidth or halfwidth in C++

Asked 27/2, 2013 at 14:15 Answered 9/7, 2014 at 6:0

I'm writing a terminal (console) application that is supposed to wrap arbitrary Unicode text.

Terminals usually use a monospaced (fixed-width) font, so wrapping text is little more than counting characters, checking whether a word still fits on the line, and acting accordingly.

The problem is that Unicode contains fullwidth characters that take up the width of 2 characters in a terminal.

Counting code points treats such a character as 1, but it prints 2 "normal" (halfwidth) characters wide. This breaks the wrapping routine, because it is not aware of characters that take up twice the width.

As an example, this is a fullwidth character (U+3004, the JIS symbol):

〄
12

It does not take up the full width of 2 characters here although it's preformatted, but it does use twice the width of a western character in a terminal.

To deal with this, I have to distinguish between fullwidth and halfwidth characters, but I cannot find a way to do so in C++. Is it really necessary to know all fullwidth characters in the Unicode table to get around the problem?

Immaterial answered 27/2, 2013 at 14:15 Comment(4)

Relevant icu-project.org/apiref/icu4c/… and unicode.org/reports/tr11 – Anteversion 27/2, 2013 at 14:21

For which OS/Platform? – Milwaukee 27/2, 2013 at 14:27

Sorry I missed that. OS is Linux. – Immaterial 27/2, 2013 at 14:50

I'm not sure how terminals will handle super wide characters like these. Not on my Linux right now to test printing ௵ 𒈙 𒐫﷽ – Leadwort 11/3, 2018 at 3:10

You should use ICU's u_getIntPropertyValue with the UCHAR_EAST_ASIAN_WIDTH property.

For example:

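#include <unicode/uchar.h>  // ICU character property API

// True if c has East Asian Width "Fullwidth" or "Wide", i.e. it occupies 2 cells.
bool is_fullwidth(UChar32 c) {
    int width = u_getIntPropertyValue(c, UCHAR_EAST_ASIAN_WIDTH);
    return width == U_EA_FULLWIDTH || width == U_EA_WIDE;
}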

Note that if your graphics library supports combining characters, then you'll have to consider those as well when determining how many cells a sequence uses; for example, e followed by U+0301 COMBINING ACUTE ACCENT takes up only 1 cell.
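
To make the combining-character point concrete, here is a minimal sketch of counting cells for a whole sequence. The names cell_count and is_combining_diacritic are hypothetical and not part of ICU; the classifiers are passed in, so the ICU-based is_fullwidth could be used for the first one. The combining check is only a stand-in that covers the Combining Diacritical Marks block (U+0300..U+036F); with ICU, something like u_charType(c) == U_NON_SPACING_MARK would probably be a better test, though that is an assumption about what your terminal treats as zero-width.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical helper: number of terminal cells used by a sequence of
// code points. Both classifiers are supplied by the caller.
std::size_t cell_count(const std::u32string& text,
                       const std::function<bool(char32_t)>& is_fullwidth,
                       const std::function<bool(char32_t)>& is_combining) {
    std::size_t cells = 0;
    for (char32_t c : text) {
        if (is_combining(c)) {
            continue;  // drawn on top of the previous character, no extra cell
        }
        cells += is_fullwidth(c) ? 2 : 1;
    }
    return cells;
}

// Stand-in classifier for illustration only: it recognizes just the
// Combining Diacritical Marks block (U+0300..U+036F), nothing else.
bool is_combining_diacritic(char32_t c) {
    return c >= 0x0300 && c <= 0x036F;
}
```

Called on e followed by U+0301 with a fullwidth test that always returns false, this counts 1 cell, matching the example above. It assumes the text has already been decoded from UTF-8 into code points.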

Sweetening answered 27/2, 2013 at 14:24 Comment(2)

I'm about to replace all calls to ICU right now to minimize dependencies. Maybe I can build a table of all fullwidth characters with the help of the u_getIntPropertyValue method. Thanks for the hint to the combining characters. I will check whether this applies to terminals, too. – Immaterial 27/2, 2013 at 14:53

@Immaterial It may no longer be relevant for you, but I've recently put together the character ranges for a similar question, here: https://mcmap.net/q/1315344/-validate-japanese-character-in-active-record-callback – Suck 8/4, 2013 at 8:54

There's no need to build tables; Markus Kuhn has already done that, based on the Unicode data:

http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

The same code is used in terminal emulators such as xterm and konsole, and quite likely others.
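
For readers who only want the idea without pulling in the file, the following is a rough sketch of the kind of range test it performs. The function name approx_cell_width is hypothetical, and the ranges are written from memory of that file rather than copied from it; they reflect an older Unicode version, so treat them as an approximation and check them against the linked source or the Unicode EastAsianWidth.txt data before relying on them. Control characters and combining marks, which the real code also handles, are deliberately left out.

```cpp
#include <cstdint>

// Hypothetical helper: returns 2 for code points in ranges commonly
// treated as wide, otherwise 1. Approximate and incomplete by design;
// control characters and zero-width combining marks are not handled.
int approx_cell_width(std::uint32_t cp) {
    struct Range { std::uint32_t first, last; };
    static const Range wide[] = {
        {0x1100, 0x115F},    // Hangul Jamo initial consonants
        {0x2E80, 0x303E},    // CJK radicals, symbols and punctuation
        {0x3040, 0xA4CF},    // Hiragana through Yi
        {0xAC00, 0xD7A3},    // Hangul syllables
        {0xF900, 0xFAFF},    // CJK compatibility ideographs
        {0xFE30, 0xFE6F},    // CJK compatibility forms
        {0xFF00, 0xFF60},    // fullwidth forms
        {0xFFE0, 0xFFE6},    // fullwidth signs
        {0x20000, 0x2FFFD},  // supplementary ideographic plane
        {0x30000, 0x3FFFD},  // tertiary ideographic plane
    };
    for (const Range& r : wide) {
        if (cp >= r.first && cp <= r.last) {
            return 2;
        }
    }
    return 1;
}
```

With this sketch, U+3004 from the question falls into the second range and is reported as 2 cells wide, while an ASCII letter is reported as 1.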

Fought answered 9/7, 2014 at 6:0 Comment(0)
