C++ project type: unicode vs multi-byte; pros and cons

Asked 17/6, 2010 at 17:33 Answered 22/6, 2010 at 8:47

I'm wondering what the Stack Overflow community thinks about creating a project (thinking primarily of C++ here) with a Unicode or a multi-byte character set.

Are there pros to going Unicode straight from the start, implying all your strings will be in wide format? Are there performance issues / larger memory requirements because of a standard use of a larger character?
Is there an advantage to this method? Do some processor architectures handle wide characters better?
Are there any reasons to make your project Unicode if you don't plan on supporting additional languages?
What reasons would one have for creating a project with a multi-byte character set?
How do all of the factors above collide in a high performance environment (such as a modern video game) ?

Noenoel answered 17/6, 2010 at 17:33 Comment(7)

seems a bit subjective, also a lot like a question a professor would give. Namely these parts: What reasons would one have for creating a project with a multi-byte character set? How do all of the factors above collide in a high performance environment (such as a modern video game) ? – Tipstaff 17/6, 2010 at 17:37

"Are there any reasons to make your project Unicode if you don't plan on supporting additional languages?" If you plan on using characters with codepoints between 128 and 255, yes. Dealing with code pages can be quite annoying. – Indrawn 17/6, 2010 at 17:40

UTF-8 is a multi-byte character set (variable-length character encoding), is it not? UTF-16 is also a variable-length character encoding. – Rivas 17/6, 2010 at 17:46

What exactly do you mean by a multi-byte character set? All character encodings that support all unicode characters encode most characters with more than one byte per character. If you mean a variable width encoding then this does not exclude unicode support. UTF-8 is a very prevalent variable width character encoding that supports all unicode characters. – Riess 17/6, 2010 at 17:47

I'm not entirely sure, but I know in the character set you can specify a multi-byte one that supports ANSI as well as unicode sets, and chars default to ASCII I believe. I'm wondering if it's worthwhile converting everything to wide chars, essentially. – Noenoel 17/6, 2010 at 17:54

Converting everything to "wide chars" isn't the same thing as supporting unicode. How you support unicode really depends on what you're actually doing and what APIs you plan to use. – Riess 17/6, 2010 at 17:59

Make sure you've read these two links; they may help clarify why your title "unicode vs multi-byte" and "Unicode... implying all your strings will be in wide format" are incorrect: #2260044 and joelonsoftware.com/articles/Unicode.html – Pearlstein 17/6, 2010 at 18:25

Two issues I'd comment on.

First, you don't mention what platform you're targeting. Although recent Windows versions (Win2000, WinXP, Vista and Win7) support both multibyte and Unicode versions of the system calls that take strings, the Unicode versions are faster (the multibyte versions are wrappers that convert to Unicode, call the Unicode version, then convert any returned strings back to multibyte). So if you're making a lot of these calls, the Unicode build will be faster.

Second, even if you're not planning on explicitly supporting additional languages, you should still consider supporting Unicode if your application saves and displays text entered by the users. Just because your application is unilingual, it doesn't follow that all its users will be unilingual too. They may be perfectly happy to use your English-language GUI, but might want to enter names, comments or other text in their own language and have it displayed properly.

Strangle answered 17/6, 2010 at 18:20 Comment(2)

"you should still consider supporting Unicode if your application saves and displays text entered by the users" - and if your application wants to deal with paths with arbitrary characters - and if it deals in any way with paths, it should. – Goldina 17/6, 2010 at 18:26

This is exactly what I wanted to hear.. that one is a wrapper for the other. Unicode all the way baby. – Noenoel 17/6, 2010 at 18:56

You are talking about the VC++ Project setting here, right?

The only thing it affects is which version of the Win32 API calls ends up being executed. For instance, a call to MessageBox will end up as a call to MessageBoxA with the multi-byte setting, and to MessageBoxW with the Unicode setting. Of course, that affects the types of the string parameters to those functions as well. Internally, MessageBoxA calls MessageBoxW after converting the string parameters from the current system locale to Unicode.
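To make the mechanism concrete, here is a minimal portable sketch of the same pattern. ShowTextA, ShowTextW, ShowText and TEXT_LIT are hypothetical names invented for this illustration, not Win32 functions; the real headers do the equivalent for MessageBox, and the real "A" functions convert from the active code page rather than widening byte by byte.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a "W" function: it does the real work.
// Here the "work" is just reporting how many wide characters it received.
inline std::size_t ShowTextW(const wchar_t* text) {
    assert(text != nullptr);
    return std::wstring(text).size();
}

// Hypothetical stand-in for an "A" function: a thin wrapper that widens
// its argument and forwards to the W version.
inline std::size_t ShowTextA(const char* text) {
    assert(text != nullptr);
    const std::string narrow(text);
    // Naive widening, only correct for 7-bit ASCII input. This is a
    // simplification of the code-page conversion the real API performs.
    const std::wstring wide(narrow.begin(), narrow.end());
    return ShowTextW(wide.c_str());
}

// The project setting boils down to whether UNICODE is defined,
// which selects the function and the literal type at compile time.
#ifdef UNICODE
#  define ShowText ShowTextW
#  define TEXT_LIT(s) L##s
#else
#  define ShowText ShowTextA
#  define TEXT_LIT(s) s
#endif
```

A call written as ShowText(TEXT_LIT("hello")) compiles against either setting; only the function it resolves to, and the extra conversion step in the narrow build, change.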

My advice is to use the Unicode settings and pass Unicode strings to Win32 API calls. That does not stop you from using strings in any other encoding internally.

Seleta answered 17/6, 2010 at 18:36 Comment(0)

The short answer (IMO, and I've been proven wrong before) is that it'd be better to plan for the worst (or best, depending on your point of view) and do Unicode right now.

Unless your application is very string-intensive, going directly to Unicode will not really matter performance-wise; in the case of games, it should not be a big factor compared to the rest of the engine.

Max.

Niehaus answered 17/6, 2010 at 17:38 Comment(3)

What if, for some magical reason, you are using a character string in a tight loop. Will there be a sizable performance difference? – Noenoel 17/6, 2010 at 17:48

@Stefan: That depends on what you're doing with that string. If you're copying it, and it consists mostly of ASCII characters, the MB version will be a bit shorter, and so copying it may be faster. If you're doing actual string processing, the Unicode version will likely be more efficient, because of its simpler structure. But really, this is such an absurdly hypothetical what-if question it's pointless. Your answer is "it doesn't matter performance-wise, and it never will, and if it does, you should test both and see what works best" – Convent 17/6, 2010 at 18:10

Also, if it matters performance wise you can just optimize that specific loop without changing the project type. – Volotta 17/6, 2010 at 18:34

Here's a simple consideration: should your program work if it's used by Mr. 菅直人 ? His home directory might be hard to represent in ASCII.

Deplorable answered 22/6, 2010 at 8:47 Comment(0)

Are there pros to going Unicode straight from the start,

A few years and a million lines of code later, you're going to wish you had answered "yes".

implying all your strings will be in wide format?

I wish Microsoft would quit conflating "Unicode" with UTF-16.

You don't have to store all your strings in wide format. You can use UTF-8 instead, and get a smaller memory footprint (for Latin alphabet languages), and backwards compatibility with 7-bit ASCII.
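As a rough illustration of the footprint claim, the sketch below compares the storage needed for the same text in both encodings. The helper names are made up for this example, and the figures ignore terminators and container overhead. Note that the advantage reverses for CJK text, where UTF-8 needs three bytes per character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes occupied by the code units alone (no terminator, no container overhead).
inline std::size_t utf8_bytes(const std::string& s) {
    return s.size() * sizeof(char);
}
inline std::size_t utf16_bytes(const std::u16string& s) {
    return s.size() * sizeof(char16_t);
}

// The same Latin text in both encodings.
inline std::string    latin_utf8()  { return "Hello, world"; }
inline std::u16string latin_utf16() { return u"Hello, world"; }

// The three characters U+83C5 U+76F4 U+4EBA, written as explicit bytes and
// universal character names so the source file encoding does not matter.
inline std::string    cjk_utf8()  { return "\xE8\x8F\x85\xE7\x9B\xB4\xE4\xBA\xBA"; }
inline std::u16string cjk_utf16() { return u"\u83C5\u76F4\u4EBA"; }

// Checks the comparison on platforms where char16_t is two bytes wide.
inline void check_footprints() {
    static_assert(sizeof(char16_t) == 2, "example assumes a 2-byte char16_t");
    assert(utf8_bytes(latin_utf8()) == 12);   // 1 byte per ASCII character
    assert(utf16_bytes(latin_utf16()) == 24); // 2 bytes per character
    assert(utf8_bytes(cjk_utf8()) == 9);      // 3 bytes per character
    assert(utf16_bytes(cjk_utf16()) == 6);    // 2 bytes per character
}
```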

The one downside to using UTF-8 on Windows is that it's not supported as an ANSI code page, so you have to convert your strings to UTF-16 to make WinAPI calls. How much inconvenience this causes depends on whether you're writing a Windows program or a program that just happens to run on Windows.
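The conversion step itself is small. On Windows one would normally call MultiByteToWideChar with CP_UTF8; the sketch below is a hand-written portable stand-in so it can be read and compiled anywhere. The function name utf8_to_utf16 is an assumption of this example, and the validation is deliberately minimal (it does not reject overlong forms or encoded surrogates).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Decode UTF-8 bytes into UTF-16 code units.
// Throws std::invalid_argument on bad lead bytes, bad continuation bytes,
// truncated sequences, and code points above U+10FFFF.
inline std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");
        if (cp < 0x10000) {
            // Basic Multilingual Plane: one UTF-16 code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Supplementary plane: encode as a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// A few checked examples of the conversion.
inline void check_conversion() {
    assert(utf8_to_utf16("abc") == u"abc");
    // U+83C5 U+76F4 U+4EBA, three bytes each in UTF-8.
    assert(utf8_to_utf16("\xE8\x8F\x85\xE7\x9B\xB4\xE4\xBA\xBA")
           == u"\u83C5\u76F4\u4EBA");
}
```

A program that keeps UTF-8 internally would call a function like this only at the boundary, just before passing the string to a "W" API.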

Scrubber answered 18/6, 2010 at 2:52 Comment(0)

The first answer to that question should... answer everything you need to know.

Peeler answered 17/6, 2010 at 17:58 Comment(1)

This is basically a link-only answer. Maybe the question should have been closed as a duplicate. Also, "the first answer" may differ depending on how SO changes the site, whether answers are ordered by votes, etc. It's hard to know exactly what you're referring to. – Saharanpur 1/4, 2022 at 9:24
