When should we prefer wide-character strings?

Asked 31/8, 2017 at 14:12 Answered 31/8, 2017 at 15:26

I am modernizing a large, legacy MFC codebase which contains a veritable medley of string types:

CString
std::string
std::wstring
char*
wchar_t*
_bstr_t

I'd like to standardize on a single string type internally, and convert to other types only when absolutely required by a third-party API (e.g. COM or MFC functions). The question my coworkers and I are debating: which string type should we standardize on?

I would prefer one of the C++ standard strings: std::string or std::wstring. I'm personally leaning toward std::string, because we do not have any need for wide characters - it is an internal codebase with no customer-facing UI (i.e. no need for multiple-language support). "Plain" strings allow us to use simple, unadorned string literals ("Hello world" vs L"Hello world" or _T("Hello world")).

Is there an official stance from the programming community? When faced with multiple string types, what is typically used as the standard 'internal' storage format?

Finbur answered 31/8, 2017 at 14:12 Comment(9)

"UTF8 everywhere" comes to mind. But Windows isn't UTF8-friendly. Note however that you really should treat file names as distinct types. Using boost may be a good choice. – Discordance 31/8, 2017 at 14:14

Windows internally is UTF-16LE so std::wstring is a good fit for that platform; so is std::vector<wchar_t>. – Hierodule 31/8, 2017 at 14:14

For a Windows application use std::wstring. With narrow strings you'd need conversions all over the place. Note: since you don't already know this, you're not a good choice for person to do the job, it's basics. That choice is your manager's fault. – Denazify 31/8, 2017 at 14:18

Windows provides narrow-char alternatives for nearly all APIs. In-code conversions would not be necessary. They may be performed behind the scenes, but that's not really a concern. It reeks of premature micro-optimizations. – Finbur 31/8, 2017 at 14:20

Re _T("Hello world"), the T macros were obsoleted in the year 2000 by the introduction of Layer for Unicode, and today our tools can't produce executables for the Windows versions (9x) that these macros target. I understand it's a legacy codebase. But when your task is to clean it up, mentioning T macros as convenient is absurd and very counter-productive. – Denazify 31/8, 2017 at 14:21

If you choose narrow chars then all you need to break your program is one employee with a non-latin name and you hit encoding problems for the user and below directories. – Hierodule 31/8, 2017 at 14:22

utf8everywhere.org – Absinthe 31/8, 2017 at 14:22

@BTownTKD I see a general tendency of newer APIs to provide a wide-char interface only. – Thorstein 31/8, 2017 at 14:23

Things like bstr_t you'll need when you're interacting with COM and windows provides various functions to create them. Elsewhere you should just use std::wstring and wchar_t if you're coding for Windows exclusively. It's easier. – Lauraine 31/8, 2017 at 14:42

If we are talking about Windows, then I'd use std::wstring (because we often need cool string features), or wchar_t* if you just pass strings around.

Note that Microsoft recommends this here: Working with Strings

Windows natively supports Unicode strings for UI elements, file names, and so forth. Unicode is the preferred character encoding, because it supports all character sets and languages. Windows represents Unicode characters using UTF-16 encoding, in which each character is encoded as a 16-bit value. UTF-16 characters are called wide characters, to distinguish them from 8-bit ANSI characters. The Visual C++ compiler supports the built-in data type wchar_t for wide characters

Also:

When Microsoft introduced Unicode support to Windows, it eased the transition by providing two parallel sets of APIs, one for ANSI strings and the other for Unicode strings. [...] Internally, the ANSI version translates the string to Unicode.

Also:

New applications should always call the Unicode versions. Many world languages require Unicode. If you use ANSI strings, it will be impossible to localize your application. The ANSI versions are also less efficient, because the operating system must convert the ANSI strings to Unicode at run time. [...] Most newer APIs in Windows have just a Unicode version, with no corresponding ANSI version.
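One caveat to the quoted "each character is encoded as a 16-bit value": that holds for characters in the Basic Multilingual Plane, while characters outside it occupy two 16-bit code units (a surrogate pair). The sketch below illustrates the difference. It uses char16_t so that it behaves the same on every platform (wchar_t is 16 bits on Windows but typically 32 bits elsewhere), and the function names are made up for this example.

```cpp
#include <cstddef>
#include <string>

// Number of UTF-16 code units, which is what size() and length() report.
inline std::size_t utf16_code_units(const std::u16string& s) {
    return s.size();
}

// Number of Unicode code points, counting each surrogate pair once.
inline std::size_t utf16_code_points(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        // A high surrogate followed by a low surrogate forms one code point.
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ++i;  // skip the low surrogate
        }
        ++count;
    }
    return count;
}
```

So even with std::wstring on Windows, size() is a count of code units rather than of user-visible characters, which matters for truncation and indexing.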

Rotberg answered 31/8, 2017 at 14:46 Comment(3)

because we often need cool string features ... could be elaborated a bit. Why not use CString instead, MFC uses it everywhere? Not that I'd recommend doing so ;-) – Thorstein 31/8, 2017 at 14:59

@Thorstein - 10 or 20 years ago (yes, I'm so old :-), I'd have recommended that too, but today, there are so many samples/codes/open source/etc. using std:: and also so many people used to it, that I feel ok with std:: however, I see CString as perfectly ok too, as long as you make sure that no one introduces std:: because of laziness... – Rotberg 31/8, 2017 at 15:5

I'll add the comment that in the OP's specific case std::wstring is IMEHO the best choice. That said, don't make a blind decision to always use std::wstring and wide characters. Consider what the application is doing before making your choice. Anecdotally, many years ago (early 2000's IIRC) I took the source to the Dhrystone benchmark, and converted every instance of char in it to wchar_t. Doing so resulted in about a 15% slowdown, so be aware that use of wide characters does come with a price. It's up to you to make the decision regarding whether that price matters to you. – Squalor 9/8, 2022 at 1:39

It depends.

When programming for Windows, I recommend using std::wstring at least for:

Resources (Strings, Dialogs, etc.)
Filesystem access (Windows allows non-ASCII characters in file and directory names, including all the "wrong kinds of apostrophes"; such files can be impossible to open through the ANSI API when the name is not representable in the active code page)
COM (a BSTR is always wide character)
Other user-facing interfaces (clipboard, system error reporting, etc)

However, it is easier to handle internal ASCII data files and UTF-8-encoded data using narrow (char-based) strings. It's fast, efficient and straightforward.
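If you keep UTF-8 in narrow strings internally, the wide-character parts can be confined to the API boundary by converting just before the call. The following is only a minimal sketch of such a conversion, written by hand so that it is portable: the name utf8_to_utf16 is hypothetical, it produces char16_t (the same 16-bit units that wchar_t holds on Windows), and it replaces malformed input with U+FFFD without rejecting overlong encodings. On Windows itself you would more likely call MultiByteToWideChar with CP_UTF8.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decode UTF-8 and re-encode as UTF-16 code units.
// Simplification: malformed sequences become U+FFFD; overlong forms pass.
inline std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0xFFFD;  // replacement character by default
        std::size_t len = 1;        // bytes consumed for this code point
        std::size_t need = 0;       // continuation bytes expected

        if (b0 < 0x80)                { cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; need = 1; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; need = 2; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; need = 3; }
        // Otherwise: stray continuation or invalid lead byte, keep U+FFFD.

        for (std::size_t k = 1; k <= need; ++k) {
            if (i + k >= in.size() ||
                (static_cast<unsigned char>(in[i + k]) & 0xC0) != 0x80) {
                cp = 0xFFFD;  // truncated or broken sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
            ++len;
        }

        // Reject values that are not valid Unicode scalar values.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Encode as a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

With a helper like this, the rest of the code can stay on std::string and plain literals, and only the thin layer that talks to wide-only APIs needs to know about UTF-16.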

There may also be other aspects that are not mentioned in the question, such as databases or APIs used, input/output files, etc. and their charsets - all of those play a role when deciding on the best data structures for the job.

"UTF-8 everywhere" is a sound idea in general. But there is 0 Windows API that takes UTF-8. Even the std::experimental::filesystem API uses std::wstring on Windows and std::string on POSIX.

Swallow answered 31/8, 2017 at 15:26 Comment(0)
