Should I eliminate TCHAR from Windows code?

I am revising some very old (10 years) C code. The code compiles on Unix/Mac with GCC and cross-compiles for Windows with MinGW. Currently there are TCHAR strings throughout. I'd like to get rid of the TCHAR and use a C++ string instead. Is it still necessary to use the Windows wide functions, or can I do everything now with Unicode and UTF-8?

Easeful answered 11/6, 2011 at 11:16 Comment(4)
Related: https://mcmap.net/q/236655/-is-tchar-still-relevant – Gilletta
Using a C++ std::wstring in C code is not advisable. – Counteroffensive
I have successfully used TCHAR to get several smallish tools to compile under Windows, Linux, and Solaris, each using its native Unicode format (UTF-16 or UTF-8). But it does involve making your own tchar.h for the *nix platforms. – Disorient
In fact, that is what we ended up doing (see the sketch below). – Easeful
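For reference, a minimal sketch of what such a tchar.h shim for the *nix platforms might look like. This is hypothetical (names and file layout invented); it maps the TCHAR names onto plain char/UTF-8, the same mapping the real <tchar.h> uses on Windows when UNICODE is not defined:

```cpp
// tchar_shim.h -- hypothetical tchar.h replacement for non-Windows
// builds. On Windows, include the real <tchar.h> instead.
#ifndef TCHAR_SHIM_H
#define TCHAR_SHIM_H

#include <string.h>
#include <stdio.h>

typedef char TCHAR;         // narrow char; strings are UTF-8 on *nix
#define _T(s)    s          // _T("x") becomes a plain string literal
#define _tcslen  strlen
#define _tcscmp  strcmp
#define _tcscpy  strcpy
#define _tprintf printf
#define _tfopen  fopen

#endif // TCHAR_SHIM_H
```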

Windows still uses UTF-16 and most likely always will, so you need to use std::wstring rather than std::string. The Windows API doesn't offer direct support for UTF-8, largely because Windows supported Unicode before UTF-8 was invented.

It is thus rather painful to write Unicode code that will compile on both Windows and Unix platforms.
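To illustrate that pain: on the Windows side you typically end up writing a conversion helper along these lines (a minimal sketch using MultiByteToWideChar with CP_UTF8; error handling mostly omitted):

```cpp
#include <string>
#include <windows.h>

// Sketch: convert a UTF-8 std::string to the UTF-16 std::wstring
// that the Win32 W functions expect. Standard two-call pattern:
// the first call returns the required length in wchar_t code units.
std::wstring widen(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0,
                                  utf8.data(), (int)utf8.size(),
                                  nullptr, 0);
    if (len <= 0) return std::wstring();  // invalid input; real code should report it
    std::wstring out(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0,
                        utf8.data(), (int)utf8.size(),
                        &out[0], len);
    return out;
}
```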

Allative answered 11/6, 2011 at 13:56 Comment(12)
Windows uses a horrible mixture of UCS-2 and UTF-16. Using characters outside the BMP is somewhat hit-or-miss. – Dacy
@Ben I thought the UCS-2 stuff was mostly limited to console APIs. Is it broader than that? – Allative
@David: Maybe it's a documentation bug, but if you trust the documentation, even WideCharToMultiByte and MultiByteToWideChar only handle UCS-2 (returning the number of UTF-16 characters is useless for buffer allocation). GetWindowTextLength is similarly broken, returning the number of characters (there's a footnote that alludes to multibyte character sets, but states that this special behavior only occurs when mixing ANSI and Unicode). – Dacy
Number of characters means number of wchar_ts. The problem would be if the functions returned the number of code points, but they don't. – Allative
@David: Only in UCS-2 are they the same. Look at the other parameter... specifically in bytes and not characters. The author knows about variable-length encodings, and chose to provide for them on the "MultiByte" side and not on the "WideChar" side. – Dacy
@Ben you are all mixed up here. Those functions are fine. Number of characters is exactly what you need to allocate buffers. It would only be a problem if the functions used the number of code points. They don't. – Allative
@David: There is a 1:1 mapping from characters to code points (the reverse is not true; a few code points are not characters). An encoding of a character can require multiple chars, in the case of UTF-8, or multiple wchar_ts, in the case of UTF-16. The assumption that one character is one wchar_t holds for (1) 32-bit wchar_t, which is not the case on Windows, or (2) UCS-2. This is using the word "character" the way it's used in Unicode literature. When MS uses the word differently, they make a horrible mess, which only works out for UCS-2. – Dacy
@Ben "Character" is a loaded term. But what MS means by it is a TCHAR. – Allative
@David: the technical term for what MS calls "character" is "code unit", in this case "UTF-16 code unit". When technical docs talk about "characters", they almost invariably mean either code points or code units. – Musser
Thanks for the info. It seems that wstring isn't available on Unix systems. – Easeful
wstring is available anywhere that has C++, because it's in the standard library, but it's useless on UNIX, since UNIX APIs speak UTF-8. – Allative
@Philipp: in this context, when MS talks about characters, they mean TCHARs, not code points. The sizes they give are in TCHARs and can be used to allocate buffers. If they gave the size in code points, you would have to calculate the size of a buffer yourself. – Mello
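To make the terminology in that thread concrete: the lengths the Windows APIs report are counts of UTF-16 code units, and a character outside the BMP occupies two of them. A tiny sketch:

```cpp
#include <cstdio>
#include <string>

int main()
{
    // U+1F600 lies outside the BMP, so UTF-16 encodes it as a
    // surrogate pair: two 16-bit code units for one code point.
    std::u16string s = u"\U0001F600";
    std::printf("code units: %zu\n", s.size()); // prints 2
}
```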

Is it still necessary to use the Windows wide functions, or can I do everything now with Unicode and UTF-8?

Yes, it is. Unfortunately, Windows has no native support for UTF-8. If you want proper Unicode support, you need to use the wchar_t versions of the Windows API functions, not the char versions.

Should I eliminate TCHAR from Windows code?

Yes, you should. The reason TCHAR exists is to support both Unicode and non-Unicode versions of Windows. Non-Unicode support may have been a major concern back in 2001 when Windows 98 was still popular, but not today.

And it's highly unlikely that any non-Windows-specific library would have the same kind of char/wchar_t overloading that makes TCHAR usable.

So go ahead and replace all your TCHARs with wchar_ts.
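A hypothetical before/after of such a replacement (function and variable names invented for illustration):

```cpp
#include <windows.h>
#include <cwchar>

// Before: TCHAR code whose meaning flips with the UNICODE macro.
//   TCHAR title[64];
//   GetWindowText(hwnd, title, 64);
//   _tprintf(_T("title: %s\n"), title);

// After: commit to the wide API explicitly.
void print_title(HWND hwnd)
{
    wchar_t title[64];
    GetWindowTextW(hwnd, title, 64);  // call the W function by name
    std::wprintf(L"title: %ls\n", title);
}
```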

The code compiles on Unix/Mac with GCC and cross-compiles for Windows with MinGW.

I've had to write cross-platform C++ code before. (Now my job is writing cross-platform C# code.) Character encoding is rather painful when Windows doesn't support UTF-8 and Un*x doesn't support UTF-16. I ended up using UTF-8 as our main encoding and converting as necessary on Windows.
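On the Windows side, that approach boils down to a pair of conversion helpers. The UTF-16-to-UTF-8 direction is sketched below (the mirror image of the widen() sketch in an earlier answer; error handling mostly omitted):

```cpp
#include <string>
#include <windows.h>

// Sketch: convert UTF-16 coming back from the Win32 W functions
// into a UTF-8 std::string. With CP_UTF8, the last two arguments
// of WideCharToMultiByte must be null.
std::string narrow(const std::wstring& utf16)
{
    if (utf16.empty()) return std::string();
    int len = WideCharToMultiByte(CP_UTF8, 0,
                                  utf16.data(), (int)utf16.size(),
                                  nullptr, 0, nullptr, nullptr);
    if (len <= 0) return std::string();  // invalid input; real code should report it
    std::string out(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0,
                        utf16.data(), (int)utf16.size(),
                        &out[0], len, nullptr, nullptr);
    return out;
}
```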

Gilletta answered 11/6, 2011 at 16:10 Comment(1)
UTF-8 Everywhere likewise suggests using UTF-8 everywhere and converting as necessary. – Trinitroglycerin

Yes. Writing non-Unicode applications nowadays is shooting yourself in the foot. Just use the wide API everywhere, and you won't have to cry about it later. You can still use UTF-8 on UNIX and wchar_t on Windows if you don't need (network) communication between the platforms (or convert the wchar_ts to UTF-8 with the Win32 API), or go the hard way: use UTF-8 everywhere and convert to wchar_t only when you call Win32 API functions (that's what I do).
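A sketch of that last pattern, keeping UTF-8 std::strings throughout and converting only at the Win32 boundary (widen() as sketched in another answer; the helper name here is invented):

```cpp
#include <string>
#include <windows.h>

std::wstring widen(const std::string& utf8); // UTF-8 -> UTF-16, as sketched above

// Hypothetical helper: the rest of the codebase passes UTF-8
// std::strings around; only this boundary call goes wide.
HANDLE open_for_reading(const std::string& utf8_path)
{
    return CreateFileW(widen(utf8_path).c_str(),
                       GENERIC_READ, FILE_SHARE_READ, nullptr,
                       OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
}
```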

Polyadelphous answered 11/6, 2011 at 13:58 Comment(0)

To directly answer your question:

Is it still necessary to use the Windows wide functions, or can I do everything now with Unicode and UTF-8?

No, (non-ASCII) UTF-8 is not accepted by the vast majority of Windows API functions. You still have to use the wide APIs.

One could similarly bemoan that the native APIs of other OSes still have no support for wchar_t strings. So you also have to support UTF-8.

The other answers provide some good advice on how to manage this in a cross-platform codebase, but it sounds as if you already have an implementation supporting different character types. As desirable as ripping that out to simplify the code might sound, don't.

Dacy answered 11/6, 2011 at 14:7 Comment(0)

And I predict that someday, although probably not before the year 2020, Windows will add UTF-8 support, simply by adding U versions of all the API functions alongside A and W, plus the same kind of macro-dispatch hack in the headers. The 8-bit A functions are just a translation layer over the native W (UTF-16) functions. I bet they could generate a U layer semi-automatically from the A layer.

Once they've been teased enough, long enough, about their '20th century' Unicode support...

They'll still manage to make it awkward to write, ugly to read and non-portable by default, by using carefully chosen macros and default Visual Studio settings.

Garget answered 13/9, 2012 at 1:39 Comment(0)
