What is Microsoft using as the data type for Unicode Strings?
I am in the process of learning C++ and came across an article on the MSDN here:

http://msdn.microsoft.com/en-us/magazine/dd861344.aspx

In the first code example the one line of code which my question relates to is the following:

VERIFY(SetWindowText(L"Direct2D Sample"));

More specifically, my question relates to that L prefix. I had a little read up on it, and correct me if I am wrong :-), but this is to allow for Unicode strings, i.e. it makes the literal a wide-character string. During my reading I came across another article on Advanced String Techniques in C here http://www.flipcode.com/archives/Advanced_String_Techniques_in_C-Part_I_Unicode.shtml

It says there are a few options, including adding one of the following definitions before including the Windows headers:

#define UNICODE 

OR

#define _UNICODE

in C; again, point out if I am wrong, I appreciate your feedback. Further, it shows the data type suitable for these Unicode strings as:

wchar_t

It throws into the mix a macro and a kind of hybrid datatype, the macro being:

_TEXT(t)

which simply prefixes the string literal with the L, and the hybrid data type as

TCHAR 

Which, it points out, will allow for Unicode if the definition is there and ASCII if not. Now my question, or more of an assumption which I would like to confirm: would Microsoft use this TCHAR data type, which is more flexible, or is there any benefit to committing to wchar_t?

Also, when I say does Microsoft use this, I mean more specifically in, for example, the ATL and WTL libraries. Does any one of you have a preference or some advice regarding this?

Cheers,

Andrew

Hyaluronidase answered 27/8, 2009 at 10:45

Thanks for everyone's feedback! Appreciate it! :-) – Hyaluronidase

For all new software you should define UNICODE and use wchar_t directly. Using ANSI strings will come back to haunt you.

You should just use wchar_t and the wide versions of all the CRT functions (e.g. wcscmp instead of strcmp). The TEXT macros, TCHAR, etc. exist only for code that needs to work in both ANSI and UNICODE environments, which I feel code rarely needs to do.

When you create a new Windows application using Visual Studio, UNICODE is automatically defined and wchar_t works like a built-in type.

Brigham answered 27/8, 2009 at 10:51

Short answer: the hybrid infrastructure with the TCHAR type, the _TEXT() macro and the various _t* functions (_tcscpy comes to mind) is a throwback to the times when Microsoft had two platforms coexisting:

  1. The Windows NT line was based on the Unicode string representation.
  2. The Windows 95/98/ME line was based on the ANSI string representation.

String representation here means that all the Windows APIs that expected or returned strings to your app used one or the other representation for those strings. COM added even more confusion, as it was available on both platforms -- and expected Unicode strings on both!

In those old times you were encouraged to write "portable" code: you were instructed to use the hybrid infrastructure for your strings so that you could compile for both models just by defining/undefining UNICODE and/or _UNICODE for your app.

As the Windows 9x line is no longer relevant (for the vast majority of apps, anyway), you can safely ignore the ANSI world and use Unicode strings directly.

Beware though that Unicode has multiple representations today: as it is pointed out above the Unicode convention implied by wchar_t is the UCS-2 representation (all characters encoded in 16-bit words). There are other, widely used representations where this is not necessarily true.

Gisellegish answered 27/8, 2009 at 11:00

On Windows it's wchar_t with UTF-16 (2-byte code units) encoding.

Source : http://www.firstobject.com/wchar_t-string-on-linux-osx-windows.htm

Fleshpots answered 27/8, 2009 at 10:52
Don't agree. wchar_t is supposed to be fixed-width, and is in the Microsoft world. There is the old, defunct UCS-2, which is fixed-width. Both support a maximum of 65,536 characters. UTF-16 is a variable-width encoding, where each code unit is 2 bytes; characters are either 2 bytes or 4 bytes. This ensures UTF-16 can support 1,114,112 characters. Microsoft uses UCS-2 for wchar_t as far as I know. – Dowzall
@Dowzall you're mixing the concepts of code point and code unit. Even on Windows the 16-bit wchar_t type is perfectly capable of encoding everything representable with UTF-16. In fact, in one of the XP versions UTF-16 replaced UCS-2. Alas, for many purposes (e.g. on the file system) the difference between UCS-2 and UTF-16 is of no importance. Microsoft used to use UCS-2 as its default in initial versions of Windows NT, true; in fact this is one aspect where they were well ahead technologically. By now it's UTF-16. – Ohalloran
@Ohalloran: No, I am not mixing concepts. I chose my words with care. First, the ISO C++ standard requires wchar_t to be fixed-width, capable of representing every character; it says that. Second, Microsoft's wchar_t (effectively UCS-2) is 16-bit, it is fixed-width, and it is capable of representing a maximum of 65,535 characters, leaving 0 as the string terminator. Third, UTF-16 now comprises 1,112,064 characters. UTF-16 is a multi-byte character set: some characters are 16 bits, same as UCS-2, but some (those outside the Basic Multilingual Plane) are 32 bits, effectively two wchar_t units. How can wchar_t encode UTF-16? – Dowzall

TCHAR changes its type depending on whether UNICODE is defined, and should be used when you want code that you can compile for both UNICODE and non-UNICODE builds.

If you want to explicitly process UNICODE data only, then feel free to use wchar_t.

Lipson answered 27/8, 2009 at 10:52
