Does a wchar_t string end with a single null byte or two?

4

12

I don't understand how a wchar_t string ends, and I can't find much information about it.

If it ends with a single null byte, how does the code know that it has not reached the end of the string yet when a value such as "009A", which represents a Unicode symbol, contains a zero byte?

Or does it end with two null bytes? I am not sure and would like confirmation.

Spank answered 6/9, 2012 at 18:4 Comment(3)
In C++. I didn't know wchar existed anywhere else.Hydrograph
Somewhat related: Making a WCHAR null terminated. Might be some hints in there as to how to approach this.Janayjanaya
In C++, wchar_t (not wchar) is a predefined type. In C, wchar_t is a typedef defined in <stddef.h>. In both cases, the size is implementation-defined; on my system its size is 4 bytes (32 bits).Symbology
13

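Since a wide string is an array of wide characters, it cannot end in a one-byte NUL. The terminator is a null wide character, which has the same size as any other wchar_t: two bytes when wchar_t is 16 bits wide, four bytes when it is 32 bits wide. (Arrays in C/C++ can only hold elements of one type, and therefore of one size.)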

Also, for standard ASCII characters, each wide character contains one or three zero bytes (depending on whether wchar_t is 16 or 32 bits wide), because only characters above U+00FF have non-zero upper bytes. For simplicity, I assume 16-bit and little-endian, so the zero byte comes second in each pair:

HELLO is 48 00 45 00 4C 00 4C 00 4F 00 00 00 (bytes in hexadecimal; the final 00 00 is the terminator)
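The byte layout of a wide string can be inspected directly. The following is a minimal sketch, not part of the original answer; `dump_wide_bytes` is a hypothetical helper name, and it assumes 8-bit bytes. Bytes are formatted in hexadecimal, so H (decimal 72) appears as 48.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: writes the bytes of wide string s, terminator
   included, into out as space-separated hex pairs.
   Returns the number of bytes the wide string occupies. */
size_t dump_wide_bytes(const wchar_t *s, char *out, size_t out_size)
{
    assert(s != NULL && out != NULL);

    /* +1 so the null wide character is part of the dump */
    size_t nbytes = (wcslen(s) + 1) * sizeof(wchar_t);
    const unsigned char *p = (const unsigned char *)s;
    size_t used = 0;

    if (out_size > 0)
        out[0] = '\0';

    /* each byte needs at most 3 characters plus the string terminator */
    for (size_t i = 0; i < nbytes && used + 4 <= out_size; i++)
        used += (size_t)snprintf(out + used, out_size - used,
                                 "%s%02X", i ? " " : "", (unsigned)p[i]);
    return nbytes;
}
```

Calling `dump_wide_bytes(L"HELLO", buf, sizeof buf)` on a little-endian system with a 16-bit wchar_t (Windows, for example) fills `buf` with 48 00 45 00 4C 00 4C 00 4F 00 00 00. With a 32-bit wchar_t each character occupies four bytes, and so does the terminator.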
Cocainism answered 6/9, 2012 at 18:13 Comment(11)
Err, so if I access an array of wchar like this: arr[0] = 0; will it set the first and second byte to zero automatically?Hydrograph
@Kosmos (If this is not yet clear, I suggest you read a good tutorial on C pointers and arrays!)Cocainism
Is there any way that wchar can be converted to char? I am reversing a Chinese app, but as far as I can see they are using char* for text manipulation. Could it be just a wchar array converted to a char* of double size?Hydrograph
@Kosmos There are libraries with which you can convert UTF-16 (wide strings) to UTF-8.Cocainism
@H2CO3: On my system, sizeof (wchar_t) == 4. You also seem to be making assumptions about endianness.Symbology
@KeithThompson yup, that sizeof is perfectly fine. And no, I am not making assumptions about endianness - whether it be little or big endian, it's easier to conceive the essentials if I write all this using big endian notation...Cocainism
I am trying to scan a Chinese exe for text strings; for that I need to know how many bytes are at the end: two null bytes or four.Hydrograph
@H2CO3: "only extended characters start by a non-zero first byte" -- that assumes big-endian (with your recent edit, you've made the assumption explicit).Symbology
@KeithThompson yes, sorry, you're correct - modern processor architectures that count use the counterintuitive little-endian notation, so that's why I was confusing them...Cocainism
Since this question is about the double-byte null at the end of the string, it's very strange that your sample string doesn't demonstrate that.Willner
HELLO is 48 00 45 00 4C 00 4C 00 4F 00 (hex) in little-endian byte order. The "end" in "endian" actually means the "front end" of the sequence: "In big-endian format, the most significant byte is stored first (has the lowest address) or sent first, then the following bytes are stored or sent in decreasing significance order, with the least significant byte stored last (having the highest address) or sent last." en.wikipedia.org/wiki/EndiannessMamba
5

Here you can read a bit more about wide characters: http://en.wikipedia.org/wiki/Wide_character#Size_of_a_wide_character

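The terminator is L'\0', a null wide character. With a 16-bit wchar_t that is a 16-bit zero, which looks like two 8-bit null chars; with a 32-bit wchar_t it is four zero bytes.

Remember that "009A" is a single wchar_t with a non-zero value, so it is not a null wide character.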
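When scanning raw bytes for wide strings, the point above means the terminator test has to look at a whole code unit, not at individual bytes. A minimal sketch under stated assumptions: `wide_units_before_terminator` is a hypothetical name, the buffer is assumed to start on a code-unit boundary, and `unit_size` is whatever width the producing platform used for wchar_t (2 or 4).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the number of code units before the first
   all-zero unit in buf, or (size_t)-1 if none is found within len bytes.
   unit_size is the assumed width of one wide character in bytes. */
size_t wide_units_before_terminator(const unsigned char *buf, size_t len,
                                    size_t unit_size)
{
    assert(buf != NULL && unit_size > 0);

    for (size_t off = 0; off + unit_size <= len; off += unit_size) {
        int all_zero = 1;
        for (size_t k = 0; k < unit_size; k++) {
            if (buf[off + k] != 0) {   /* any non-zero byte: not a terminator */
                all_zero = 0;
                break;
            }
        }
        if (all_zero)
            return off / unit_size;
    }
    return (size_t)-1;
}
```

For the bytes 00 41 00 9A 00 00 (big-endian 16-bit units) the function returns 2: the units 0041 and 009A each contain a zero byte but are not terminators, and only the final 0000 ends the string.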

Baksheesh answered 6/9, 2012 at 18:12 Comment(0)
5

In C (quoting the N1570 draft, section 7.1.1):

A wide string is a contiguous sequence of wide characters terminated by and including the first null wide character.

where a "wide character" is a value of type wchar_t, which is defined in <stddef.h> as an integer type.

I can't find a definition of "wide string" in the N3337 draft of the C++ standard, but it should be similar. One minor difference is that wchar_t is a typedef in C, and a built-in type (whose name is a keyword) in C++. But since C++ shares most of the C library, including functions that act on wide strings, it's safe to assume that the C and C++ definitions are compatible. (If someone can find something more concrete in the C++ standard, please comment or edit this paragraph.)

In both C and C++, the size of a wchar_t is implementation-defined. It's typically either 2 or 4 bytes (16 or 32 bits, unless you're on a very exotic system with bytes bigger than 8 bits). A wide string is a sequence of wide characters (wchar_t values), terminated by a null wide character. The terminating wide character will have the same size as any other wide character, typically either 2 or 4 bytes.

In particular, given that wchar_t is bigger than char, a single null byte does not terminate a wide string.
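A small sketch can make this concrete by comparing a byte-oriented scan with a wide-character scan of the same string. The two wrapper names are hypothetical, and the result of the byte scan depends on the platform's byte order.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Bytes a byte-oriented scan sees before the first zero byte.
   Reading the wide string through a char pointer is allowed in C. */
size_t bytes_before_first_zero_byte(const wchar_t *s)
{
    assert(s != NULL);
    return strlen((const char *)s);
}

/* Wide characters before the null wide character. */
size_t wide_chars_before_terminator(const wchar_t *s)
{
    assert(s != NULL);
    return wcslen(s);
}
```

For L"HELLO", `wide_chars_before_terminator` returns 5, while `bytes_before_first_zero_byte` stops at the zero byte inside the first character: it returns 1 on a little-endian system and 0 on a big-endian one.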

It's also worth noting that byte order is implementation-defined. A wide character with the value 0x1234, when viewed as a sequence of 8-bit bytes, might appear as any of:

  • 0x12, 0x34
  • 0x34, 0x12
  • 0x00, 0x00, 0x12, 0x34
  • 0x34, 0x12, 0x00, 0x00

And those aren't the only possibilities.
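To see which of these layouts a particular implementation uses, the object representation of a single wide character can be copied out and examined. This is a sketch only; `wide_char_bytes` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: copies the object representation of one wide
   character into out, which must have room for sizeof(wchar_t) bytes.
   Returns sizeof(wchar_t). */
size_t wide_char_bytes(wchar_t wc, unsigned char *out)
{
    assert(out != NULL);
    memcpy(out, &wc, sizeof wc);   /* byte order is whatever the platform uses */
    return sizeof wc;
}
```

Passing `(wchar_t)0x1234` and printing the resulting bytes in order shows both the width of wchar_t and the byte order on the system at hand.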

Symbology answered 6/9, 2012 at 19:36 Comment(0)
1

If you declare

WCHAR tempWchar[BUFFER_SIZE];

you can set every element to the null wide character (L'\0', not the pointer constant NULL):

for (int i = 0; i < BUFFER_SIZE; i++)
    tempWchar[i] = L'\0';
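The same effect is available from the standard library. A minimal sketch, assuming the standard wchar_t type rather than the Windows WCHAR typedef; the BUFFER_SIZE value and the name `clear_wide_buffer` are hypothetical, since the answer gives neither.

```c
#include <assert.h>
#include <wchar.h>

#define BUFFER_SIZE 64  /* hypothetical size; the answer does not give one */

/* Hypothetical helper: sets every element of buf to the null wide character. */
void clear_wide_buffer(wchar_t *buf, size_t count)
{
    assert(buf != NULL);
    wmemset(buf, L'\0', count);   /* wide-character counterpart of memset */
}
```

When the array is being defined rather than reused, `wchar_t tempWchar[BUFFER_SIZE] = {0};` zero-initializes every element without a loop or a call.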
Fisken answered 2/11, 2016 at 19:43 Comment(0)
