This is a popular question. It is important to understand what the question author is asking, and that it is different from what is likely the most common need. To discourage misuse of the code where it is not needed, I've answered the latter first.
Common Need
Every string has a character set and encoding. When you convert a System.String object to an array of System.Byte, you still have a character set and encoding. For most usages, you'd know which character set and encoding you need and .NET makes it simple to "copy with conversion." Just choose the appropriate Encoding class.
// using System.Text;
Encoding.UTF8.GetBytes(".NET String to byte array")
The conversion may need to handle cases where the target character set or encoding doesn't support a character that's in the source. You have some choices: exception, substitution, or skipping. The default policy is to substitute a '?'.
// using System.Text;
var text = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes("You win €100"));
// -> "You win ?100"
Clearly, conversions are not necessarily lossless!
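To make the three policies concrete, here is a minimal sketch written as top-level statements. It assumes the "us-ascii" encoding name is available on your runtime and uses the fallback classes from System.Text. The variable names are only illustrative.

```csharp
using System;
using System.Text;

string source = "You win €100";

// Substitution: the default policy of Encoding.ASCII replaces '€' with '?'.
byte[] substituted = Encoding.ASCII.GetBytes(source);
Console.WriteLine(Encoding.ASCII.GetString(substituted)); // You win ?100

// Skipping: an empty replacement string drops unsupported characters.
Encoding skipping = Encoding.GetEncoding(
    "us-ascii",
    new EncoderReplacementFallback(string.Empty),
    new DecoderReplacementFallback(string.Empty));
string skipped = skipping.GetString(skipping.GetBytes(source));
Console.WriteLine(skipped); // You win 100

// Exception: fail loudly instead of silently losing data.
Encoding strict = Encoding.GetEncoding(
    "us-ascii",
    EncoderFallback.ExceptionFallback,
    DecoderFallback.ExceptionFallback);
bool threw = false;
try
{
    strict.GetBytes(source);
}
catch (EncoderFallbackException)
{
    threw = true;
}
Console.WriteLine(threw); // True
```

Which policy is appropriate depends on whether silent data loss is acceptable for your use case.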
Note: For System.String, the source character set is Unicode.
The only confusing thing is that .NET uses the name of a character set for the name of one particular encoding of that character set. Encoding.Unicode should be called Encoding.UTF16.
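A small sketch of what that naming means in practice: encoding the same one-character string with several Encoding classes shows that Encoding.Unicode produces little-endian UTF-16. The variable names are only illustrative.

```csharp
using System;
using System.Text;

// 'A' is U+0041, a single UTF-16 code unit.
byte[] little = Encoding.Unicode.GetBytes("A");          // UTF-16, little-endian
byte[] big = Encoding.BigEndianUnicode.GetBytes("A");    // UTF-16, big-endian
byte[] utf8 = Encoding.UTF8.GetBytes("A");
byte[] utf32 = Encoding.UTF32.GetBytes("A");             // UTF-32, little-endian

Console.WriteLine(BitConverter.ToString(little)); // 41-00
Console.WriteLine(BitConverter.ToString(big));    // 00-41
Console.WriteLine(BitConverter.ToString(utf8));   // 41
Console.WriteLine(BitConverter.ToString(utf32));  // 41-00-00-00
```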
That's it for most usages. If that's what you need, stop reading here. See the fun Joel Spolsky article if you don't understand what encoding is.
Specific Need
Now, what the question author asks is, "Every string is stored as an array of bytes, right? Why can't I simply have those bytes?"
He doesn't want any conversion.
From the C# spec:
Character and string processing in C# uses Unicode encoding. The char type represents a UTF-16 code unit, and the string type represents a sequence of UTF-16 code units.
So, we know that if we ask for the null conversion (i.e., from UTF-16 to UTF-16), we'll get the desired result:
Encoding.Unicode.GetBytes(".NET String to byte array")
But to avoid the mention of encodings, we must do it another way. If an intermediate data type is acceptable, there is a conceptual shortcut for this:
".NET String to byte array".ToCharArray()
That doesn't get us the desired datatype but Mehrdad's answer shows how to convert this Char array to a Byte array using BlockCopy. However, this copies the string twice! And, it too explicitly uses encoding-specific code: the datatype System.Char.
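For reference, the two-copy approach can be sketched roughly as follows. This is an approximation of the idea, not a reproduction of the referenced answer, and the variable names are illustrative.

```csharp
using System;

string s = ".NET String to byte array";

// First copy: string -> char[]
char[] chars = s.ToCharArray();

// Second copy: char[] -> byte[]; sizeof(char) is 2 because char is a UTF-16 code unit.
byte[] bytes = new byte[chars.Length * sizeof(char)];
Buffer.BlockCopy(chars, 0, bytes, 0, bytes.Length);

Console.WriteLine(bytes.Length); // 50
```

Note that sizeof(char) is exactly the encoding-specific knowledge mentioned above: the code only works because char is known to be two bytes.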
The only way to get to the actual bytes the String is stored in is to use a pointer. The fixed statement allows taking the address of values. From the C# spec:
[For] an expression of type string, ... the initializer computes the
address of the first character in the string.
To do so, the compiler writes code skipping over the other parts of the string object with RuntimeHelpers.OffsetToStringData. So, to get the raw bytes, just create a pointer to the string and copy the number of bytes needed.
    // using System.Runtime.InteropServices;
    unsafe byte[] GetRawBytes(String s)
    {
        if (s == null) return null;
        var codeunitCount = s.Length;
        /* We know that String is a sequence of UTF-16 code units
           and such code units are 2 bytes */
        var byteCount = codeunitCount * 2;
        var bytes = new byte[byteCount];
        // Pin the string so the GC cannot move it while we copy from it
        fixed (void* pRaw = s)
        {
            Marshal.Copy((IntPtr)pRaw, bytes, 0, byteCount);
        }
        return bytes;
    }
As @CodesInChaos pointed out, the result depends on the endianness of the machine. But the question author is not concerned with that.
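To see the endianness dependence, here is a hedged sketch that reads the same raw code units without an unsafe block. It assumes a runtime that provides Span<T> and MemoryMarshal.AsBytes (for example .NET Core 2.1 or later), which is not something the original question relied on. The variable names are illustrative.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

string s = "A€";

// Reinterpret the string's UTF-16 code units as bytes, in machine byte order.
byte[] raw = MemoryMarshal.AsBytes(s.AsSpan()).ToArray();

// Encoding.Unicode is always little-endian, so pick the encoding that
// matches this machine's byte order before comparing.
byte[] viaEncoding = BitConverter.IsLittleEndian
    ? Encoding.Unicode.GetBytes(s)
    : Encoding.BigEndianUnicode.GetBytes(s);

Console.WriteLine(BitConverter.ToString(raw));
Console.WriteLine(raw.AsSpan().SequenceEqual(viaEncoding)); // True
```

On a little-endian machine the raw bytes match Encoding.Unicode.GetBytes; on a big-endian machine they would match Encoding.BigEndianUnicode.GetBytes instead.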
char is a struct that just happens to currently store values as a 16-bit number (UTF-16). What you're really asking (get the character bytes) isn't theoretically possible because it doesn't theoretically exist. A char or string has no Encoding by definition. What if the memory representation changed to UTF-32? Your "get the bytes, shove them back" would fail due to Encoding because you avoided Encoding. So "Why this dependency on encoding?!!!" Depend on Encoding so your code is dependable. – Marcum
System.Text.Encoding.Unicode.GetBytes(); is doing some kind of expensive conversion that you want to avoid? If so, your assumption is wrong. – Rabbit
var array1 = yourString.ToCharArray(); If for some reason you want the code units as UInt16 values, do var array2 = Array.ConvertAll<char, ushort>(array1, x => x);. That is a ushort[] there. – Flameproof