String to byte array in UTF-8?

How to convert a WideString (or other long string) to byte array in UTF-8?

Tenacious answered 8/3, 2011 at 14:1 Comment(0)
L
13

A function like this will do what you need:

function UTF8Bytes(const s: UTF8String): TBytes;
begin
  Assert(StringElementSize(s)=1);
  SetLength(Result, Length(s));
  if Length(Result)>0 then
    Move(s[1], Result[0], Length(s));
end;

You can call it with any type of string, and the RTL will convert from the encoding of the string that is passed to UTF-8. So don't be tricked into thinking you must convert to UTF-8 before calling; just pass in any string and let the RTL do the work.

After that it's a fairly standard array copy. Note the assertion that explicitly calls out the assumption on string element size for a UTF-8 encoded string.
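
As a rough usage sketch, assuming Delphi 2009 or later (the program name and the sample text are made up for illustration), passing a WideString straight to the function lets the compiler insert the UTF-16 to UTF-8 conversion at the call site:

```pascal
program Utf8BytesDemo; // hypothetical demo name; Delphi 2009+ only

{$APPTYPE CONSOLE}

uses
  SysUtils; // TBytes

function UTF8Bytes(const s: UTF8String): TBytes;
begin
  Assert(StringElementSize(s)=1);
  SetLength(Result, Length(s));
  if Length(Result)>0 then
    Move(s[1], Result[0], Length(s));
end;

var
  ws: WideString;
  b: TBytes;
begin
  // Build "naive" with a diaeresis (U+00EF) without relying on the source file encoding.
  ws := 'na';
  ws := ws + WideChar($00EF);
  ws := ws + 've';

  b := UTF8Bytes(ws);  // the WideString is converted to UTF-8 when the argument is passed
  Writeln(Length(b));  // 6: four ASCII bytes plus two bytes for U+00EF
end.
```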

If you want the result to include a zero terminator, write it like this:

function UTF8Bytes(const s: UTF8String): TBytes;
begin
  Assert(StringElementSize(s)=1);
  SetLength(Result, Length(s)+1);
  if Length(s)>0 then // Result always has at least one element here, so test the input instead
    Move(s[1], Result[0], Length(s));
  Result[high(Result)] := 0;
end;
Lumbago answered 8/3, 2011 at 14:20 Comment(5)
@Cosmin No it will not. That's the thing about assertions! - Lumbago
One question: what unit do I have to add to use StringElementSize()? (lazarus) Sorry for such questions, I'm a newbie. - Tenacious
@Tenacious What does your "lazarus" statement mean? You tagged the question Delphi. In Delphi it's in system.pas and so automatically used by all units. - Lumbago
@Mariusz: You can remove the entire Assert... line. But since you tagged your question Delphi, and not free-pascal, @David's answer applies to Delphi, and not Free Pascal. But the code above might work in Free Pascal, too. I don't know. Try it. - Fixative
It is D2009+ specific code, and thus will not work on FPC, which follows pre-D2009 semantics. Passing a widestring (see original question) to a "UTF8string" will change it to the local encoding (NOT UTF-8 like in D2009+), and thus garble the string. FPC has special documented functions for this; see the separate answer. - Spiritualist
E
9

You can use TEncoding.UTF8.GetBytes, declared in SysUtils.pas.
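
A minimal sketch of that call, assuming Delphi 2009 or later (the program name and sample text are illustrative only):

```pascal
program EncodingDemo; // hypothetical demo name; Delphi 2009+ only

{$APPTYPE CONSOLE}

uses
  SysUtils; // TEncoding, TBytes

var
  s: string; // UnicodeString in Delphi 2009+
  b: TBytes;
begin
  s := 'na';
  s := s + WideChar($00EF); // U+00EF, a non-ASCII character
  s := s + 've';

  // GetBytes returns only the encoded characters: no BOM and no trailing zero.
  b := TEncoding.UTF8.GetBytes(s);
  Writeln(Length(b)); // 6: four ASCII bytes plus two bytes for U+00EF
end.
```

If the receiver expects a terminator or a length prefix, that has to be added separately.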

Estevan answered 8/3, 2011 at 14:53 Comment(1)
Note that if the input string is already encoded as UTF-8, GetBytes will be very wasteful. The compiler will convert the input string to UnicodeString, since that's the only string argument GetBytes allows, and then GetBytes will convert the characters back to UTF-8 to generate its result. - Isoagglutinin
I
5

If you're using Delphi 2009 or later (the Unicode versions), converting a WideString to a UTF8String is a simple assignment statement:

var
  ws: WideString;
  u8s: UTF8String;

u8s := ws;

The compiler will call the right library function to do the conversion because it knows that values of type UTF8String have a "code page" of CP_UTF8.

In Delphi 7 and later, you can use the provided library function Utf8Encode. For even earlier versions, you can get that function from other libraries, such as the JCL.
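
A small sketch of the Utf8Encode route, written so that it should also compile on pre-Unicode Delphi versions (the program name and sample text are illustrative only):

```pascal
program Utf8EncodeDemo; // hypothetical demo name

{$APPTYPE CONSOLE}

var
  ws: WideString;
  u8s: UTF8String;
begin
  ws := 'na';
  ws := ws + WideChar($00EF); // U+00EF, a non-ASCII character
  ws := ws + 've';

  // Utf8Encode lives in the System unit, so no uses clause is needed.
  u8s := Utf8Encode(ws);
  Writeln(Length(u8s)); // 6: Length counts bytes here, not characters
end.
```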

You can also write your own conversion function using the Windows API:

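// Requires the Windows and SysUtils units.
function CustomUtf8Encode(const ws: WideString): UTF8String;
var
  n: Integer;
begin
  // WideCharToMultiByte fails when the input length is 0, so handle '' up front.
  if ws = '' then
  begin
    Result := '';
    Exit;
  end;
  // First call: ask how many bytes the UTF-8 form needs.
  n := WideCharToMultiByte(CP_UTF8, 0, PWideChar(ws), Length(ws), nil, 0, nil, nil);
  Win32Check(n <> 0);
  SetLength(Result, n);
  // Second call: convert into the buffer we just allocated.
  n := WideCharToMultiByte(CP_UTF8, 0, PWideChar(ws), Length(ws), PAnsiChar(Result), n, nil, nil);
  Win32Check(n = Length(Result));
end;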

A lot of the time, you can simply use a UTF8String as an array, but if you really need a byte array, you can use David's and Cosmin's functions. If you're writing your own character-conversion function, you can skip the UTF8String and go directly to a byte array; just change the return type to TBytes or array of Byte. (You may also wish to increase the length by one, if you want the array to be null-terminated. SetLength will do that to the string implicitly, but not to an array.)
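
One way the byte-array variant might look is sketched below. The function name WideStringToUtf8Bytes and its NullTerminate parameter are hypothetical, and the sketch assumes Windows and a Delphi version whose SysUtils declares TBytes:

```pascal
program Utf8BytesApiDemo; // hypothetical demo name; Windows only

{$APPTYPE CONSOLE}

uses
  Windows, SysUtils; // WideCharToMultiByte, CP_UTF8; Win32Check, TBytes

// Hypothetical helper: converts straight into a byte array,
// optionally reserving one extra byte for a zero terminator.
function WideStringToUtf8Bytes(const ws: WideString; NullTerminate: Boolean): TBytes;
var
  n: Integer;
begin
  n := 0;
  if ws <> '' then
  begin
    // First call: ask how many bytes the UTF-8 form needs.
    n := WideCharToMultiByte(CP_UTF8, 0, PWideChar(ws), Length(ws), nil, 0, nil, nil);
    Win32Check(n <> 0);
  end;

  if NullTerminate then
    SetLength(Result, n + 1)
  else
    SetLength(Result, n);

  if n > 0 then
    // Second call: convert directly into the array.
    Win32Check(WideCharToMultiByte(CP_UTF8, 0, PWideChar(ws), Length(ws),
      PAnsiChar(@Result[0]), n, nil, nil) = n);

  if NullTerminate then
    Result[n] := 0; // arrays get no implicit terminator, so write it explicitly
end;

var
  ws: WideString;
  b: TBytes;
begin
  ws := 'na';
  ws := ws + WideChar($00EF); // U+00EF, a non-ASCII character
  ws := ws + 've';

  b := WideStringToUtf8Bytes(ws, True);
  Writeln(Length(b)); // 7: six UTF-8 bytes plus the terminator
end.
```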

If you have some other string type that's neither WideString, UnicodeString, nor UTF8String, then the way to convert it to UTF-8 is to first convert it to WideString or UnicodeString, and then convert that to UTF-8.

Isoagglutinin answered 8/3, 2011 at 15:1 Comment(0)
D
4
uses
  Windows, SysUtils; // CopyMemory is declared in Windows, TBytes in SysUtils

var
  S: UTF8String;
  B: TBytes;

begin
  S := 'Șase sași în șase saci';
  SetLength(B, Length(S)); // Length(S) = 26 bytes for this 22-character string.
  CopyMemory(@B[0], @S[1], Length(S));
end.

Depending on what you need the bytes for, you might want to include a NULL terminator.

For production code make sure you test for empty string. Adding the 3-4 LOC required would just make the sample harder to read.

Dockery answered 8/3, 2011 at 14:9 Comment(5)
The string is not empty. It contains the value 'Șase sași în șase saci' - Dockery
+1. Not everyone (to say the least!) knows how the Length function really works! - Fixative
@Cosmin I can see that the string is not empty. I just have a feeling that the OP may be interested in text other than 'Șase sași în șase saci'. - Lumbago
@Cosmin, @David: Surely @Cosmin was joking! (Indeed, David's point is very good.) - Fixative
I want to send the bytes to my Java app through sockets. - Tenacious
P
1

I have the following two routines (source code can be downloaded here - http://www.csinnovations.com/framework_utilities.htm):

function CsiBytesToStr(const pInData: TByteDynArray; pStringEncoding: TECsiStringEncoding; pIncludesBom: Boolean): string;

function CsiStrToBytes(const pInStr: string; pStringEncoding: TECsiStringEncoding; pIncludeBom: Boolean): TByteDynArray;

Proverb answered 8/3, 2011 at 23:51 Comment(0)
S
1

widestring -> UTF8:

http://www.freepascal.org/docs-html/rtl/system/utf8encode.html

the opposite:

http://www.freepascal.org/docs-html/rtl/system/utf8decode.html

Note that assigning a widestring to an ansistring in a pre D2009 system (including current Free Pascal) will convert to the local ansi encoding, garbling characters.

For the TBytes part, see the remark of Rob Kennedy above.

Spiritualist answered 9/3, 2011 at 12:57 Comment(0)
