How to convert a WideString (or other long string) to byte array in UTF-8?
A function like this will do what you need:
function UTF8Bytes(const s: UTF8String): TBytes;
begin
Assert(StringElementSize(s)=1);
SetLength(Result, Length(s));
if Length(Result)>0 then
Move(s[1], Result[0], Length(s));
end;
You can call it with any type of string and the RTL will convert from the encoding of the string that is passed to UTF-8. So don't be tricked into thinking you must convert to UTF-8 before calling, just pass in any string and let the RTL do the work.
After that it's a fairly standard array copy. Note the assertion that explicitly calls out the assumption on string element size for a UTF-8 encoded string.
If you want to get the zero-terminator you would write it so:
function UTF8Bytes(const s: UTF8String): TBytes;
begin
Assert(StringElementSize(s)=1);
SetLength(Result, Length(s)+1);
if Length(Result)>0 then
Move(s[1], Result[0], Length(s));
Result[high(Result)] := 0;
end;
Assert...
line. But since you tagged your question Delphi
, and not free-pascal
, @David's answer applies to Delphi, and not Free Pascal. But the code above might work in Free Pascal, too. I don't know. Try it. –
Fixative You can use TEncoding.UTF8.GetBytes
in SysUtils.pas
GetBytes
will be very wasteful. The compiler will convert the input string to UnicodeString since that's the only string argument GetBytes
allows, and the GetBytes
will convert the characters back to UTF-8 to generate its result. –
Isoagglutinin If you're using Delphi 2009 or later (the Unicode versions), converting a WideString to a UTF8String is a simple assignment statement:
var
ws: WideString;
u8s: UTF8String;
u8s := ws;
The compiler will call the right library function to do the conversion because it knows that values of type UTF8String have a "code page" of CP_UTF8
.
In Delphi 7 and later, you can use the provided library function Utf8Encode
. For even earlier versions, you can get that function from other libraries, such as the JCL.
You can also write your own conversion function using the Windows API:
function CustomUtf8Encode(const ws: WideString): UTF8String;
var
n: Integer;
begin
n := WideCharToMultiByte(cp_UTF8, 0, PWideChar(ws), Length(ws), nil, 0, nil, nil);
Win32Check(n <> 0);
SetLength(Result, n);
n := WideCharToMultiByte(cp_UTF8, 0, PWideChar(ws), Length(ws), PAnsiChar(Result), n, nil, nil);
Win32Check(n = Length(Result));
end;
A lot of the time, you can simply use a UTF8String as an array, but if you really need a byte array, you can use David's and Cosmin's functions. If you're writing your own character-conversion function, you can skip the UTF8String and go directly to a byte array; just change the return type to TBytes
or array of Byte
. (You may also wish to increase the length by one, if you want the array to be null-terminated. SetLength will do that to the string implicitly, but to an array.)
If you have some other string type that's neither WideString, UnicodeString, nor UTF8String, then the way to convert it to UTF-8 is to first convert it to WideString or UnicodeString, and then convert it back to UTF-8.
var S: UTF8String;
B: TBytes;
begin
S := 'Șase sași în șase saci';
SetLength(B, Length(S)); // Length(s) = 26 for this 22 char string.
CopyMemory(@B[0], @S[1], Length(S));
end.
Depending on what you need the bytes for, you might want to include an NULL terminator.
For production code make sure you test for empty string. Adding the 3-4 LOC required would just make the sample harder to read.
'Șase sași în șase saci'
–
Dockery Length
function really works! –
Fixative 'Șase sași în șase saci'
. –
Lumbago I have the following two routines (source code can be downloaded here - http://www.csinnovations.com/framework_utilities.htm):
function CsiBytesToStr(const pInData: TByteDynArray; pStringEncoding: TECsiStringEncoding; pIncludesBom: Boolean): string;
function CsiStrToBytes(const pInStr: string; pStringEncoding: TECsiStringEncoding; pIncludeBom: Boolean): TByteDynArray;
widestring -> UTF8:
http://www.freepascal.org/docs-html/rtl/system/utf8decode.html
the opposite:
http://www.freepascal.org/docs-html/rtl/system/utf8encode.html
Note that assigning a widestring to an ansistring in a pre D2009 system (including current Free Pascal) will convert to the local ansi encoding, garbling characters.
For the TBytes part, see the remark of Rob Kennedy above.
© 2022 - 2024 — McMap. All rights reserved.