Length() vs Sizeof() on Unicode strings

Quoting the Delphi XE8 help:

For single-byte and multibyte strings, Length returns the number of bytes used by the string. Example for UTF-8:

   Writeln(Length(Utf8String('1¢'))); // displays 3

For Unicode (WideString) strings, Length returns the number of bytes divided by two.

This raises some important questions:

  1. Why is there a difference in handling at all?
  2. Why doesn't Length() do what it is expected to do and return the length of the parameter (as in, the count of elements), instead of giving the size in bytes in some cases?
  3. Why does it state that it divides the result by 2 for Unicode (UTF-16) strings? AFAIK a UTF-16 character can take up to 4 bytes, and thus this will give incorrect results.
Boni asked 3/6, 2015 at 12:13 Comment(3)
Try LenInBytes := Length(UTF8Encode('строка')), or var u8: UTF8String; u8 := 'строка'; I := Length(u8) - without a typecast. – Variole
I wrote this because it's literally written in the name of the encoding how much it takes to encode a character. I just confused it with another encoding. – Boni
I reverted all your edits. Mostly because I don't particularly want to have to keep updating my answer to match! ;-) Anyway, I think it's clear that you are on top of this now. The question is a good one. Can't we leave it as is? – Sosa

Length returns the number of elements when considering the string as an array.

  • For strings with 8-bit element types (ANSI, UTF-8), Length gives you the number of bytes, since the number of bytes is the same as the number of elements.
  • For strings with 16-bit elements (UTF-16), Length is half the number of bytes, because each element is 2 bytes wide.

Your string '1¢' has two code points, but the second code point requires two bytes to encode it in UTF-8. Hence Length(Utf8String('1¢')) evaluates to three.
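
To see the same string through both element sizes, here is a quick console sketch (it assumes the source file is saved so that the '1¢' literal survives intact):

var
  U8: UTF8String;
  U16: string; // UnicodeString, i.e. UTF-16
begin
  U8 := '1¢';
  U16 := '1¢';
  Writeln(Length(U8));  // 3: '1' is one byte, '¢' needs two bytes in UTF-8
  Writeln(Length(U16)); // 2: each of the two code points fits in one 16-bit element
end.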

You mention SizeOf in the question title. Passing a string variable to SizeOf will always return the size of a pointer, since a string variable is, under the hood, just a pointer.
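
A small sketch of that, for emphasis:

var
  S: string;
  A: AnsiString;
begin
  Writeln(SizeOf(S)); // 4 on a 32-bit target, 8 on a 64-bit target: just the pointer
  Writeln(SizeOf(A)); // the same again: SizeOf never looks at the heap-allocated payload
end.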

To your specific questions:

Why is there a difference in handling at all?

There is only a difference if you think of Length as relating to bytes. But that's the wrong way to think about it. Length always returns an element count, and when viewed that way, the behaviour is uniform across all string types, and indeed across all array types.
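
The same rule applies to plain dynamic arrays, as this sketch shows (TBytes and TArray<T> are declared in System.SysUtils):

uses System.SysUtils;
var
  Bytes: TBytes;       // array of Byte
  Words: TArray<Word>; // array of Word
begin
  SetLength(Bytes, 10);
  SetLength(Words, 10);
  Writeln(Length(Bytes)); // 10 elements, occupying 10 bytes
  Writeln(Length(Words)); // 10 elements, occupying 20 bytes
end.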

Why doesn't Length() do what it is expected to do and return the length of the parameter (as in, the count of elements), instead of giving the size in bytes in some cases?

It does always return the element count. It just so happens that when the element size is a single byte, the element count and the byte count are the same. In fact, the documentation that you refer to also contains the following just above the excerpt that you provided: "Returns the number of characters in a string or of elements in an array." That is the key text. The excerpt that you included is meant as an illustration of the implications of that sentence.

Why does it state that it divides the result by 2 for Unicode (UTF-16) strings? AFAIK a UTF-16 character can take up to 4 bytes, and thus this will give incorrect results.

UTF-16 character elements are always 16 bits wide. However, some Unicode code points require two character elements to encode. These pairs of character elements are known as surrogate pairs.
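
For instance, U+1F600 (an emoji outside the Basic Multilingual Plane) needs a surrogate pair, so a string holding just that one code point has two elements. A sketch:

var
  S: string;
begin
  S := #$D83D#$DE00;  // U+1F600 encoded as a UTF-16 surrogate pair
  Writeln(Length(S)); // 2: two 16-bit character elements, one code point
end.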


You are hoping, I think, that Length will return the number of code points in a string. But it doesn't. It returns the number of character elements. And for variable length encodings, the number of code points is not necessarily the same as the number of character elements. If your string was encoded as UTF-32 then the number of code points would be the same as the number of character elements since UTF-32 is a constant sized encoding.

A quick way to count the code points is to scan through the string checking for surrogate pairs. When you encounter a surrogate pair, count one code point. Otherwise, when you encounter a character element that is not part of a surrogate pair, count one code point. In pseudo-code:

// Requires System.Character in the uses clause for the Char.IsSurrogate helper.
N := 0;
for C in S do
  if C.IsSurrogate then
    inc(N)     // each half of a surrogate pair contributes 1
  else
    inc(N, 2); // an ordinary (BMP) element contributes 2
CodePointCount := N div 2;
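
As a quick check, running that loop over the surrogate-pair string from the earlier sketch gives one code point:

S := #$D83D#$DE00; // one code point stored as two surrogate elements
// both elements report IsSurrogate = True, so N ends up as 2,
// and CodePointCount = 2 div 2 = 1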

Another point to make is that the code point count is not the same as the visible character count. Some code points are combining characters and are combined with their neighbouring code points to form a single visible character or glyph.
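
For example, an 'e' followed by U+0301 (COMBINING ACUTE ACCENT) is two code points and two UTF-16 elements, yet it is normally rendered as the single glyph 'é'. A sketch:

var
  S: string;
begin
  S := 'e'#$0301;     // LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
  Writeln(Length(S)); // 2 elements, 2 code points, but typically one visible character
end.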

Finally, if all you are hoping to do is find the byte size of the string payload, use this expression:

Length(S) * SizeOf(S[1])

This expression works for all types of string.
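
A quick sanity check, reusing the '1¢' example (note that SizeOf(S[1]) is resolved at compile time, so the expression is safe even for an empty string):

var
  U8: UTF8String;
  U16: string;
begin
  U8 := '1¢';
  U16 := '1¢';
  Writeln(Length(U8) * SizeOf(U8[1]));   // 3 * 1 = 3 bytes
  Writeln(Length(U16) * SizeOf(U16[1])); // 2 * 2 = 4 bytes
end.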

Be very careful about the function System.SysUtils.ByteLength. On the face of it this seems to be just what you want. However, that function returns the byte length of a UTF-16 encoded string. So if you pass it an AnsiString, say, then the value returned by ByteLength is twice the number of bytes of the AnsiString.
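
A sketch of that pitfall (ByteLength lives in System.SysUtils and takes a UnicodeString parameter, so an AnsiString argument gets converted first):

uses System.SysUtils;
var
  A: AnsiString;
begin
  A := 'abc';
  Writeln(ByteLength(A));            // 6: the string is converted to UTF-16 before measuring
  Writeln(Length(A) * SizeOf(A[1])); // 3: the actual size of the AnsiString payload
end.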

Sosa answered 3/6, 2015 at 12:16 Comment(13)
Look at the code in my question. "1¢" is just two characters long, but the output is nonetheless 3. – Boni
@Boni It's because UTF-8 uses 1 to 4 bytes to represent a character. This Cyrillic character you wrote occupies 2 bytes. – Smelt
Yes, that's what I'm talking about. If it had returned the element count as expected, it would be 2, but it returned the size of the string, which is a job for SizeOf(). – Boni
@Boni That's right. You have two code points, but the UTF-8 encoded byte array has length 3. – Sosa
No, SizeOf always returns the size of the pointer, either 4 or 8 depending on your target platform. – Sosa
So what's the ultimate way of finding a string's size? I need it to do something based on how much memory it occupies. – Boni
Length(s)*SizeOf(s[1]) gives you the number of bytes occupied by the string. – Sosa
I'm thinking about converting it into a TBytes and using Length() on it. – Boni
@Boni Don't do that! That will involve pointless heap allocation. Use the simple expression in my previous comment. – Sosa
@Smelt Length(s)*SizeOf(s[1]) gives the byte count for all string types, not just ones with 8-bit character elements. – Sosa
This may help you dig deeper: Delphi and Unicode. – Desultory
@DavidHeffernan, assuming it's not empty :P It might very well be. – Boni
No. It's just fine for an empty string too. Then it returns 0. SizeOf() is evaluated at compile time. – Sosa
