Text encoded in UTF-8 will never be more than 50% larger than the same text encoded in UTF-16. True / false?

Somewhere I read (rephrased):

If we compare a UTF-8 encoded file with a UTF-16 encoded file, at times the UTF-8 file may be 50% to 100% larger in size

Am I right to say that the article is wrong, because text encoded in UTF-8 will never be more than 50% larger than the same text encoded in UTF-16?

Enrol answered 30/7, 2011 at 13:34 Comment(2)
The quote is misleading in many of the common cases. If Asian text is presented in a markup language like HTML or XML, the ASCII-only markup tags dominate and reduce the size to the point that UTF-8 can win over UTF-16 even for Asian text. Strange but true.Hearthstone
But that has nothing to do with the question; HTML and XML are not what this question is about. I've edited the question to make it more on point.Enrol

The answer is that in UTF-8, ASCII is just 1 byte, but in general most Western languages, including English, use a few characters here and there that require 2 bytes, so actual percentages vary. Greek, and the languages written in Cyrillic, require at least 2 bytes per character of their script when encoded in UTF-8.

Characters in the common Eastern languages require 3 bytes in UTF-8 but only 2 in UTF-16. Note however that “uncommon” Eastern characters require 4 bytes in both UTF-8 and UTF-16 alike.

3 is indeed only 50% greater than 2. But that is for a single code point only. It does not apply to an entire file.

The actual percentage is impossible to state with precision, because you do not know whether the balance of code points lies down in the 1- or 2-byte UTF-8 range or up in the 3- and 4-byte ranges. If there is white space in the Asian text, then that is only one byte of UTF-8, and yet it is a costly 2 bytes of UTF-16.

These things do vary. You can only get precise numbers on precise text, not on general text. Code points in Asian text take 1, 2, 3, or 4 bytes of UTF-8, while in UTF-16 they variously require 2 or 4 bytes apiece.
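
You can measure this for yourself. Here is a minimal Python sketch (the sample strings are my own, purely illustrative); "utf-16-le" is used so that a byte-order mark does not skew the counts:

    samples = {
        "English": "Tokyo is the capital of Japan.",
        "Greek":   "Το Τόκιο είναι η πρωτεύουσα της Ιαπωνίας.",
        "Chinese": "东京是日本的首都。",
    }
    for name, text in samples.items():
        u8  = len(text.encode("utf-8"))
        u16 = len(text.encode("utf-16-le"))   # no BOM, 2 bytes per BMP code point
        print(f"{name:8} UTF-8={u8:3} bytes  UTF-16={u16:3} bytes")

The all-ASCII English line is half the size in UTF-8, the Greek line lands in between, and the Chinese line is 50% bigger in UTF-8: the ratio depends entirely on the mix.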

Case Study

Compare the various languages’ Wikipedia pages on Tokyo to see what I mean. Even in Eastern languages, there is still plenty of ASCII going on. This alone makes your figures fluctuate. Consider:

Paras Lines Words Graphs Chars  UTF16 UTF8   8:16 16:8  Language

 519  1525  6300  43120 43147  86296 44023   51% 196%  English
 343   728  1202   8623  8650  17302  9173   53% 189%  Welsh
 541  1722  9013  57377 57404 114810 59345   52% 193%  Spanish
 529  1712  9690  63871 63898 127798 67016   52% 191%  French
 321   837  2442  18999 19026  38054 21148   56% 180%  Hungarian

 202   464   976   7140  7167  14336 11848   83% 121%  Greek
 348   937  2938  21439 21467  42936 36585   85% 117%  Russian

 355   788   613   6439  6466  12934 13754  106%  94%  Chinese, simplified
 209   419   243   2163  2190   4382  3331   76% 132%  Chinese, traditional
 461  1127  1030  25341 25368  50738 65636  129%  77%  Japanese
 410   925  2955  13942 13969  27940 29561  106%  95%  Korean

Each of those is the Tokyo Wikipedia page saved as text, not as HTML. All text is in NFC, not in NFD. The meaning of each of the columns is as follows:

  1. Paras is the number of blank-line-separated text spans.
  2. Lines is the number of linebreak-separated text spans.
  3. Words is the number of whitespace-separated text spans.
  4. Graphs is the number of Unicode extended grapheme clusters, sometimes called glyphs. These are user-visible characters.
  5. Chars is the number of Unicode code points. These are, or should be, programmer-visible characters.
  6. UTF16 is how many bytes that takes up when the file is stored as UTF-16.
  7. UTF8 is how many bytes that takes up when the file is stored as UTF-8.
  8. 8:16 is the ratio of UTF-8 size to UTF-16 size, expressed as a percentage.
  9. 16:8 is the ratio of UTF-16 size to UTF-8 size, expressed as a percentage.
  10. Language is which version of the Tokyo page we’re talking about here.

I’ve grouped the languages into Western Latin, Western non-Latin, and Eastern. Observations:

  1. Western languages that use the Latin script suffer terribly upon conversion from UTF-8 to UTF-16, with English suffering the most by expanding by 96% and Hungarian the least by expanding by 80%. All are huge.

  2. Western languages that do not use the Latin script still suffer, but only 15-20%.

  3. Eastern languages DO NOT SUFFER in UTF-8 the way everyone claims that they do! Behold:

    • In Korean and in (simplified) Chinese, you get only 6% bigger in UTF-8 than in UTF-16.
    • In Japanese, you get only 29% bigger in UTF-8 than in UTF-16.
    • The traditional Chinese actually got smaller in UTF-8 than in UTF-16! In fact, it costs 32% to use UTF-16 over UTF-8 for this sample. If you look at the Lines and Words columns, it looks as though this might be due to white-space usage.

I hope that answers your question. There is simply no +50% to +100% size increase for Eastern languages when encoded in UTF-8 compared to when these same texts are encoded in UTF-16. Only when taking individual code points do you ever see numbers like that, which is a completely unreasonable metric.
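
If you want to reproduce the UTF16, UTF8, and ratio columns for your own text, a rough Python sketch along these lines will do; "tokyo.txt" is just a placeholder for wherever you saved a page as plain NFC text, and since it encodes without a BOM the counts may differ from mine by a couple of bytes:

    with open("tokyo.txt", encoding="utf-8") as f:   # placeholder filename
        text = f.read()

    chars = len(text)                       # Chars column: code points
    utf8  = len(text.encode("utf-8"))       # UTF8 column
    utf16 = len(text.encode("utf-16-le"))   # UTF16 column, without a BOM

    print(f"Chars={chars}  UTF16={utf16}  UTF8={utf8}  "
          f"8:16={utf8 / utf16:.0%}  16:8={utf16 / utf8:.0%}")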

Hearthstone answered 30/7, 2011 at 17:2 Comment(6)
btw, is this statement right: take an input (any input) and encode it in UTF-8 and UTF-16. The UTF-8 file will NEVER be over 50% greater in file size compared to the UTF-16 file for ALL possible inputs. true / false?Enrol
Yes, it is true, though the worst case is rarely met. In the ASCII range, UTF-16 is 100% bigger than UTF-8, or 2:1. In the 2-byte UTF-8 range (U+0080..U+07FF) they are the same size. In the rest of the BMP, UTF-8 is 50% bigger than UTF-16, which is 3:2 or 2:3 depending which way you look at it. In the non-BMP range, they are the same size, or 1:1.Hearthstone
Wow, what a thorough answer! But are there really actively used Asian languages that have 4-byte characters?Countercurrent
@Michael Barth: Yes, there absolutely are actively used “Asian” languages that use code points that encode to 4 bytes in both UTF-8 and UTF-16. Look in the CJK Compatibility Ideographs Supplement block. There are other non-BMP CJK blocks to look at as well, like CJK Unified Ideographs Extension B, CJK Unified Ideographs Extension C, and CJK Unified Ideographs Extension D. Yes, these really are used.Hearthstone
@Hearthstone heys btw could you update your answer to answer the question in a more direct way? thx for the trouble =DEnrol
@Enrol I feel that a mere "yes" or "no", which is the technical answer to a Boolean question, is misleadingly brief. If I thought otherwise, I would have originally answered in that fashion, and what good would that have done you? From a technical perspective, there are about 16.5x as many code points that stay the same size or shrink as there are code points that grow by +50% when converting from UTF-16 to UTF-8. From a practical perspective, this is immaterial, because you have to look at the entire corpus, not singular code points. Make the answer too short and it risks leading readers into error.Hearthstone

Yes, you are correct. Code points in the range U+0800..U+FFFF give the +50% size increase.

                   UTF-8   UTF-16
U+0000..U+007F       1        2
U+0080..U+07FF       2        2
U+0800..U+FFFF       3        2
U+010000..U+10FFFF   4        4
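
A quick way to check those sizes is to encode one sample code point from each row, for example in Python (the particular code points below are arbitrary picks):

    for cp in (0x0041, 0x03B1, 0x4E2C, 0x10400):    # one pick from each row above
        ch = chr(cp)
        print(f"U+{cp:04X}: {len(ch.encode('utf-8'))} bytes in UTF-8, "
              f"{len(ch.encode('utf-16-le'))} bytes in UTF-16")
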
Calendula answered 10/8, 2011 at 8:47 Comment(0)

In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

Though UTF-8 characters may use up to 4 bytes (and more was theoretically possible in the original design), 4-byte sequences are not used for the Basic Multilingual Plane, which includes "almost all modern languages".

Three bytes are needed for the rest of the Basic Multilingual Plane (which contains virtually all characters in common use). Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters and various historic scripts.

So I guess a 100% overhead, though theoretically possible, is not possible with a typical modern language. You'd have to use something exotic from the Supplementary Multilingual Plane, which uses 4 bytes in UTF-8, to achieve this.

For HTML documents or mixed text it may not be necessary to switch to UTF-16 to save space:

Characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in UTF-16. As a result, text in (for example) Chinese, Japanese or Hindi could take more space in UTF-8 if there are more of these characters than there are ASCII characters. This happens for pure text, but rarely for HTML documents. For example, both the Japanese UTF-8 and the Hindi Unicode articles on Wikipedia take more space if saved as UTF-16 than the original UTF-8 version.
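
As a rough illustration (the strings below are invented, not taken from Wikipedia), wrapping CJK text in ASCII-heavy markup can tip the balance back toward UTF-8:

    pure = "東京都は日本の首都です。" * 10
    html = "".join(f'<p class="intro">{pure[i:i + 12]}</p>\n'
                   for i in range(0, len(pure), 12))

    for label, text in (("pure CJK text", pure), ("same text in HTML", html)):
        u8, u16 = len(text.encode("utf-8")), len(text.encode("utf-16-le"))
        print(f"{label:18} UTF-8={u8:4}  UTF-16={u16:4}")

With these invented strings the pure text is 50% bigger in UTF-8 (360 vs 240 bytes), while the marked-up version comes out smaller in UTF-8 than in UTF-16 (580 vs 680 bytes).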

See the UTF-8 to UTF-16 comparison on Wikipedia.


Joel Spolsky wrote a great article about Unicode, I can really recommend it:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Countercurrent answered 30/7, 2011 at 13:45 Comment(4)
No, strict UTF-8 as currently defined, which covers the 0 .. 0x10FFFF range, can occupy at most 4 bytes. Only the original spec, or a direct conversion of UCS-4 without regard to the artificial boundaries of UTF-16, would let you go to 6 bytes. Perl on 64-bit boxen can use code points that take up 13 bytes to encode in the UTF-8 style, although those are trans-Unicodian points. See my three Unicode talks from OSCON; which of those to read depends on how much and what sort of orientation you need. There is also a bit of redundancy between the talks.Hearthstone
That is exactly why I'm asking the question: why would UTF-8 give a 50% to 100% larger file size than the same text encoded in UTF-16?Enrol
@michael Thanks. BTW, isn't it true that all the characters in the Supplementary Multilingual Plane are encoded as surrogate pairs (and hence take 4 bytes in UTF-16 too)? So when you say 100% is the theoretical maximum, isn't that wrong, since 50% is the theoretical maximum and anything above 50% is just impossible (4 bytes vs 4 bytes is a 0% increase)?Enrol
The PubMed Central Open Access corpus, which is an all-English Unicode corpus spanning about 11 gigabytes, contains 1339 distinct non-ASCII code points, most with many many repeats. Of these 1339, only 58 distinct code points (with repeats) lie outside the BMP. Save for one private-use code point alone, all of these are in the Unicode Mathematical_Alphanumeric_Symbols block. See slides 4 and 5 at the start of my OSCON talk on Unicode Support Shootout: The Good, the Bad, and the (mostly) Ugly to see what I mean.Hearthstone

If you have one byte for the character and add on a second byte, I'd call that a 100% increase, not 50%. I think that's what the author means.

If I write X characters with N bytes/character to a file I'll have NX bytes in that file. So you can see where doubling or tripling the number of bytes per character will have a linear effect on the size of the file.
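
For example (purely illustrative numbers, in Python):

    X = 10_000                       # characters
    for N in (1, 2, 3, 4):           # bytes per character
        print(f"N={N} bytes/char -> file size = {N * X} bytes")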

Deodar answered 30/7, 2011 at 13:40 Comment(1)
UTF-16 characters are 2 bytes each, so if you add one byte it is a 50% increase. But the article is claiming we will have a 50% to 100% increase in file size. Hence this question: why that claim? What's the justification behind it? And am I right to say that at worst we will only suffer a 50% increase in file size?Enrol
