UTF-8: how many bytes are used by languages to represent a visible character? [closed]

Is there a table or similar reference that shows how many bytes different languages need, on average, to represent one visible character (glyph) when the encoding is UTF-8?

Stealage asked 23/1, 2013 at 17:21 Comment(11)
By language you mean a human language, like English or Indonesian?Sumac
I'm not sure this question is even well-defined, and even if it is, I'm not sure it's very meaningful. Why are you asking? Maybe we can address your actual problem better. (Also, preemptively, in case you're thinking of avoiding UTF-8 to save space: utf8everywhere.org)Theatrician
You could infer it from utf8-chartable.deMusjid
interesting, but I'm not sure it makes sense to ask "on average"...Proportionable
@DanieleB: It could be averaged over some large body of text in the language. Choosing such a body of text without biasing the results could be very difficult. The average could vary depending on the density of digits, punctuation, and even the use of English or other loanwords.Lobate
For example: I am fetching data from a long database field that holds UTF-8 text in a certain language, and the average bytes/glyph for that language is 2.5. When I need the first 200 glyphs, I let the database truncate the data after 600 bytes (to save memory). Then I have a good chance of getting 200 or more glyphs after adequate processing of the data.Stealage
@dystroy, yes, a human language.Stealage
This is all rather a bad idea. Truncated text is still bad data. You just don't have to do this when database engines support variable-length strings like nvarchar. You'll get smaller databases that way.Coumarin
Then the question makes no sense at all; you can only read what's there.Coumarin
@HansPassant, I don't create databases, I only read the databases/tables.Stealage
@HansPassant, I need only what is there or less. I don't understand what you mean.Stealage

If you want something general, I think you should stick with this:

  • English takes only slightly more than 1 byte per character (there is the occasional non-ASCII character, often punctuation or a symbol embedded in the text).
  • Most other languages that use the Latin alphabet need somewhat more than 1, but I would be surprised if it exceeded, say, 1.5.
  • Languages using other alphabetic scripts (Greek, Cyrillic, etc.) take around 2 bytes per character.
  • East Asian languages take about 3 bytes per character (spacing, control characters, and embedded ASCII bring the average down; characters outside the BMP push it up).

That's all very incomplete, approximate, and non-quantitative.
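
A quick spot-check of those ratios in Python (the sample strings are my own illustrative picks, not a representative corpus):

    # Bytes per character for a few illustrative strings.
    for sample in ["The quick brown fox", "déjà vu café", "Ελληνικά κείμενα", "日本語のテキスト"]:
        chars = len(sample)
        nbytes = len(sample.encode("utf-8"))
        print(f"{sample!r}: {nbytes} bytes / {chars} chars = {nbytes / chars:.2f}")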

If you need something more quantitative, I think you will have to research each language individually. I doubt you will find precomputed results out there that already apply to a host of different languages.

If you have a corpus of text for a language, it's easy to calculate the average number of bytes required. Start with the Text corpus Wikipedia page. It links to at least one good freely available corpus for English and there might be some available for other languages as well (I didn't hunt through the links to find out).
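
For instance, a minimal sketch in Python (assuming the corpus is a plain-text file; "corpus.txt" is a placeholder name):

    # Average UTF-8 bytes per character over a corpus file.
    # "corpus.txt" is a placeholder; point it at your own corpus.
    with open("corpus.txt", encoding="utf-8") as f:
        text = f.read()
    print(len(text.encode("utf-8")) / len(text))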

Incidentally, I don't recommend using this information to truncate a database field the way you indicated (in the comments) that you intend to. First of all, if you used a corpus made up of literature to come up with your expected bytes per character, you might find the corpus is not at all representative of the short text strings that end up in your database, throwing off your expectation. Just fetch the whole database column: most values will be much shorter than the maximum length, and when they're not, the optimization isn't worth it to save a hundred bytes or so.
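
If you do truncate on a byte count anyway, at least decode in a way that drops a trailing partial sequence rather than producing garbage. A minimal Python sketch (the 600-byte limit comes from the comments above; the sample text is a stand-in for your database value):

    # Simulate a byte-truncated database read; in practice `raw` would be
    # the first 600 bytes returned by the database.
    full = "naïve résumé text " * 100
    raw = full.encode("utf-8")[:600]
    # A truncated buffer can end in the middle of a multi-byte sequence;
    # errors="ignore" drops the incomplete final character instead of
    # raising UnicodeDecodeError.
    text = raw.decode("utf-8", errors="ignore")
    print(len(raw), "bytes ->", len(text), "characters")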

Yoghurt answered 23/1, 2013 at 19:15 Comment(1)
I need/can use at most the length of a terminal row.Stealage

Look at a list of Unicode blocks and their code point ranges, e.g. the browsable http://www.fileformat.info/info/unicode/block/index.htm or the official http://www.unicode.org/Public/UNIDATA/Blocks.txt:

  • Anything up to U+007F takes 1 byte: Basic Latin (ASCII).
  • Then up to U+07FF it takes 2 bytes: Greek, Arabic, Cyrillic, Hebrew, etc.
  • Then up to U+FFFF it takes 3 bytes: Chinese, Japanese, Korean, Devanagari, etc.
  • Beyond that it takes 4 bytes; a sketch turning these ranges into code follows below.
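
Those ranges translate directly into code; a minimal sketch in Python:

    def utf8_len(cp: int) -> int:
        """UTF-8 encoded length, in bytes, of a single code point."""
        if cp <= 0x7F:
            return 1    # Basic Latin (ASCII)
        if cp <= 0x7FF:
            return 2    # Greek, Arabic, Cyrillic, Hebrew, ...
        if cp <= 0xFFFF:
            return 3    # rest of the BMP: CJK, Devanagari, ...
        return 4        # supplementary planes (emoji, rare CJK, ...)

    # utf8_len(ord("A")) == 1, utf8_len(ord("Ω")) == 2,
    # utf8_len(ord("中")) == 3, utf8_len(0x1F600) == 4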
Mcmillin answered 24/1, 2013 at 6:37 Comment(1)
It's difficult to know whether it is important to support 4-byte UTF-8. Characters at or above U+10000 require four bytes, and hence utf8mb4 rather than utf8 for MySQL storage, for example. There are symbols above U+10000 that fonts on OS X do support, as well as some additional CJK characters. My conclusion at the moment is that if Chinese language support is important to you, you should support 4-byte UTF-8 and allow the fuller range of characters.Cuspid
