Utf8_general_ci or utf8mb4 or...?

utf16 or utf32? I'm trying to store content in many languages. Some of them use double-width glyphs (for example, Japanese characters are often rendered twice as wide as English ones). I'm not sure which character set my database should use. Any information about the differences between these four charsets...

Swarthy answered 18/7, 2012 at 2:19 Comment(0)

MySQL's utf32 and utf8mb4 (as well as standard UTF-8) can directly store any character specified by Unicode; the former is fixed size at 4 bytes per character whereas the latter is between 1 and 4 bytes per character.

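utf8mb3 and the original utf8 (an alias for utf8mb3) can only store the first 65,536 codepoints, the Basic Multilingual Plane, using 1 to 3 bytes per character. That covers the commonly used CJKV (Chinese, Japanese, Korean, Vietnamese) characters, but not the rarer ideographs and emoji that lie outside that range.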

utf16 uses 2 bytes for the first 65,536 codepoints, and 4 bytes for everything else.
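
The sizes above can be checked outside MySQL. The following is a minimal Python sketch, not MySQL code: it measures how many bytes one character takes in each Unicode encoding. The helper name `byte_widths` and the sample characters are illustrative choices, and Python's UTF-8 codec corresponds to MySQL's utf8mb4, not to the 3-byte utf8mb3.

```python
# Encoded size, in bytes, of a single character in each Unicode encoding.
# The "-le" codecs are used so Python does not prepend a byte order mark.
# byte_widths is a hypothetical helper name chosen for this illustration.
ENCODINGS = {"utf-8": "utf-8", "utf-16": "utf-16-le", "utf-32": "utf-32-le"}


def byte_widths(ch):
    """Return {encoding name: number of bytes} for one character."""
    return {name: len(ch.encode(codec)) for name, codec in ENCODINGS.items()}


samples = {
    "Latin letter A (U+0041)": "A",
    "CJK ideograph (U+65E5)": "\u65e5",
    "Emoji (U+1F600), outside the BMP": "\U0001F600",
}

for label, ch in samples.items():
    print(label, byte_widths(ch))
```

A character inside the first 65,536 codepoints takes 1 to 3 bytes in UTF-8 and 2 bytes in UTF-16, while one outside that range takes 4 bytes in both; UTF-32 is always 4.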

As for fonts, that's strictly a visual thing: how wide a glyph is drawn has no bearing on how many bytes the character occupies in storage.
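
To illustrate that display width and storage size are independent, here is a small Python sketch using the standard `unicodedata` module. The helper name `describe` is made up for this example; the width classes ("Na" narrow, "W" wide, "H" halfwidth, "F" fullwidth) come from the Unicode East Asian Width property.

```python
import unicodedata


def describe(ch):
    """Return (East Asian Width class, UTF-8 byte count) for one character.

    describe is a hypothetical helper name used only for this illustration.
    """
    return unicodedata.east_asian_width(ch), len(ch.encode("utf-8"))


for ch in ["A", "\u65e5", "\uff71", "\uff21"]:
    # U+65E5 is a wide CJK ideograph, U+FF71 is halfwidth katakana,
    # U+FF21 is a fullwidth Latin capital A.
    print(f"U+{ord(ch):04X}", describe(ch))
```

Halfwidth katakana is drawn narrow and a CJK ideograph is drawn wide, yet both take 3 bytes in UTF-8, so rendering width tells you nothing about which charset to pick.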

Recommended reading: Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"

See also the MySQL documentation on Unicode support.

Accuracy answered 18/7, 2012 at 2:25 Comment(5)
Just to be extra-clear, the comment about utf8_general applies to all the other utf8_* collations too; all will be using MySQL's utf8mb3 aka utf8 charset. - Broadax
@JohnFlatness Thanks. Your comment is just what I was going to ask about next. I thought that UTF-16 used 2 bytes for Mandarin characters, for example, though? I'm looking at the documentation you gave me, hoping that it covers what 65,536 means. XD - Moralist
Chinese characters are within the Basic Multilingual Plane (the first 65,536 codepoints). - Accuracy
It seems like the latter 2 are the better options, space-wise. - Moralist
utf8 and utf8mb3 do not cover all CJK characters; some of them lie outside the Basic Multilingual Plane and need 4 bytes in UTF-8. - Remnant

Of these, utf8mb4 is the best choice.

utf8mb4 supports up to 4 bytes per character, compared to utf8's maximum of 3 bytes per character, so it can store every Unicode character rather than only the first 65,536 codepoints.

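With utf8mb4 you can store emoji, for example. If you try to insert an emoji into a column whose character set cannot represent it (such as utf8), MySQL typically rejects the value with an "Incorrect string value" error in strict SQL mode, or truncates it with a warning otherwise.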
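
If you want to find such characters before they reach the database, a rough application-side check is possible. This Python sketch assumes the only limit that matters is the one described above (utf8/utf8mb3 holds codepoints up to U+FFFF); the function names `fits_in_utf8mb3` and `outside_bmp` are hypothetical, not part of any MySQL client library.

```python
def fits_in_utf8mb3(text):
    """True if every character is within the first 65,536 codepoints,
    i.e. needs at most 3 bytes in UTF-8."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def outside_bmp(text):
    """List the codepoints in text that would need 4 bytes in UTF-8."""
    return [f"U+{ord(ch):04X}" for ch in text if ord(ch) > 0xFFFF]


plain = "caf\u00e9 \u65e5\u672c\u8a9e"     # accented Latin plus common kanji
with_emoji = "ok \U0001F600"               # contains an emoji
rare_kanji = "\U00020BB7\u91ce\u5bb6"      # starts with a CJK Extension B ideograph

for text in (plain, with_emoji, rare_kanji):
    print(fits_in_utf8mb3(text), outside_bmp(text))
```

The third sample shows that the limit is not only about emoji: some CJK ideographs also fall outside the 3-byte range.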

utf8mb4 is the more modern of the two. MySQL has deprecated utf8mb3, and the utf8 alias is expected to refer to utf8mb4 in a future release.

Sitnik answered 4/9, 2020 at 0:53 Comment(0)
