What is the difference between utf8mb4 and utf8 charsets in MySQL?
Asked Answered
E

5

519

What is the difference between utf8mb4 and utf8 charsets in MySQL?

I already know about ASCII, UTF-8, UTF-16 and UTF-32 encodings; but I'm curious to know whats the difference of utf8mb4 group of encodings with other encoding types defined in MySQL Server.

Are there any special benefits/proposes of using utf8mb4 rather than utf8?

Emerick answered 6/5, 2015 at 10:45 Comment(2)
Good read to understand difference: eversql.com/…Aurelea
utf-8 can store only 1, 2 or 3 bytes characters, while utf8mb4 can store 4 bytes characters as well. utf-8 is a subset of characters given by utf8mb4. full stop.Seriocomic
S
573

UTF-8 is a variable-length encoding. In the case of UTF-8, this means that storing one code point requires one to four bytes. However, MySQL's encoding called "utf8" (alias of "utf8mb3") only stores a maximum of three bytes per code point.

So the character set "utf8"/"utf8mb3" cannot store all Unicode code points: it only supports the range 0x000 to 0xFFFF, which is called the "Basic Multilingual Plane". See also Comparison of Unicode encodings.

This is what (a previous version of the same page at) the MySQL documentation has to say about it:

The character set named utf8[/utf8mb3] uses a maximum of three bytes per character and contains only BMP characters. As of MySQL 5.5.3, the utf8mb4 character set uses a maximum of four bytes per character supports supplemental characters:

  • For a BMP character, utf8[/utf8mb3] and utf8mb4 have identical storage characteristics: same code values, same encoding, same length.

  • For a supplementary character, utf8[/utf8mb3] cannot store the character at all, while utf8mb4 requires four bytes to store it. Since utf8[/utf8mb3] cannot store the character at all, you do not have any supplementary characters in utf8[/utf8mb3] columns and you need not worry about converting characters or losing data when upgrading utf8[/utf8mb3] data from older versions of MySQL.

So if you want your column to support storing characters lying outside the BMP (and you usually want to), such as emoji, use "utf8mb4". See also What are the most common non-BMP Unicode characters in actual use?.

Saying answered 6/5, 2015 at 10:47 Comment(16)
The only cases I have encountered (so far) where utf8mb4 was 'required' is Chinese and Emoticons. There are obscure alphabets that need it.Abruzzi
Its also required if you use to keep encrypted passwords and data in your database. I was keeping encrypted password in mysql using normal utf8 format which caused me al lot of trouble with some passwords randomly and very hard to debug so finally I tried to use base64 encode and fixed the problem temporary. But, now I know the reason.Emerick
@idealidea encrypted data is binary, and you shouldn't store binary data in a varchar column. :)Saying
Can you add which emojis work in utf8 and which don't?Gaudreau
@Gaudreau all BMP characters fit in a MySQL UTF8-column. If there are any emojis in the BMP, they'll work.Saying
@thomasrutter Try this (𡞰) character to save with UTF-8. :)Netherlands
@MojtabaRezaeian it dependos in the password algorithm somewhat - bcrypt2 will produce ASCII.Pronounce
@D3adL0cK - Thanks for the example. The Chinese character 𡞰 is hex F0A19EB0, requiring utf8mb4.Abruzzi
@thomasrutter Chinese+Korean+Japanese characters are not all in the basic multilingual plane.Gildea
By saying "it only supports the range 0x000 to 0xFFFF", "it" stores a maximum of three bytes per code point, but "0xFFFF" only has two bytes, where is the other byte?Lilllie
@work because a code point needs to be encoded into bytes. It costs a few bits to indicate "this is the first part of a mutibyte code point, more bytes follow". See also fileformat.info/info/unicode/char/ffff/index.htm and fileformat.info/info/unicode/utf8.htm.Saying
@thomasrutter Please note that there are 89,092 Chinese-Japanese-Korean characters as of Unicode 12.1.0, which is far more than what 0x0000-0xFFFF can represent. “𡞰” (U+217B0) is an example SIP CJK character.Coagulum
I have come across problems with ligatures (e.g. æ) and even accented characters when copying text from PDFs with non-standard fonts into a mySQL database - I suspect these are due to this problem.Levitical
so given a string, how would you test if it respects utf8mb3 encoding ?Craving
I found this post to answer my question just above, hope it will help: #32469810Craving
lol thanks for all the people pointing out my previous comment was wrong about CJK ranges - I was indeed totally wrong. At any rate, this is a good answer to the question.Entero
L
94

The utf8mb4 character set is useful because nowadays we need support for storing not only language characters but also symbols, newly introduced emojis, and so on.

A nice read on How to support full Unicode in MySQL databases by Mathias Bynens can also shed some light on this.

Lazo answered 31/10, 2016 at 12:28 Comment(1)
MySQL 8.0 is now default to utf8mb4 character set. [mysql.com/products/enterprise/techspec.html]Hines
F
90

Taken from the MySQL 8.0 Reference Manual:

  • utf8mb4: A UTF-8 encoding of the Unicode character set using one to four bytes per character.

  • utf8mb3: A UTF-8 encoding of the Unicode character set using one to three bytes per character.

In MySQL utf8 is currently an alias for utf8mb3 which is deprecated and will be removed in a future MySQL release. At that point utf8 will become a reference to utf8mb4.

So regardless of this alias, you can consciously set yourself an utf8mb4 encoding.

To complete the answer, I'd like to add the @WilliamEntriken's comment below (also taken from the manual):

To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references instead of utf8.

Firstrate answered 14/9, 2018 at 18:25 Comment(0)
E
44
  • utf8 is MySQL's older, flawed implementation of UTF-8 which is in the process of being deprecated.
  • utf8mb4 is what they named their fixed UTF-8 implementation, and is what you should use right now.

In their flawed version, only characters in the first 64k character plane - the basic multilingual plane - work, with other characters considered invalid. The code point values within that plane - 0 to 65535 (some of which are reserved for special reasons) can be represented by multi-byte encodings in UTF-8 of up to 3 bytes, and MySQL's early version of UTF-8 arbitrarily decided to set that as a limit. At no point was this limitation a correct interpretation of the UTF-8 rules, because at no point was UTF-8 defined as only allowing up to 3 bytes per character. In fact, the earliest definitions of UTF-8 defined it as having up to 6 bytes (since revised to 4). MySQL's original version was always arbitrarily crippled.

Back when MySQL released this, the consequences of this limitation weren't too bad as most Unicode characters were in that first plane. Since then, more and more newly defined character ranges have been added to Unicode with values outside that first plane. Unicode itself defines 17 planes, though so far only 7 of these are used.

In an effort not to break old code making any particular assumptions, MySQL retained the broken implementation and called the newer, fixed version utf8mb4. This has led to some confusion with the name being misinterpreted as if it's some kind of extension to UTF-8 or alternative form of UTF-8, rather than MySQL's implementation of the true UTF-8.

Future versions of MySQL will eventually phase out the older version, and for now it can be considered deprecated. For the foreseeable future you need to use utf8mb4 to ensure correct UTF-8 encoding. After sufficient time has passed, the current utf8 will be removed, and at some future date utf8 will rise again, this time referring to the fixed version, though utf8mb4 will continue to unambiguously refer to the fixed version.

Entero answered 5/10, 2020 at 13:39 Comment(3)
flawed!? I don't think so. I've had bad experience to upgrade my table to utf8mb4. Know what happened? I got Incorrect Key file error happened someday. To correct your answer, utf8mb4 is to extend utf8, but it's flawed.Dirtcheap
That is misinformation. utf8mb4 is what should be used for proper UTF-8 support now. If you had a corrupted database and you got an incorrect key file error that is an unrelated matter.Entero
This is the best answer because it clarifies that "utf8mb3" "utf8mb4" are merely names that MySQL uses internally for its "flawed implementations" as Thomas Rutter says, and this versioning has nothing to do with the UTF-8 standard maintained by the Unicode Consortium.Tracytrade
M
5

MySQL added this utf8mb4 code after 5.5.3, Mb4 is the most bytes 4 meaning, specifically designed to be compatible with four-byte Unicode. Fortunately, UTF8MB4 is a superset of UTF8, except that there is no need to convert the encoding to UTF8MB4. Of course, in order to save space, the general use of UTF8 is enough.

The original UTF-8 format uses one to six bytes and can encode 31 characters maximum. The latest UTF-8 specification uses only one to four bytes and can encode up to 21 bits, just to represent all 17 Unicode planes. UTF8 is a character set in Mysql that supports only a maximum of three bytes of UTF-8 characters, which is the basic multi-text plane in Unicode.

To save 4-byte-long UTF-8 characters in Mysql, you need to use the UTF8MB4 character set, but only 5.5. After 3 versions are supported (View version: Select version ();). I think that in order to get better compatibility, you should always use UTF8MB4 instead of UTF8. For char type data, UTF8MB4 consumes more space and, according to Mysql's official recommendation, uses VARCHAR instead of char.

In MariaDB utf8mb4 as the default CHARSET when it not set explicitly in the server config, hence COLLATE utf8mb4_unicode_ci is used.

Refer MariaDB CHARSET & COLLATE Click

CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
Mukund answered 22/7, 2020 at 17:53 Comment(2)
No. In MariaDB the default CHARSET is latin1. (Unless your distro patched this for you.) mariadb.com/kb/en/character-set-and-collation-overview/…Cleareyed
Using the flawed implementation instead of utf8mb4 doesn't save space. The same string will occupy the same number of bytes in either, for any variable length column. For fixed length columns like CHAR, it depends on the storage engine used as to whether it is space optimized like VARCHAR (which I think innodb does by default) or it reserves the maximum number of bytes eg (max bytes per char) x (number of chars).Entero

© 2022 - 2024 — McMap. All rights reserved.