UTF-8: General? Bin? Unicode?

Asked 26/2, 2010 at 19:3 Answered 10/12, 2018 at 14:44

296

I'm trying to figure out what collation I should be using for various types of data. 100% of the content I will be storing is user-submitted.

My understanding is that I should be using UTF-8 General CI (Case-Insensitive) instead of UTF-8 Binary. However, I can't find a clear distinction between UTF-8 General CI and UTF-8 Unicode CI.

Should I be storing user-submitted content in UTF-8 General or UTF-8 Unicode CI columns?
What type of data would UTF-8 Binary be applicable to?

Irate answered 26/2, 2010 at 19:3 Comment(4)

Side note: use utf8mb4 instead of utf8 for full UTF-8 support. Commenting here because the answers on this popular question do not address this. mathiasbynens.be/notes/mysql-utf8mb4 – Propagandize 6/1, 2016 at 19:33

If you want case folding, but accent sensitivity, please file a request at bugs.mysql.com . – Nellienellir 14/3, 2017 at 22:47

Or click "Affects Me" on bugs.mysql.com/bug.php?id=58797 and add a comment. – Nellienellir 6/6, 2017 at 20:48

Now that 8.0 is common, much of this Question and the Answers are out of date. (Feel free to start a new Question to get a more targeted answer.) – Nellienellir 2/9, 2021 at 17:49

307

In general, utf8_general_ci is faster than utf8_unicode_ci, but less correct.

Here is the difference:

For any Unicode character set, operations performed using the _general_ci collation are faster than those for the _unicode_ci collation. For example, comparisons for the utf8_general_ci collation are faster, but slightly less correct, than comparisons for utf8_unicode_ci. The reason for this is that utf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example, in German and some other languages “ß” is equal to “ss”. utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.

Quoted from: http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html
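
To make the "expansion" idea in the quote concrete, here is a minimal Python sketch. It is only an analogy: Python's str.casefold() is not MySQL's collation code, and the ONE_TO_ONE table is a hypothetical stand-in for a collation that can only map one character to one character.

```python
# Illustration only: this is NOT how MySQL implements its collations.

def expanding_key(s: str) -> str:
    """Comparison key that lets one character expand to several.

    str.casefold() applies Unicode full case folding, under which
    'ß' folds to the two-character sequence 'ss'.
    """
    return s.casefold()

# Hypothetical one-to-one table: every character maps to exactly one
# character, never to a sequence.
ONE_TO_ONE = {"ß": "s"}

def one_to_one_key(s: str) -> str:
    """Comparison key limited to per-character substitutions."""
    return "".join(ONE_TO_ONE.get(ch, ch.lower()) for ch in s)

# With expansions, 'ß' can match 'ss'.
print(expanding_key("Straße") == expanding_key("STRASSE"))    # True
# With one-to-one mappings only, the lengths differ, so no match.
print(one_to_one_key("Straße") == one_to_one_key("STRASSE"))  # False
```

The second comparison fails because a per-character mapping can never turn a six-character string into a seven-character one.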

For more detailed explanation, please read the following post from MySQL forums: http://forums.mysql.com/read.php?103,187048,188748

As for utf8_bin: Both utf8_general_ci and utf8_unicode_ci perform case-insensitive comparison. In contrast, utf8_bin is case-sensitive (among other differences), because it compares the binary values of the characters.
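
A rough way to picture the difference, sketched in Python rather than MySQL. The function names bin_equal and ci_equal are made up for this example, and the accent stripping via Unicode decomposition is an assumption chosen for simplicity, not the actual collation algorithm.

```python
import unicodedata

def bin_equal(a: str, b: str) -> bool:
    """Rough analogue of a _bin collation: compare the encoded bytes."""
    return a.encode("utf-8") == b.encode("utf-8")

def ci_equal(a: str, b: str) -> bool:
    """Rough analogue of a _ci collation: ignore case and accents."""
    def key(s: str) -> str:
        # Decompose accented letters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", s)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        return stripped.casefold()
    return key(a) == key(b)

print(bin_equal("Jose", "jose"))  # False: the bytes differ
print(ci_equal("Jose", "jose"))   # True: case is ignored
print(ci_equal("jose", "josé"))   # True: the accent is ignored too
print(bin_equal("jose", "josé"))  # False
```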

Quesada answered 26/2, 2010 at 19:7 Comment(6)

I think that if you don't have a good reason to use _unicode_ci, then use _general_ci. – Quesada 26/2, 2010 at 19:12

This doesn't really answer the question in depth though. What is the difference between these collations exactly? – Droughty 2/4, 2011 at 22:34

You are right, the exact difference is not provided here for the sake of simplicity. I've added a link to a post with the exact difference. – Quesada 16/9, 2011 at 16:43

NB show collation; allows you to see the default collation for each character set. 5.1 shows utf8_general_ci as default for utf8. – Unconscious 16/7, 2012 at 14:53

Are there any resources that would go more in-depth in the actual speed difference between the two collations? Are we talking about a 0.1% drop in performance or a 10% drop? – Livia 4/3, 2013 at 18:11

Does the utf8_bin collation mean an exact binary match? – Squeak 11/3, 2014 at 11:54

You should also be aware that with utf8_general_ci, when a varchar field is used as a unique or primary index, inserting two values such as 'a' and 'á' gives a duplicate key error.
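
The effect can be pictured with a toy model in Python. UniqueIndex and accent_case_insensitive_key are hypothetical names invented for this sketch; a real MySQL index works differently internally, but the collision logic is the same idea: uniqueness is judged on the collation's view of the value, not on its bytes.

```python
import unicodedata

def accent_case_insensitive_key(s: str) -> str:
    """Assumed stand-in for a _ci collation key: strip accents, fold case."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()

class UniqueIndex:
    """Toy unique index: rejects values whose collation keys collide."""

    def __init__(self, key_func):
        self._key_func = key_func
        self._keys = set()

    def insert(self, value: str) -> None:
        key = self._key_func(value)
        if key in self._keys:
            raise ValueError(f"Duplicate entry {value!r}")
        self._keys.add(key)

# Case- and accent-insensitive index: 'a' and 'á' collide.
ci_index = UniqueIndex(accent_case_insensitive_key)
ci_index.insert("a")
try:
    ci_index.insert("á")
    collided = False
except ValueError:
    collided = True

# Byte-wise index (like a _bin collation): both values are accepted.
bin_index = UniqueIndex(lambda s: s.encode("utf-8"))
bin_index.insert("a")
bin_index.insert("á")

print(collided)  # True
```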

Baur answered 19/1, 2011 at 14:11 Comment(2)

Thanks, this is useful to avoid similar usernames (e.g. if "jose" exists, I wouldn't want someone else to create a "josé" user) NB: this also holds true for most of the utf8 collations (except utf8_bin). The surest/safest/most comprehensive is utf8_unicode_ci – Vestpocket 10/4, 2013 at 3:12

I use utf8_bin where I want jose and josé to be distinguished in the index. For example, a column that records search/replace operations, where the user might have decided to search for josé, and replace it with jose. (I'm writing a spreadsheet program) – Imogene 9/5, 2013 at 19:56

utf8_bin compares the bits blindly. No case folding, no accent stripping.
utf8_general_ci compares one codepoint with one codepoint. It does case folding and accent stripping, but no 2-character comparisons; for example, ij is not equal to ĳ in this collation.
utf8_*_ci is a set of language-specific rules, but otherwise like unicode_ci. Some special cases: Ç, Č, ch, ll
utf8_unicode_ci follows an old Unicode standard for comparisons. ij=ĳ, but ae != æ
utf8_unicode_520_ci follows a newer Unicode standard. ae = æ

See collation chart for details on what is equal to what in various utf8 collations.

utf8, as defined by MySQL, is limited to the 1- to 3-byte utf8 codes. This leaves out Emoji and some of Chinese. So you should really switch to utf8mb4 if you want to go much beyond Europe.
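
A quick way to see which characters fall outside the 3-byte limit is to look at their encoded length. This is a small Python sketch; the helper names utf8_length and fits_in_three_bytes are made up for the example.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

def fits_in_three_bytes(text: str) -> bool:
    """True if every character encodes to at most 3 bytes in UTF-8."""
    return all(utf8_length(ch) <= 3 for ch in text)

for ch in ["a", "é", "中", "😁"]:
    print(repr(ch), utf8_length(ch))  # 1, 2, 3 and 4 bytes respectively

print(fits_in_three_bytes("josé"))   # True
print(fits_in_three_bytes("hi 😁"))  # False: the emoji needs 4 bytes
```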

The above points apply to utf8mb4, after suitable spelling change. Going forward, utf8mb4 and utf8mb4_unicode_520_ci are preferred. Or (in 8.0) utf8mb4_0900_ai_ci

utf16 and utf32 are variants on utf8; there is virtually no use for them.
ucs2 is closer to "Unicode" than "utf8"; there is virtually no use for it.

Nellienellir answered 29/7, 2016 at 17:54 Comment(4)

Re "stay tuned": 8.0 collations shows how various characters, diphthongs, etc, compare in the 8.0 utf8mb4 collations; utf8 is mostly the same. – Nellienellir 15/2, 2017 at 22:55

And 8.0 collations are clocked at being significantly faster than 5.x. – Nellienellir 6/6, 2017 at 20:49

it would be nice if that page lists utf8mb4_bin at the top. I know it does no character matching at all, but it's good for newbies. – Matheny 19/7, 2019 at 8:58

@TobySpeight - Thanks. Now I am worried that I messed up on other answers; I have said things like that many times in the last decade. Now that 8.0 is the current version, many questions like this are not asked -- general_ci is not the default anymore. – Nellienellir 31/3, 2021 at 21:2

Accepted answer is outdated.

If you use MySQL 5.5.3+, use utf8mb4_unicode_ci instead of utf8_unicode_ci to ensure the characters typed by your users won't give you errors.

utf8mb4 supports emojis for example, whereas utf8 might give you hundreds of encoding-related bugs like:

Incorrect string value: '\xF0\x9F\x98\x81…' for column 'data' at row 1
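
To see what that message is complaining about, the four bytes shown before the ellipsis can be decoded in Python. This sketch only covers those visible bytes; whatever followed them in the original value is unknown.

```python
# The four bytes quoted at the start of the error message.
raw = b"\xF0\x9F\x98\x81"

ch = raw.decode("utf-8")
# A single character whose UTF-8 encoding is 4 bytes long,
# which is one byte more than MySQL's utf8 charset accepts.
print(ch, f"U+{ord(ch):04X}", len(raw))  # 😁 U+1F601 4
```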

Haslet answered 10/12, 2018 at 14:44 Comment(1)

This Answer (correctly) addresses issues with encoding of Emoji (and some of Chinese). But the Question seems to be focused on Collation. utf8mb4_unicode_ci treats (I think) all Emoji as equal. utf8mb4_unicode_520_ci gives an ordering to Emoji. – Nellienellir 19/7, 2019 at 16:18

Actually, I tested saving values like 'é' and 'e' in a column with a unique index, and they cause a duplicate-key error with both 'utf8_unicode_ci' and 'utf8_general_ci'. You can store both values only in a 'utf8_bin' collated column.

The MySQL docs (http://dev.mysql.com/doc/refman/5.7/en/charset-applications.html) also use the 'utf8_general_ci' collation in their examples:

[mysqld]
character-set-server=utf8
collation-server=utf8_general_ci

Itagaki answered 8/7, 2014 at 9:36 Comment(2)

I did a quick test on this, and it appears to be accurate. Both collations behave the same when it comes to a unique key on a column and values with tildes and the like. – Redouble 30/6, 2015 at 0:19

@Redouble OK, I should add that the column needs a unique index for this error to occur. My answer implies it. – Itagaki 1/7, 2015 at 7:9
