How can I search by emoji in MySQL using utf8mb4?
Asked Answered
P

2

17

Please help me understand how multibyte characters like emoji's are handled in MySQL utf8mb4 fields.

See below for a simple test SQL to illustrate the challenges.

/* Clear Previous Test */
DROP TABLE IF EXISTS `emoji_test`;
DROP TABLE IF EXISTS `emoji_test_with_unique_key`;

/* Build Schema */
CREATE TABLE `emoji_test` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `string` varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '',
  `status` tinyint(1) NOT NULL DEFAULT '1',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
CREATE TABLE `emoji_test_with_unique_key` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `string` varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL DEFAULT '',
  `status` tinyint(1) NOT NULL DEFAULT '1',
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_string_status` (`string`,`status`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

/* INSERT data */
# Expected Result is successful insert for each of these.
# However some fail. See comments.
INSERT INTO emoji_test (`string`, `status`) VALUES ('๐ŸŒถ', 1);                   # SUCCESS
INSERT INTO emoji_test (`string`, `status`) VALUES ('๐ŸŒฎ', 1);                   # SUCCESS
INSERT INTO emoji_test (`string`, `status`) VALUES ('๐ŸŒฎ๐ŸŒถ', 1);                 # SUCCESS
INSERT INTO emoji_test (`string`, `status`) VALUES ('๐ŸŒถ๐ŸŒฎ', 1);                 # SUCCESS
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('๐ŸŒถ', 1);   # SUCCESS
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('๐ŸŒฎ', 1);   # FAIL: Duplicate entry '?-1' for key 'idx_string_status'
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('๐ŸŒฎ๐ŸŒถ', 1); # SUCCESS
INSERT INTO emoji_test_with_unique_key (`string`, `status`) VALUES ('๐ŸŒถ๐ŸŒฎ', 1); # FAIL: Duplicate entry '??-1' for key 'idx_string_status'

/* Test data */

    /* Simple Table */
SELECT * FROM emoji_test WHERE `string` IN ('๐ŸŒถ','๐ŸŒฎ','๐ŸŒฎ๐ŸŒถ','๐ŸŒถ๐ŸŒฎ'); # SUCCESS (all 4 are found)
SELECT * FROM emoji_test WHERE `string` IN ('๐ŸŒถ');                     # FAIL: Returns both ๐ŸŒถ and ๐ŸŒฎ
SELECT * FROM emoji_test WHERE `string` IN ('๐ŸŒฎ');                     # FAIL: Returns both ๐ŸŒถ and ๐ŸŒฎ
SELECT * FROM emoji_test;                                              # SUCCESS (all 4 are found)

    /* Table with Unique Key */
SELECT * FROM emoji_test_with_unique_key WHERE `string` IN ('๐ŸŒถ','๐ŸŒฎ','๐ŸŒฎ๐ŸŒถ','๐ŸŒถ๐ŸŒฎ'); # FAIL: Only 2 are found (due to insert errors above)
SELECT * FROM emoji_test_with_unique_key WHERE `string` IN ('๐ŸŒถ');                     # SUCCESS
SELECT * FROM emoji_test_with_unique_key WHERE `string` IN ('๐ŸŒฎ');                     # FAIL: ๐ŸŒถ found instead of ๐ŸŒฎ
SELECT * FROM emoji_test_with_unique_key;                                              # FAIL: Only 2 records found (๐ŸŒถ and ๐ŸŒฎ๐ŸŒถ)

I'm interested in learning what causes the FAILs above and how I can get around this.

Specifically:

  1. Why do selects for one multibyte character return results for any multibyte character?
  2. How can I configure an index to handle multibyte characters instead of ??
  3. Can you recommend changes to the second CREATE TABLE (the one with a unique key) above in such a way that makes all the test queries return successfully?
Prosperus answered 14/12, 2016 at 16:44 Comment(3)
As any Mexican can tell you, ๐ŸŒฎ ('TACO' (U+1F32E)) and ๐ŸŒถ ('HOT PEPPER' (U+1F336)) are clearly related but different things. This must be the most wonderfully composed question in years. โ€“ Subeditor
Related: #38117484 : Solution is to use MySQL 5.6+ and to use utf8mb4_unicode_520_ci collation which doesn't treat all 4 bytes characters as equal - A pretty good reason to avoid emojis as passwords :) โ€“ Subeditor
@ÁlvaroGonzález Well if this is a problem for passwords, then there is a bigger bigger problem with the given setup, because passwords should be stored with a oneway hash. And for hashing, it shouldn't be a problem. But I also wouldn't suggest to use them for passwords. โ€“ Somnifacient
S
27

You use utf8mb4_unicode_ci for your columns, so the check is case insensitive. If you use utf8mb4_bin instead, then the emoji ๐ŸŒฎ and ๐ŸŒถ are correctly identified as different letters.

With WEIGHT_STRING you can get the values that are use for sorting and comparison for the input string.

If you write:

SELECT
  WEIGHT_STRING ('๐ŸŒฎ' COLLATE 'utf8mb4_unicode_ci'),
  WEIGHT_STRING ('๐ŸŒถ' COLLATE 'utf8mb4_unicode_ci')

Then you can see that both are 0xfffd. In Unicode Character Sets they say:

For supplementary characters in general collations, the weight is the weight for 0xfffd REPLACEMENT CHARACTER.

If you write:

SELECT 
  WEIGHT_STRING('๐ŸŒฎ' COLLATE 'utf8mb4_bin'),
  WEIGHT_STRING('๐ŸŒถ' COLLATE 'utf8mb4_bin')

You will get their unicode values 0x01f32e and 0x01f336 instead.

For other letters like ร„, ร and A that are equal if you use utf8mb4_unicode_ci, the difference can be seen in:

SELECT
  WEIGHT_STRING ('ร„' COLLATE 'utf8mb4_unicode_ci'),
  WEIGHT_STRING ('A' COLLATE 'utf8mb4_unicode_ci')

Those map to to the weight 0x0E33

ร„: 00C4  ; [.0E33.0020.0008.0041][.0000.0047.0002.0308] # LATIN CAPITAL LETTER A WITH DIAERESIS; QQCM
A: 0041  ; [.0E33.0020.0008.0041] # LATIN CAPITAL LETTER A

According to : Difference between utf8mb4_unicode_ci and utf8mb4_unicode_520_ci collations in MariaDB/MySQL? the weights used for utf8mb4_unicode_ci are based on UCA 4.0.0 because the emoji do not appear in there, the mapped weight is 0xfffd

If you need case insensitive compares and sorts for regular letters along with emoji then this problem is solved using utf8mb4_unicode_520_ci:

SELECT
  WEIGHT_STRING('๐ŸŒฎ' COLLATE 'utf8mb4_unicode_520_ci'),
  WEIGHT_STRING('๐ŸŒถ' COLLATE 'utf8mb4_unicode_520_ci')

there will also get different weights for those emoji 0xfbc3f32e and 0xfbc3f336.

Somnifacient answered 14/12, 2016 at 16:54 Comment(5)
This is incredible. Switching encoding to utf8mb4_bin in the CREATE TABLEs above made the rest of the test queries work exactly as expected. Thanks so much. Any further insight into this would be appreciated. โ€“ Prosperus
No wonder binary collation fixes the issue (that's what it's meant for) but I can't understand why would two entirely different emojis be considered as case variations of the same character. I doubt it's intentional. โ€“ Subeditor
@ÁlvaroGonzález a similar reason why Ä, Á and A are the same, even if they might have different pronunciation and meaning. My first though was, that they are treated as equal, because they are all in the category food, but its more likely that the ci just checks if they are emoji. โ€“ Somnifacient
So... Collation database doesn't have information about them so they get assigned a generic common weight thus become "equal"? โ€“ Subeditor
8.0 will usher in utf8mb4_0900_ai_ci, based on UCA 9.0.0. โ€“ Stantonstanway
S
2

Don't need to go to weights. Do something like this to see whether two characters (or strings) are equal.

mysql> SELECT '๐ŸŒฎ' = '๐ŸŒถ' COLLATE utf8mb4_unicode_ci;
+--------------------------------------+
| '?' = '?' COLLATE utf8mb4_unicode_ci |
+--------------------------------------+
|                                    1 |  1 = true, hence equal
+--------------------------------------+
1 row in set (0.00 sec)

mysql> SELECT '๐ŸŒฎ' = '๐ŸŒถ' COLLATE utf8mb4_unicode_520_ci;
+------------------------------------------+
| '?' = '?' COLLATE utf8mb4_unicode_520_ci |
+------------------------------------------+
|                                        0 |  unequal
+------------------------------------------+
1 row in set (0.00 sec)
Stantonstanway answered 28/10, 2021 at 23:1 Comment(0)

© 2022 - 2024 โ€” McMap. All rights reserved.