MySQL, UTF-8 and Emoji characters
Asked Answered
P

1

1

I'm working on an iOS app with a PHP+MySQL backend. The app has a chat section, which needs to support emoji. My tables are utf8_unicode_ci. If I don't call 'set names utf8' in my scripts, emoji it actually works - whatever is entered in the database, is returned to the clients as it should.

The problem is that this (if I understand it correctly) stores special characters incorrectly in the database, and this breaks string comparing (ie ï is no longer the same as i when comparing strings).

However, if I do call set names utf8, suddenly the emoji characters are inserted as a bunch of questionmarks.

Any suggestions on the proper way of handling this? Thanks!

Posen answered 17/6, 2014 at 0:11 Comment(0)
S
1

The issue is wether the db has a diacritical insensitive compare. The other issue is composed characters, ï can be expressed as either one unicode character or two forming a surrogate pair. There are methods to convert a string to a pre-composed or decomposed form: precomposedStringWith* and decomposedStringWith*.

It seems that MySQL supports two forms of unicode ucs2 (that is an older form that was supersede by utf16) which is 16-bits per character and utf8 up to 3 bytes per character. The bad news is that neither form is going to support plane 1 characters which require at 17 bits. (mainly emoji). It looks like MySQL 5.5.3 and up also support utf8mb4, utf16, and utf32 support BMP and supplementary characters (read emoji). See MySQL Unicode Character Sets.

Here is some code and results to demonstrate the different unicode byte representations.
Unicode is a 21 bit encoding system.
UTF32 directly represents the code points and clearly demonstrates decomposed surrogate pairs.
UTF8 and UTF16 require one or more bytes to represent a unicode character.

NSLog(@"character: %@", @"Å");
NSLog(@"decomposedStringWithCanonicalMapping UTF8:  %@", [[@"Å" decomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF8StringEncoding]);
NSLog(@"decomposedStringWithCanonicalMapping UTF16: %@", [[@"Å" decomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF16BigEndianStringEncoding]);
NSLog(@"decomposedStringWithCanonicalMapping UTF32: %@", [[@"Å" decomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF32BigEndianStringEncoding]);

NSLog(@"precomposedStringWithCanonicalMapping UTF8:  %@", [[@"Å" precomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF8StringEncoding]);
NSLog(@"precomposedStringWithCanonicalMapping UTF16: %@", [[@"Å" precomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF16BigEndianStringEncoding]);
NSLog(@"precomposedStringWithCanonicalMapping UTF32: %@", [[@"Å" precomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF32BigEndianStringEncoding]);

NSLog(@"character: %@", @"😱");
NSLog(@"dataUsingEncoding UTF8:  %@", [@"😱" dataUsingEncoding:NSUTF8StringEncoding]);
NSLog(@"dataUsingEncoding UTF16: %@", [@"😱" dataUsingEncoding:NSUTF16BigEndianStringEncoding]);
NSLog(@"dataUsingEncoding UTF32: %@", [@"😱" dataUsingEncoding:NSUTF32BigEndianStringEncoding]);

// For some surrogate pairs there is no other form

NSString *aReverse = [[NSString alloc] initWithBytes:"\xD8\x3C\xDD\x70\x00" length:4 encoding:NSUTF16BigEndianStringEncoding];
NSLog(@"character: %@", aReverse);
NSLog(@"dataUsingEncoding UTF8:  %@", [aReverse dataUsingEncoding:NSUTF8StringEncoding]);
NSLog(@"dataUsingEncoding UTF16: %@", [aReverse dataUsingEncoding:NSUTF16BigEndianStringEncoding]);
NSLog(@"dataUsingEncoding UTF32: %@", [aReverse dataUsingEncoding:NSUTF32BigEndianStringEncoding]);

NSLog output:

character: Å
decomposedStringWithCanonicalMapping UTF8:  <41cc8a>   
decomposedStringWithCanonicalMapping UTF16: <0041030a>   
decomposedStringWithCanonicalMapping UTF32: <00000041 0000030a>   

precomposedStringWithCanonicalMapping UTF8:  <c385>   
precomposedStringWithCanonicalMapping UTF16: <00c5>   
precomposedStringWithCanonicalMapping UTF32: <000000c5>   

character: 😱
dataUsingEncoding UTF8:  <f09f98b1>   
dataUsingEncoding UTF16: <d83dde31>   
dataUsingEncoding UTF32: <0001f631>   

character: 🅰
dataUsingEncoding UTF8:  <f09f85b0>
dataUsingEncoding UTF16: <d83cdd70>
dataUsingEncoding UTF32: <0001f170>
Sacchariferous answered 17/6, 2014 at 0:48 Comment(4)
This seems to be really useful information. I can't verify at the moment because I can't upgrade my MySQL at the moment, but I am going to assume you are correct and accept your answer :) Thanks!Posen
I have an additional question if you don't mind. If I don't need string matching like I described, are there any other downsides to not calling set names utf8? I know the data is inserted incorrectly, but it works when displaying the data in the clients.Posen
Unfortunately I don't know anything about set names utf8, it has been years since I wrote code to work with MySQL. Do use Plane 1 code points for testing (emoji are in Plane 1). Also surrogated pairs but they probably only are issues for comparisons. I have added some code and results to the answer to demonstrate the different unicode byte representations.Sacchariferous
Wanted to inform you that I upgraded MySQL and changed my character sets to utf8mb4_unicode_ci - and all is well now. Thank you very much!Posen

© 2022 - 2024 — McMap. All rights reserved.