MySQL, UTF-8 and Emoji characters

The issue is wether the db has a diacritical insensitive compare. The other issue is composed characters, ï can be expressed as either one unicode character or two forming a surrogate pair. There are methods to convert a string to a pre-composed or decomposed form: precomposedStringWith* and decomposedStringWith*.

It seems that MySQL supports two forms of unicode ucs2 (that is an older form that was supersede by utf16) which is 16-bits per character and utf8 up to 3 bytes per character. The bad news is that neither form is going to support plane 1 characters which require at 17 bits. (mainly emoji). It looks like MySQL 5.5.3 and up also support utf8mb4, utf16, and utf32 support BMP and supplementary characters (read emoji). See MySQL Unicode Character Sets.

Here is some code and results to demonstrate the different unicode byte representations.
Unicode is a 21 bit encoding system.
UTF32 directly represents the code points and clearly demonstrates decomposed surrogate pairs.
UTF8 and UTF16 require one or more bytes to represent a unicode character.

NSLog(@"character: %@", @"Å");
NSLog(@"decomposedStringWithCanonicalMapping UTF8:  %@", [[@"Å" decomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF8StringEncoding]);
NSLog(@"decomposedStringWithCanonicalMapping UTF16: %@", [[@"Å" decomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF16BigEndianStringEncoding]);
NSLog(@"decomposedStringWithCanonicalMapping UTF32: %@", [[@"Å" decomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF32BigEndianStringEncoding]);

NSLog(@"precomposedStringWithCanonicalMapping UTF8:  %@", [[@"Å" precomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF8StringEncoding]);
NSLog(@"precomposedStringWithCanonicalMapping UTF16: %@", [[@"Å" precomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF16BigEndianStringEncoding]);
NSLog(@"precomposedStringWithCanonicalMapping UTF32: %@", [[@"Å" precomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF32BigEndianStringEncoding]);

NSLog(@"character: %@", @"😱");
NSLog(@"dataUsingEncoding UTF8:  %@", [@"😱" dataUsingEncoding:NSUTF8StringEncoding]);
NSLog(@"dataUsingEncoding UTF16: %@", [@"😱" dataUsingEncoding:NSUTF16BigEndianStringEncoding]);
NSLog(@"dataUsingEncoding UTF32: %@", [@"😱" dataUsingEncoding:NSUTF32BigEndianStringEncoding]);

// For some surrogate pairs there is no other form

NSString *aReverse = [[NSString alloc] initWithBytes:"\xD8\x3C\xDD\x70\x00" length:4 encoding:NSUTF16BigEndianStringEncoding];
NSLog(@"character: %@", aReverse);
NSLog(@"dataUsingEncoding UTF8:  %@", [aReverse dataUsingEncoding:NSUTF8StringEncoding]);
NSLog(@"dataUsingEncoding UTF16: %@", [aReverse dataUsingEncoding:NSUTF16BigEndianStringEncoding]);
NSLog(@"dataUsingEncoding UTF32: %@", [aReverse dataUsingEncoding:NSUTF32BigEndianStringEncoding]);

NSLog output:

character: Å
decomposedStringWithCanonicalMapping UTF8:  <41cc8a>   
decomposedStringWithCanonicalMapping UTF16: <0041030a>   
decomposedStringWithCanonicalMapping UTF32: <00000041 0000030a>   

precomposedStringWithCanonicalMapping UTF8:  <c385>   
precomposedStringWithCanonicalMapping UTF16: <00c5>   
precomposedStringWithCanonicalMapping UTF32: <000000c5>   

character: 😱
dataUsingEncoding UTF8:  <f09f98b1>   
dataUsingEncoding UTF16: <d83dde31>   
dataUsingEncoding UTF32: <0001f631>   

character: 🅰
dataUsingEncoding UTF8:  <f09f85b0>
dataUsingEncoding UTF16: <d83cdd70>
dataUsingEncoding UTF32: <0001f170>

Recommended topics

Hot tags