First let's address your "55357 method" – and why it works for many emoji characters.
In Cocoa, an NSString is a collection of unichars, and unichar is just a typedef for unsigned short, which is the same as UInt16. Since the maximum value of UInt16 is 0xffff, this rules out quite a few emoji from being able to fit into one unichar, as only two out of the six main Unicode blocks used for emoji fall under this range: Miscellaneous Symbols (U+2600–U+26FF) and Dingbats (U+2700–U+27BF).
These blocks contain 113 emoji, and an additional 66 emoji that can be represented as a single unichar can be found spread around various other blocks. However, these 179 characters only represent a fraction of the 1126 emoji base characters, the rest of which must be represented by more than one unichar.
Let's analyse your code:
unichar unicodevalue = [text characterAtIndex:0];
What's happening is that you're simply taking the first unichar of the string, and while this works for the previously mentioned 179 characters, it breaks down when you encounter a character outside the Basic Multilingual Plane (a code point above U+FFFF), since NSString stores everything in UTF-16 encoding. Such a code point is represented by a surrogate pair, which means that the NSString contains two unichars for that one character.
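As a rough illustration of that conversion, here is a minimal sketch in plain C (which compiles unchanged inside an Objective-C file). The function name utf16_encode is made up for this example and is not part of Cocoa; it assumes the input is a valid code point.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: encode one Unicode code point as UTF-16.
// Writes 1 or 2 code units into `out` and returns how many were written.
// Assumes `cp` is a valid code point (<= 0x10FFFF and not itself a surrogate).
static size_t utf16_encode(uint32_t cp, uint16_t out[2]) {
    if (cp <= 0xFFFF) {
        // Fits in a single 16-bit unit, i.e. one unichar.
        out[0] = (uint16_t)cp;
        return 1;
    }
    // Subtract 0x10000, then split the remaining 20 bits into two 10-bit halves.
    uint32_t v = cp - 0x10000;
    out[0] = (uint16_t)(0xD800 + (v >> 10));    // high surrogate
    out[1] = (uint16_t)(0xDC00 + (v & 0x3FF));  // low surrogate
    return 2;
}
```

For U+1F60E (😎) this produces the pair 0xD83D, 0xDE0E, so characterAtIndex:0 on that string returns 0xD83D, which is 55357 in decimal.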
And now we're getting to why the number 55357, or 0xd83d, appears for many emoji: when you only look at the first UTF-16 value of a character above U+FFFF, you get the high surrogate, and each high surrogate covers a span of 1024 code points (one per low surrogate). The range for the high surrogate 0xd83d is U+1F400–U+1F7FF, which starts in the middle of the largest emoji block, Miscellaneous Symbols and Pictographs (U+1F300–U+1F5FF), and continues all the way up to Geometric Shapes Extended (U+1F780–U+1F7FF). This range contains a total of 563 emoji and 333 non-emoji characters.
So, an impressive 50% of emoji base characters have the high surrogate 0xd83d, but these deduction methods still leave 384 emoji characters unhandled, along with giving false positives for at least as many.
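The range arithmetic can be checked with a small sketch in plain C. The two function names are invented for this example; they just apply the surrogate formula to the lowest and highest possible low surrogate.

```c
#include <stdint.h>

// Hypothetical helper: first code point reachable through the given
// high surrogate (expected to be in 0xD800–0xDBFF).
static uint32_t high_surrogate_first(uint16_t high) {
    return (((uint32_t)high - 0xD800) << 10) + 0x10000;
}

// Hypothetical helper: last code point reachable through it.
// Each high surrogate pairs with 1024 low surrogates (0xDC00–0xDFFF).
static uint32_t high_surrogate_last(uint16_t high) {
    return high_surrogate_first(high) + 0x3FF;
}
```

Passing 0xD83D gives U+1F400 and U+1F7FF. Passing the neighbouring 0xD83C gives U+1F000 and U+1F3FF, which covers the first part of Miscellaneous Symbols and Pictographs, so a check for 55357 alone misses every emoji in that part of the block.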
So, how can you detect whether a character is an emoji or not?
I recently answered a somewhat related question with a Swift implementation, and if you want to, you can look at how emoji are detected in this framework, which I created for the purpose of replacing standard emoji with custom images.
Anyhow, what you can do is extract the UTF-32 code point from the characters, which we'll do according to the specification:
- (BOOL)textView:(UITextView *)textView shouldChangeTextInRange:(NSRange)range replacementText:(NSString *)text {
    // Length of the text in UTF-16 code units.
    NSUInteger length = text.length;
    // Initialize array to hold our UTF-32 values.
    NSMutableArray *array = [[NSMutableArray alloc] init];
    // Temporary stores for the UTF-32 and UTF-16 values.
    UTF32Char utf32 = 0;
    UTF16Char h16 = 0, l16 = 0;
    for (NSUInteger i = 0; i < length; i++) {
        // Read one UTF-16 code unit at a time (also safe when text is empty, e.g. on deletion).
        unichar surrogate = [text characterAtIndex:i];
        // High surrogate (0xd800–0xdbff): remember it and wait for the low half.
        if (0xd800 <= surrogate && surrogate <= 0xdbff) {
            h16 = surrogate;
            continue;
        }
        // Low surrogate.
        else if (0xdc00 <= surrogate && surrogate <= 0xdfff) {
            l16 = surrogate;
            // Convert surrogate pair to UTF-32 encoding.
            utf32 = ((h16 - 0xd800) << 10) + (l16 - 0xdc00) + 0x10000;
        }
        // Normal UTF-16.
        else {
            utf32 = surrogate;
        }
        // Add UTF-32 value to array.
        [array addObject:[NSNumber numberWithUnsignedInteger:utf32]];
    }
    NSLog(@"%@ contains values:", text);
    for (NSUInteger i = 0; i < array.count; i++) {
        UTF32Char character = (UTF32Char)[[array objectAtIndex:i] unsignedIntegerValue];
        NSLog(@"\t- U+%x", character);
    }
    return YES;
}
Typing "😎" into the UITextView writes this to the console:
😎 contains values:
- U+1f60e
With that logic, just compare the value of character to your data source of emoji code points, and you'll know exactly whether the character is an emoji or not.
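One possible shape for that comparison is a sorted table of code point ranges with a binary search, sketched here in plain C. Everything in it is an assumption made for illustration: the type CodePointRange, the table kEmojiRanges and the function is_listed_code_point are invented names, and the table holds a few whole Unicode blocks as placeholder data. Those blocks also contain non-emoji characters, so a real table should be generated from your actual emoji data source.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t first, last; } CodePointRange;

// PLACEHOLDER data, sorted by `first` and non-overlapping.
// Whole blocks are listed only to keep the example short.
static const CodePointRange kEmojiRanges[] = {
    { 0x2600,  0x26FF  },  // Miscellaneous Symbols
    { 0x2700,  0x27BF  },  // Dingbats
    { 0x1F300, 0x1F5FF },  // Miscellaneous Symbols and Pictographs
    { 0x1F600, 0x1F64F },  // Emoticons
    { 0x1F680, 0x1F6FF },  // Transport and Map Symbols
};

// Binary search: true if `cp` falls inside any listed range.
static bool is_listed_code_point(uint32_t cp) {
    size_t lo = 0;
    size_t hi = sizeof kEmojiRanges / sizeof kEmojiRanges[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < kEmojiRanges[mid].first) {
            hi = mid;
        } else if (cp > kEmojiRanges[mid].last) {
            lo = mid + 1;
        } else {
            return true;
        }
    }
    return false;
}
```

With this placeholder table, the value U+1F60E logged above is found in the Emoticons range, while an ordinary letter such as U+0041 is not.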
P.S.
There are a few "invisible" characters, namely Variation Selectors and zero-width joiners, that also should be handled, so I recommend studying those to learn how they behave.
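As a starting point, these characters can be recognised by their code points. The sketch below is plain C with invented function names; it only identifies the characters and makes no attempt to interpret the sequences they form.

```c
#include <stdbool.h>
#include <stdint.h>

// Zero-width joiner (U+200D): joins several emoji into one displayed glyph.
static bool is_zero_width_joiner(uint32_t cp) {
    return cp == 0x200D;
}

// Variation selectors VS1–VS16 (U+FE00–U+FE0F). VS15 (U+FE0E) requests text
// presentation and VS16 (U+FE0F) requests emoji presentation.
static bool is_variation_selector(uint32_t cp) {
    return cp >= 0xFE00 && cp <= 0xFE0F;
}

// True for code points that modify or join neighbouring characters
// rather than being visible characters themselves.
static bool is_invisible_emoji_helper(uint32_t cp) {
    return is_zero_width_joiner(cp) || is_variation_selector(cp);
}
```

All of these values are below 0x10000, so in the delegate method above they arrive through the "Normal UTF-16" branch as their own entries in the array, and should be skipped or attached to the preceding character rather than looked up as emoji themselves.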