php find emoji [update existing code]
P

5

5

I'm trying to detect emoji in my PHP code and prevent users from entering them.

The code I have is:

if(preg_match('/\xEE[\x80-\xBF][\x80-\xBF]|\xEF[\x81-\x83][\x80-\xBF]/', $value) > 0)
{
    //warning...
}

But it doesn't work for all emoji. Any ideas?

Pedestrian answered 12/5, 2012 at 13:13 Comment(0)
S
10
if(preg_match('/\xEE[\x80-\xBF][\x80-\xBF]|\xEF[\x81-\x83][\x80-\xBF]/', $value) > 0)

You really want to match Unicode at a character level, rather than trying to keep track of UTF-8 byte sequences. Use the u modifier to treat your UTF-8 string on a character basis.
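As a rough illustration of the difference, the sketch below counts "characters" with and without the u modifier. The sample string is hypothetical, and the emoji is written as raw UTF-8 bytes so the snippet does not depend on the file's encoding.

```php
<?php
// Hypothetical sample input: "Hello " followed by U+1F30F (a globe emoji).
$value = "Hello \xF0\x9F\x8C\x8F";

// Without /u the dot matches one byte, so the 4-byte emoji counts 4 times.
$byteCount = preg_match_all('/./s', $value);

// With /u the dot matches one code point, so the emoji counts once.
$charCount = preg_match_all('/./su', $value);

// With /u a character class can name code points directly.
$hasPictograph = preg_match('/[\x{1F300}-\x{1F5FF}]/u', $value) === 1;
```

Here the byte-level count is 10 and the character-level count is 7, which is why byte-range patterns get hard to maintain once code points outside the BMP are involved.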

Many of the emoji added in Unicode 6.0 are encoded in the Miscellaneous Symbols and Pictographs block, U+1F300–U+1F5FF. However:

  • many characters from Japanese carriers' ‘emoji’ sets are actually mapped to existing Unicode symbols, e.g. the card suits, zodiac signs and some arrows. Do you count these symbols as ‘emoji’ now?

  • there are still systems which don't use the newly-standardised Unicode emoji code points, instead using ad-hoc ranges in the Private Use Area. Each carrier had its own encodings. iOS 4 used the Softbank set. More info. You may wish to block the entire Private Use Area.

For example:

function unichr($i) {
    return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}

// The u modifier makes PCRE treat the pattern and the subject as UTF-8 characters.
if (preg_match('/['.
    unichr(0x1F300).'-'.unichr(0x1F5FF).
    unichr(0xE000).'-'.unichr(0xF8FF).
']/u', $value)) {
    // ...
}
Synecology answered 14/5, 2012 at 13:41 Comment(2)
Hi, thank you for that idea, but it doesn't work for all emoji :) Is there a way to add support for this one: ❤? And probably some others? It works perfectly for iOS emoji now. Thank you. Pedestrian
Well, that's the question of what you count as emoji. ❤ existed as a general symbol long before anyone conceived of emoji. If you want to block just the symbols that have been re-used for emoji, look at the Emoji For PHP link above and pick out all the U+2xxx code points used. Alternatively, consider blocking a wider range of symbols if you don't need them, e.g. unichr(0x2190).'-'.unichr(0x27FF). Synecology
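A minimal sketch of the wider-range idea from the comment above, assuming PHP 7+ and using \x{...} escapes in place of the unichr() helper. The function name containsBlockedSymbol is hypothetical, and whether U+2190 to U+27FF is the right extra range depends on which symbols you can afford to lose (it also covers mathematical operators and box-drawing characters).

```php
<?php
// Hypothetical helper: the name and the exact ranges are illustrative only.
function containsBlockedSymbol(string $value): bool
{
    $pattern = '/['
        . '\x{1F300}-\x{1F5FF}' // Miscellaneous Symbols and Pictographs
        . '\x{E000}-\x{F8FF}'   // Private Use Area (carrier-specific emoji)
        . '\x{2190}-\x{27FF}'   // Arrows through Dingbats and nearby symbol blocks
        . ']/u';

    // preg_match() returns false for invalid UTF-8 input; that is treated
    // as "no match" here, so validate the encoding separately if it matters.
    return preg_match($pattern, $value) === 1;
}
```

With this range, "I \xE2\x9D\xA4 PHP" (containing ❤, U+2764) is flagged, while ordinary text such as a price with a £ sign is not.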
A
2

From Wikipedia:

The core emoji set as of Unicode 6.0 consists of 722 characters, of which 114 characters map to sequences of one or more characters in the pre-6.0 Unicode standard, and the remaining 608 characters map to sequences of one or more characters introduced in Unicode 6.0.[4] There is no block specifically set aside for emoji – the new symbols were encoded in seven different blocks (some newly created), and there exists a Unicode data file called EmojiSources.txt that includes mappings to and from the Japanese vendors' legacy character sets.

Here is the mapping file. There are 722 lines in the file, each one representing one of the 722 emoji.

It seems like this is not an easy thing to do, because there is no specific block set aside for emoji. You need to adjust your regex to cover all of the emoji code points.

You could match an individual code point like so:

\x{1F30F}

U+1F30F is the code point for a globe emoji.
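One way to scale this up, sketched under assumptions: build a character class from a list of hex code points. The three values below are hand-picked stand-ins for the list you would extract from the mapping file, and the u modifier is needed for PHP to accept \x{...} values above 0xFF.

```php
<?php
// Hypothetical sample: a few code points standing in for the full list
// that would come from the mapping file.
$codePoints = ['1F30F', '2764', '2660'];

// Turn each hex code point into a \x{...} escape and join them in one class.
$escapes = array_map(function ($hex) {
    return '\x{' . $hex . '}';
}, $codePoints);
$pattern = '/[' . implode('', $escapes) . ']/u';

$hasListed = preg_match($pattern, "I \xE2\x9D\xA4 PHP") === 1; // contains U+2764
$plainText = preg_match($pattern, 'I love PHP') === 1;
```

This only handles single code points; entries that are sequences of several code points would need to be matched as alternations rather than inside a character class.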

Sorry I don't have a full answer for you, but this should get you headed in the right direction.

Albumenize answered 12/5, 2012 at 17:58 Comment(0)
C
1

The right answer is to detect where you have an assigned code point in the Miscellaneous_Symbols_And_Pictographs block. In Perl, you’d use

 /\p{Assigned}/ && /\p{block=Miscellaneous_Symbols_And_Pictographs}/

or just

/\P{Cn}/ && /\p{Miscellaneous_Symbols_And_Pictographs}/

which you should combine into one pattern with

/(?=\p{Assigned})\p{Miscellaneous_Symbols_And_Pictographs}/

I don’t recall whether the PCRE library that PHP uses gives you access to the requisite Unicode character properties. My recollection is that it’s pretty weak in that particular area. I think you only have Unicode script properties and general categories. Sigh.

Sometimes you just have to use the real thing.

For lack of decent Unicode support, you may have to enumerate the block yourself:

/(?=\P{Cn})[\x{1F300}-\x{1F5FF}]/

Looks like a maintenance nightmare to me, full of magic numbers.
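In PHP specifically, a pattern like the one above probably needs the u modifier: without it, PCRE typically refuses \x{...} values above 0xFF at compile time. A minimal sketch, assuming the only change is the trailing u and using a hypothetical input string:

```php
<?php
// Assumption: same pattern as above, with the u modifier added for PHP.
$pattern = '/(?=\P{Cn})[\x{1F300}-\x{1F5FF}]/u';

// Hypothetical input containing U+1F30F, written as raw UTF-8 bytes.
$value = "Around the world \xF0\x9F\x8C\x8F";

$found = preg_match($pattern, $value);         // 1 when an assigned code point in the block is present
$clean = preg_match($pattern, 'plain ASCII');  // 0 when nothing in the block is present
```

Which code points count as assigned depends on the Unicode version the PCRE library was built with, so the lookahead may behave differently across installations.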

Coelom answered 13/5, 2012 at 1:0 Comment(3)
Sequence is too large at offset 19 :( Pedestrian
@Pedestrian I have no idea what that might mean. It’s a legal range. Can you not specify the emoji range as /[\x{1F300}-\x{1F5FF}]/? Coelom
It works now ... but doesn't recognize all emoji :( When I use the ones on the iOS 'Emoji' keyboard, it doesn't detect them ... Pedestrian
W
1

Here's my solution, which is a simpler (thanks to PHP 7) version of bobince's answer.

<?php
if (preg_match("/[\u{1f300}-\u{1f5ff}\u{e000}-\u{f8ff}]/u", $text)) {
  // echo "😭 oh no. Emojis not allowed!";
}

EDIT Following the suggestion in bobince's answer, this regex matches both the actual emoji range (1f300 - 1f5ff) and the Private Use Area range (e000 - f8ff) that bobince proposed you might also want to block.

EDIT 2 to be clear: this simpler format is possible in PHP 7.0+. If you're still on an older (now unsupported) version of PHP, you'll need to use the original answer.
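To make the PHP 7 point concrete, here is a small sketch of how the "\u{...}" escape behaves. PHP itself expands it inside double-quoted strings, so the regex engine receives the actual UTF-8 character; the variable names are illustrative.

```php
<?php
// "\u{...}" is expanded by PHP inside double-quoted strings (PHP 7.0+).
$fromEscape = "\u{1F300}";
$fromBytes  = "\xF0\x9F\x8C\x80"; // the same code point as raw UTF-8 bytes
$same = ($fromEscape === $fromBytes);

// In single quotes the escape is not expanded, so the string keeps the
// backslash and is 9 bytes long: \ u { 1 F 3 0 0 }
$literal = '\u{1F300}';
$literalLength = strlen($literal);
```

The practical consequence is that the pattern in this answer has to stay in double quotes; in a single-quoted pattern, PCRE would see a literal \u escape, which it does not support.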

Weide answered 19/1, 2021 at 14:20 Comment(5)
This answer is missing its educational explanation. The fact that bobince's answer has an explanation is no excuse to withhold one here. Dishabille
@Dishabille really? Linking within the same page to the currently accepted answer? I thought it respectful not to copy and paste or otherwise restate what another contributor has done perfectly well, and to show credit where it's due. I think people should read bobince's answer; this is just a convenient update for PHP 7.0+. Weide
Then explain how your answer is more modern/simpler. Explain why someone should use yours and not bobince's. Something more than a snippet and a link. Dishabille
Alternatively, if you just want to drop an updated regex pattern, you can leave that as a comment under bobince's answer ... but I assume you want rep points for your contribution, so you may as well post a complete and explained answer. Dishabille
I thought I had done that in my opening sentence, but in case it was a bit cryptic I've explained it explicitly now. I don't massively care for points, but I contribute to make SE useful for me (I regularly find my own questions!) and others. You're right that including all of the info in one post is the most convenient way. Weide
P
-2

That's what I came up with today. It's probably not a good solution for this problem, but at least it works ;)

if(iconv('Windows-1250', 'UTF-8', iconv('UTF-8', 'Windows-1250', $value)) != $value)
Pedestrian answered 13/5, 2012 at 13:45 Comment(5)
You’re on Microsoft???? That’s probably the bug: Microsoft has lots of problems dealing with Unicode, especially the full Unicode range you’d need to handle emoji, since those are outside the BMP. You should have put WINDOWS in the tags. Couldn’t you just use a normal Unix system instead? Macs are cheap when you factor in their standards compliance, which is what you need here. Linux is even cheaper. Coelom
I've found out that's not a good 'workaround' ... It doesn't work for £, and some other characters ... Pedestrian
The Windows-1250 conversion suggested otherwise. But I don’t think this is something for which you should need to call iconv at all. Perhaps I misunderstand the problem. Coelom
It should be as you say (no iconv() call), but I don't know how to do that in PHP ... I just want to detect emoji in a string, and let the user know he has to remove them :) Pedestrian
This removes everything that's not encodable in cp1250 Central European. So that certainly kills emoji, but also the vast majority of the rest of Unicode... Synecology
