What's up with these Unicode combining characters and how can we filter them?
B

4

93

กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้

These recently showed up in Facebook comment sections.

How can we sanitize this?

Burnt answered 2/5, 2012 at 13:34 Comment(16)
Haven't you asked this question before? (Honest question.) – Jut
Those are most definitely not ASCII. – Armond
If I had, I wouldn't ask it again. – Burnt
Sorry, my bad, changed tags to "unicode". – Burnt
Try translating these using translate.google.com. – End
@AshwiniChaudhary I have done this, and what should the expected output be? It didn't change much... – Gynophore
Why the close votes? It's a programming-related question, as I want to know how to sanitize this type of input so the comment sections on my website will not be the 13-year-olds' playground... – Burnt
How can we sanitize this? -- Why? – Thirtythree
กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิ"so the comment sections on my website will not be the 13 years old's playground." Actually without sanitization one posting these characters can make the comment above it unreadable, which is not at all a pleasent user experience.Burnt
Shouldn't we actually consider it a browser bug? In my opinion, the browser should enlarge the containing box so that all text, including the accents, fits in and doesn't overflow over/under other boxes. – Exorcise
@pjotr It's definitely not a browser bug. If you want the characters not to overflow the containing box, you can simply solve that with CSS (overflow:hidden;)... – Burnt
@Cristy: Great point about overflow: hidden. – Carlton
Another post about this particular display issue (just related, not a duplicate): What's the character encoding used? – Unparliamentary
Based on this answer (#7119615), it DOES look like it may be a browser problem, or even an OS problem. There is a problem with Thai Unicode. – Chlorosis
Related: How does Zalgo text work? – Phyliciaphylis
As a note, it seems that Stack Overflow fixed this issue with large Unicode characters overlapping other text. – Burnt
C
82

What's up with these unicode characters?

That's a character with a series of combining characters. Because the combining characters in question want to go above the base character, they stack up (literally). For instance, in the case of

ก้้้้้้้้้้้้้้้้้้้้

...it's an ก (Thai character ko kai) (U+0E01) followed by 20 copies of the Thai combining character mai tho (U+0E49).
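
To check that composition yourself, here is a minimal JavaScript sketch. The `codePoints` helper name is made up for illustration, and the string is built from escapes rather than pasted so the source stays readable.

```javascript
// Hypothetical helper: list every code point of a string as U+XXXX.
function codePoints(text) {
  // Array.from iterates a string by code point, not by UTF-16 code unit.
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// ko kai (U+0E01) followed by 20 copies of mai tho (U+0E49).
const stacked = "\u0E01" + "\u0E49".repeat(20);
const points = codePoints(stacked);

console.log(points.length);      // 21
console.log(points.slice(0, 3)); // [ 'U+0E01', 'U+0E49', 'U+0E49' ]
```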

How can we sanitize this?

You could pre-process the text and limit the number of combining characters that can be applied to a single character, but the effort may not be worth the reward. You'd need the Unicode character data for all the current characters so you'd know which ones are combining, and you'd need to be sure to allow at least a few, because some languages are written with several diacritics on a single base. If you're willing to limit comments to the Latin character set, a simple range check would be easier, but of course that's only an option if you want to limit comments to just a few languages. More information, code charts, etc. at unicode.org.
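
As a rough sketch of that pre-processing idea, a JavaScript engine with Unicode property escapes (ES2018, `u` flag) already carries the character data, so `\p{M}` can stand in for a hand-maintained table. The function name `limitCombiningMarks` and the default cap of 3 are assumptions for illustration, not a recommendation for any particular language.

```javascript
// Hypothetical helper: keep at most `maxMarks` combining marks in each run.
// \p{M} matches General_Category = Mark and requires the `u` flag.
function limitCombiningMarks(text, maxMarks = 3) {
  // Capture the first `maxMarks` marks of a run, then drop the rest.
  const excess = new RegExp(`(\\p{M}{${maxMarks}})\\p{M}+`, "gu");
  return text.replace(excess, "$1");
}

const stacked = "\u0E01" + "\u0E49".repeat(20);
const cleaned = limitCombiningMarks(stacked, 2);    // base + 2 marks
const accented = limitCombiningMarks("e\u0301", 2); // left unchanged
```

The cap counts any mix of marks in a run, so alternating two different marks does not get around it.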

BTW, if you ever want to know how some character was composed, for another question just recently I coded up a quick-and-dirty "Unicode Show Me" page on JSBin. You just copy and paste the text into the text area, and it shows you all of the code points (~characters) that the text is made up of, with links such as those above to the page describing each character. It only works for code points U+FFFF and below. It's written in JavaScript, where a string "character" is always a 16-bit UTF-16 code unit, so a code point above U+FFFF is split across two separate JavaScript "characters" (a surrogate pair). Handling that was more work than I wanted to do for that question, so I didn't account for it, but it's handy for most texts...
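
For reference, the 16-bit limitation looks like this in practice. This is only a sketch of the general behavior, not the code of the JSBin page; U+1F600 is an arbitrary example of a code point above U+FFFF.

```javascript
// U+1F600 lies above U+FFFF, so UTF-16 stores it as a surrogate pair.
const text = "a\u{1F600}";

// .length and charCodeAt work on 16-bit code units.
const codeUnits = text.length;                     // 3
const firstUnit = text.charCodeAt(1).toString(16); // "d83d" (high surrogate)

// Array.from and codePointAt work on whole code points.
const byCodePoint = Array.from(text);              // 2 elements
const hex = byCodePoint.map((ch) => ch.codePointAt(0).toString(16));
```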

Carlton answered 2/5, 2012 at 13:42 Comment(6)
Wouldn't you just collapse back-to-back copies of the same combining codepoint into a single copy? When would you ever need to combine the same codepoint onto a base codepoint more than once? – Hollow
@RemyLebeau: "When would you ever need to combine the same codepoint onto a base codepoint more than once?" I don't know, I know very, very little about how you write other languages -- Thai, for instance. I wouldn't be at all surprised to find out that more than one of the same code point was valid in some. But doing that doesn't reduce the complexity; you still need one of the Unicode tables for figuring out which ones are combining characters. – Carlton
I made your page accept the Unicode string from the URL, e.g. jsbin.com/erajer/7/… – Sadye
JavaScript library to easily remove Unicode combining marks from strings: mths.be/stripcombiningmarks – Beesley
JavaScript uses UTF-16 with « surrogate pairs » – Manhunt
@dolmen: UTF-16 always has the possibility of surrogate pairs. What you mean is that JavaScript tolerates invalid sequences, where (of course) UTF-16 does not. – Carlton
P
17

If you have a regex engine with decent Unicode support, it's trivial to sanitize this kind of string. In Perl, for example, you can remove all but the first combining mark from every (user-perceived) character like this:

#!/usr/bin/perl
use strict;
use utf8;

binmode(STDOUT, ':utf8');

my $string = "กิิ ก้้ ก็็ ก็็ กิิ ก้้ ก็็ กิิ ก้้ กิิ ก้้ ก็็ ก็็ กิิ ก้้ ก็็ กิิ ก้้";
$string =~ s/(\p{Mark})\p{Mark}+/$1/g; # Strip excess combining marks
print("$string\n");

This will print:

กิ ก้ ก็ ก็ กิ ก้ ก็ กิ ก้ กิ ก้ ก็ ก็ กิ ก้ ก็ กิ ก้

Phyliciaphylis answered 12/3, 2013 at 18:33 Comment(4)
I can't read Tibetan, but I'm concerned that this brute force approach may remove functionality from the way the language is designed. I've seen unicode that has legitimate use-cases of more than one combining mark. Arabic is a good example. I'll try to remember to run this by my Tibetan co-workers. – Chlorosis
You're right, there are certainly cases where multiple combining marks are legitimate. But you can easily change the regex to allow a certain maximum of marks. – Phyliciaphylis
Upvoted because it does answer the 'how do you sanitize this' question. But I think this would be a maintenance nightmare. – Chlorosis
Also, the RE just removes adjacent duplication. It would not clean up, say: <base><macron><overline><macron><overline>.... So, if your text needs multiple different combining characters, it will pass through fine; and malicious text could still be built. – Scharaga
C
14

"How can we sanitize this" is best answered above by T.J. Crowder.

However, I think sanitization is the wrong approach, and Cristy has it right with overflow:hidden on the containing CSS element.

At least, that's how I'm solving it.

Chlorosis answered 12/3, 2013 at 18:0 Comment(0)
E
7

OK, this one took me a while to figure out. I was under the impression that the combining characters used to produce zalgo were limited to the ranges listed on the wiki, so I expected the following regex to catch the freaks.

([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]{2,})

and it didn't work...

The catch is that the list on the wiki does not cover the full range of combining characters.

What gave me a hint is that "ก้้้้้้้้้้้้้้้้้้้้".charCodeAt(2).toString(16) returns "e49", which is not within any of those combining ranges; it falls in the Thai block.
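
The same observation can be checked in JavaScript with a Unicode property escape. This is a sketch that assumes an ES2018+ engine; the variable names are made up for illustration.

```javascript
// The dedicated "combining" blocks from the regex above.
const wikiRanges =
  /[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]/;

const maiTho = "\u0E49"; // THAI CHARACTER MAI THO

// Not in any of the listed blocks...
const inWikiRanges = wikiRanges.test(maiTho);    // false
// ...but its general category is still Nonspacing_Mark (Mn).
const isNonspacingMark = /\p{Mn}/u.test(maiTho); // true
```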

In C# they fall under UnicodeCategory.NonSpacingMark, and the following test lists them in an HTML table:

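    using System;
    using System.Globalization;
    using System.IO;
    using System.Linq;
    using NUnit.Framework;

    // Class name is arbitrary; any NUnit test class will do.
    public class ZalgoTests
    {
        [Test]
        public void IsZalgo()
        {
            // Categories to list; only nonspacing marks here.
            var zalgo = new[] { UnicodeCategory.NonSpacingMark };

            File.Delete("IsModifyLike.html");
            File.AppendAllText("IsModifyLike.html", "<table>");
            // Walk every UTF-16 code unit in the Basic Multilingual Plane.
            for (var i = 0; i <= char.MaxValue; i++)
            {
                var c = (char)i;
                if (zalgo.Contains(Char.GetUnicodeCategory(c)))
                {
                    // Columns: hex code, the character, its category,
                    // and "A" with the mark applied three times.
                    File.AppendAllText("IsModifyLike.html", string.Format("<tr><td>{0}</td><td>{1}</td><td>{2}</td><td>A&#{3};&#{3};&#{3};</td></tr>\n", i.ToString("X"), c, Char.GetUnicodeCategory(c), i));
                }
            }
            File.AppendAllText("IsModifyLike.html", "</table>");
        }
    }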

By looking at the generated table you should be able to see which ones stack. One range that is missing from the wiki list is 06D6-06DC; another is 0730-0749.
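
A comparable enumeration can be sketched in JavaScript, assuming an ES2018+ engine. It collects the BMP nonspacing marks that fall outside the five dedicated combining blocks. The function name `missedRanges` is made up for illustration, and the exact result depends on the Unicode version the engine ships with.

```javascript
// The five dedicated combining blocks covered by the first regex.
const covered =
  /[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]/;
const nonspacingMark = /\p{Mn}/u;

// Collect contiguous [start, end] runs of BMP nonspacing marks that the
// covered blocks miss.
function missedRanges() {
  const ranges = [];
  for (let cp = 0; cp <= 0xffff; cp++) {
    if (cp >= 0xd800 && cp <= 0xdfff) continue; // skip lone surrogates
    const ch = String.fromCharCode(cp);
    if (!nonspacingMark.test(ch) || covered.test(ch)) continue;
    const last = ranges[ranges.length - 1];
    if (last && last[1] === cp - 1) last[1] = cp; // extend the current run
    else ranges.push([cp, cp]);                   // start a new run
  }
  return ranges;
}

const ranges = missedRanges();
const contains = (cp) => ranges.some(([lo, hi]) => lo <= cp && cp <= hi);

// Print the runs in the same hex notation used above.
for (const [lo, hi] of ranges) {
  console.log(lo.toString(16) + "-" + hi.toString(16));
}
```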

UPDATE:

Here's an updated regex that should fish out all the zalgo, including the ones that slipped past the 'normal' ranges.

([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F\u0483-\u0486\u05C7\u0610-\u061A\u0656-\u065F\u0670\u06D6-\u06ED\u0711\u0730-\u073F\u0743-\u074A\u0F18-\u0F19\u0F35\u0F37\u0F72-\u0F73\u0F7A-\u0F81\u0F84\u0e00-\u0eff\uFC5E-\uFC62]{2,})

Be aware that \u0e00-\u0eff covers the whole Thai and Lao blocks, base letters included, so this also matches ordinary runs of Thai or Lao text.

The hardest bit is identifying them; once you have done that, there's a multitude of solutions, including some good ones above.

Hope this saves you some time.

Enoch answered 17/3, 2016 at 12:38 Comment(6)
I would say, not to spam this spam! – Outrush
@PraveenKumar Would you care to elaborate on what you mean? – Enoch
I appreciate your answer, but this is a lost answered question. So why add new answers unnecessarily? It is just my view. Moreover, your answer is not JavaScript, right? – Outrush
@PraveenKumar It uncovers why normal zalgo validation ([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]{2,}) does not work. Don't you find it interesting that stacking Unicode is not limited to what's on the wiki? What do you mean by 'lost answered question'? EDIT: You might find it odd to add an answer to a 3-year-old question, but since it took me a while to figure out why this type of zalgo worked, I couldn't let such knowledge go to waste. The next guy will save some time. – Enoch
@PraveenKumar The question does not state a language, and posting a new answer on an old question is completely appropriate if the old answers were deficient in some way. Unfortunately I do not have enough experience with this problem, or it would get an upvote from me. – Cleanser
This RE has the benefit of catching mixed combining characters, with the drawback of never allowing a base that properly does need more than one combining character. – Scharaga
