What's up with these Unicode combining characters and how can we filter them?

Asked 2/5, 2012 at 13:34 Answered 17/3, 2016 at 12:38

Solved unicode sanitize combining-marks zalgo

กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้

These recently showed up in Facebook comment sections.

How can we sanitize this?

Burnt answered 2/5, 2012 at 13:34 Comment(16)

Haven't you asked this question before? (Honest question.) – Jut 2/5, 2012 at 13:35

Those are most definitely not ASCII – Armond 2/5, 2012 at 13:35

If I had I wouldn't ask it again. – Burnt 2/5, 2012 at 13:36

Sorry, my bad, changed tags to "unicode" . – Burnt 2/5, 2012 at 13:37

try translating these using translate.google.com – End 2/5, 2012 at 13:38

@AshwiniChaudhary I have done this and what should be the expected output ? It didn't change much... – Gynophore 2/5, 2012 at 13:40

Why the closing votes? It's a programming-related question, as I want to know how to sanitize this type of input so the comment sections on my website will not be the 13 years old's playground... – Burnt 2/5, 2012 at 13:51

How can we sanitize this? -- Why? – Thirtythree 2/5, 2012 at 21:58

กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิ "so the comment sections on my website will not be the 13 years old's playground." Actually, without sanitization, someone posting these characters can make the comment above it unreadable, which is not at all a pleasant user experience. – Burnt 2/5, 2012 at 22:21

Shouldn't we actually consider it a browser bug? In my opinion, the browser should enlarge the containing box so that all text, including the accents, fits in and doesn't overflow over or under other boxes – Exorcise 3/5, 2012 at 11:7

@pjotr It's definitely not a browser bug. If you want the characters not to overflow the containing box you can simply solve that with CSS (overflow:hidden;)... – Burnt 3/5, 2012 at 11:29

@Cristy: Great point about overflow: hidden. – Carlton 4/5, 2012 at 6:53

Another post about this particular display issue (just related, not a duplicate): What's the character encoding used? – Unparliamentary 4/5, 2012 at 20:29

Based on this answer: #7119615 It DOES look like it may be a browser problem, or even OS. There is a problem with Thai Unicode. – Chlorosis 14/3, 2013 at 23:5

Related: How does Zalgo text work? – Phyliciaphylis 28/1, 2014 at 1:43

As a note, it seems that Stack Overflow fixed this issue with large Unicode characters overlapping other text. – Burnt 14/10, 2017 at 10:44

What's up with these unicode characters?

That's a character with a series of combining characters. Because the combining characters in question want to go above the base character, they stack up (literally). For instance, in the case of

ก้้้้้้้้้้้้้้้้้้้้

...it's an ก (Thai character ko kai) (U+0E01) followed by 20 copies of the Thai combining character mai tho (U+0E49).

How can we sanitize this?

You could pre-process the text and limit the number of combining characters that can be applied to a single character, but the effort may not be worth the reward. You'd need the data sheets for all the current characters so you'd know whether they were combining or what, and you'd need to be sure to allow at least a few because some languages are written with several diacritics on a single base. Now, if you want to limit comments to the Latin character set, that would be an easier range check, but of course that's only an option if you want to limit comments to just a few languages. More information, code sheets, etc. at unicode.org.
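The pre-processing idea can be sketched in a few lines. This is only a minimal sketch in Python, assuming the standard `unicodedata` module is an acceptable stand-in for the "data sheets"; the function name `limit_combining_marks` and the default limit of three marks are illustrative choices, not anything prescribed by Unicode.

```python
import unicodedata


def limit_combining_marks(text: str, max_marks: int = 3) -> str:
    """Keep at most `max_marks` consecutive combining marks; drop the rest."""
    out = []
    run = 0  # length of the current run of combining marks
    for ch in text:
        # General categories Mn, Mc and Me are the Unicode "Mark" categories.
        if unicodedata.category(ch).startswith("M"):
            run += 1
            if run > max_marks:
                continue  # excess mark: skip it
        else:
            run = 0  # any non-mark character ends the run
        out.append(ch)
    return "".join(out)


stacked = "\u0e01" + "\u0e49" * 20  # ko kai followed by 20 copies of mai tho
print(len(limit_combining_marks(stacked)))  # 4: the base plus three marks
```

The limit counts every mark in a run, not just repeats of the same one, so alternating two different marks does not get around it. What a sensible limit is depends on the languages you need to support.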

BTW, if you ever want to know how some character was composed, for another question just recently I coded up a quick-and-dirty "Unicode Show Me" page on JSBin. You just copy and paste the text into the text area, and it shows you all of the code points (~characters) that the text is made up of, with links such as those above to the page describing each character. It only works for code points in the range U+FFFF and under, because it's written in JavaScript and to handle characters above U+FFFF in JavaScript you have to do more work than I wanted to do for that question (because in JavaScript, a "character" is always 16 bits, which means for some languages a character can be split across two separate JavaScript "characters" and I didn't account for that), but it's handy for most texts...
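For anyone who would rather do the same inspection locally, here is a rough Python equivalent of that idea. It is a sketch, not a port of the JSBin page: the helper name `describe` is made up for illustration. Python strings iterate by code point, so characters above U+FFFF need no special handling here.

```python
import unicodedata
from typing import List


def describe(text: str) -> List[str]:
    """Return one 'U+XXXX NAME (category)' line per code point in `text`."""
    lines = []
    for ch in text:  # iterates code points, not UTF-16 code units
        name = unicodedata.name(ch, "<unnamed>")
        lines.append(f"U+{ord(ch):04X} {name} ({unicodedata.category(ch)})")
    return lines


for line in describe("\u0e01\u0e49"):
    print(line)
```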

Carlton answered 2/5, 2012 at 13:42 Comment(6)

Wouldn't you just delete repeated copies of the same combining codepoint back to back into a single copy? When would you ever need to combine the same codepoint onto a base codepoint more than once? – Hollow 2/5, 2012 at 20:43

@RemyLebeau: "When would you ever need to combine the same codepoint onto a base codepoint more than once?" I don't know, I know very, very little about how you write other languages -- Thai, for instance. I wouldn't be at all surprised to find out that more than one of the same code point was valid in some. But doing that doesn't reduce the complexity; you still need one of the Unicode tables for figuring out which ones are combining characters. – Carlton 3/5, 2012 at 8:7

I made your page accept the unicode string from the url e.g. jsbin.com/erajer/7/… – Sadye 12/3, 2013 at 16:4

JavaScript library to easily remove Unicode combining marks from strings: mths.be/stripcombiningmarks – Beesley 8/1, 2014 at 8:55

JavaScript uses UTF-16 with « surrogate pairs » – Manhunt 26/7, 2016 at 14:9

@dolmen: UTF-16 always has the possibility of surrogate pairs. What you mean is that JavaScript tolerates invalid sequences, where (of course) UTF-16 does not. – Carlton 26/7, 2016 at 14:40

If you have a regex engine with decent Unicode support, it's trivial to sanitize this kind of string. In Perl, for example, you can remove all but the first combining mark from every (user-perceived) character like this:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;

    binmode(STDOUT, ':utf8');

    my $string = "กิิ ก้้ ก็็ ก็็ กิิ ก้้ ก็็ กิิ ก้้ กิิ ก้้ ก็็ ก็็ กิิ ก้้ ก็็ กิิ ก้้";
    # Keep the first mark of every run of combining marks and strip the rest
    $string =~ s/(\p{Mark})\p{Mark}+/$1/g;
    print("$string\n");

This will print:

กิ ก้ ก็ ก็ กิ ก้ ก็ กิ ก้ กิ ก้ ก็ ก็ กิ ก้ ก็ กิ ก้

Phyliciaphylis answered 12/3, 2013 at 18:33 Comment(4)

I can't read Tibetan, but I'm concerned that this brute force approach may remove functionality from the way the language is designed. I've seen unicode that has legitimate use-cases of more than one combining mark. Arabic is a good example. I'll try to remember to run this by my Tibetan co-workers. – Chlorosis 12/3, 2013 at 19:18

You're right, there are certainly cases where multiple combining marks are legitimate. But you can easily change the regex to allow a certain maximum of marks. – Phyliciaphylis 12/3, 2013 at 19:45

Upvoted because it does answer the 'how do you sanitize this' question. But I think this would be a maintenance nightmare. – Chlorosis 15/3, 2013 at 0:8

Also, the RE just removes adjacent duplication. It would not clean up, say: <base><macron><overline><macron><overline>.... So, if your text needs multiple different combining characters, it will pass through fine; and malicious text could still be built. – Scharaga 10/7, 2018 at 15:47

"How can we sanitize this" is best answered above by T.J Crowder

However, I think sanitization is the wrong approach, and Cristy has it right with overflow:hidden on the containing element in CSS.

At least, that's how I'm solving it.

Chlorosis answered 12/3, 2013 at 18:0 Comment(0)

OK, this one took me a while to figure out. I was under the impression that the combining characters used to produce zalgo were limited to the ranges listed on the wiki, so I expected the following regex to catch the freaks:

([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]{2,})

It didn't work.

The catch is that the list on the wiki does not cover the full range of combining characters.

What gave me a hint is "ก้้้้้้้้้้้้้้้้้้้้".charCodeAt(2).toString(16) = "e49", which is not within any of those combining ranges. U+0E49 sits in the Thai block (U+0E00 to U+0E7F), not in one of the dedicated combining blocks.
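To double-check where a code point like this really lives, you can ask a Unicode database directly. A small sketch with Python's standard `unicodedata` module (the variable name `mark` is just for illustration): it reports U+0E49 as a nonspacing mark, and a simple range test shows it is nowhere near the BMP Private Use Area.

```python
import unicodedata

mark = "\u0e49"

# Name and general category as recorded in the Unicode Character Database.
print(f"U+{ord(mark):04X}", unicodedata.name(mark), unicodedata.category(mark))

# The Private Use Area in the BMP is U+E000..U+F8FF; U+0E49 is far below it.
in_private_use = 0xE000 <= ord(mark) <= 0xF8FF
print(in_private_use)  # False
```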

In C# they fall under UnicodeCategory.NonSpacingMark, and the following script flushes them out:

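    // Requires: using System; using System.Globalization; using System.IO;
    //           using System.Linq; using NUnit.Framework;
    [Test]
    public void IsZalgo()
    {
        // Categories to list; NonSpacingMark (Mn) covers the stacking marks.
        var zalgo = new[] { UnicodeCategory.NonSpacingMark };

        File.Delete("IsModifyLike.html");
        File.AppendAllText("IsModifyLike.html", "<table>");
        for (var i = 0; i <= 0xFFFF; i++) // every UTF-16 code unit in the BMP
        {
            var c = (char)i;
            if (zalgo.Contains(Char.GetUnicodeCategory(c)))
            {
                // Columns: hex code, the character, its category, and "A" with the mark applied three times.
                File.AppendAllText("IsModifyLike.html", string.Format("<tr><td>{0}</td><td>{1}</td><td>{2}</td><td>A&#{3};&#{3};&#{3};</td></tr>\n", i.ToString("X"), c, Char.GetUnicodeCategory(c), i));
            }
        }
        File.AppendAllText("IsModifyLike.html", "</table>");
    }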

By looking at the generated table you should be able to see which ones stack. One range that is missing from the wiki is 06D6-06DC; another is 0730-0749.
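If you only need the ranges and not the visual table, the same scan can be collapsed into contiguous ranges. This is a sketch in Python rather than C#, assuming the Mn (nonspacing mark) category is the right filter; the function name `nonspacing_mark_ranges` is hypothetical, and the exact ranges printed depend on the Unicode version bundled with your Python.

```python
import unicodedata
from typing import List, Tuple


def nonspacing_mark_ranges(limit: int = 0x10000) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) ranges of code points with category Mn."""
    ranges = []
    start = None  # first code point of the range currently being built
    for cp in range(limit):
        is_mark = unicodedata.category(chr(cp)) == "Mn"
        if is_mark and start is None:
            start = cp  # a new range opens here
        elif not is_mark and start is not None:
            ranges.append((start, cp - 1))  # range ended at the previous code point
            start = None
    if start is not None:
        ranges.append((start, limit - 1))  # range runs up to the scan limit
    return ranges


for lo, hi in nonspacing_mark_ranges():
    print(f"{lo:04X}-{hi:04X}")
```

Ranges built this way contain only marks, so base letters in the same block are never swept in.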

UPDATE:

Here's an updated regex that should fish out all the zalgo, including the ones that slipped past the 'normal' ranges.

([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F\u0483-\u0486\u05C7\u0610-\u061A\u0656-\u065F\u0670\u06D6-\u06ED\u0711\u0730-\u073F\u0743-\u074A\u0F18-\u0F19\u0F35\u0F37\u0F72-\u0F73\u0F7A-\u0F81\u0F84\u0e00-\u0eff\uFC5E-\uFC62]{2,})

The hardest bit is identifying them; once you have done that, there's a multitude of solutions, including some good ones above.

Hope this saves you some time.

Enoch answered 17/3, 2016 at 12:38 Comment(6)

I would say, not to spam this spam! – Outrush 17/3, 2016 at 12:40

@PraveenKumar Would you care to elaborate on what you mean? – Enoch 17/3, 2016 at 12:41

I appreciate your answer, but this is a lost answered question. So why to add new answers unnecessarily? It is just my view. Moreover, your answer is not JavaScript, right? – Outrush 17/3, 2016 at 12:42

@PraveenKumar It uncovers why normal zalgo validation ([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]{2,}) does not work. Don't you find it interesting that stacking Unicode is not limited to what's on the wiki? What do you mean by 'lost answered question'? EDIT: You might find it odd to add an answer to a 3-year-old question, but since it took me a while to figure out why this type of zalgo worked, I couldn't let such knowledge go to waste. The next guy will save some time. – Enoch 17/3, 2016 at 12:45

@PraveenKumar the question does not state a language, and posting a new answer on an old question is completely appropriate if the old answers were deficient in some way. Unfortunately I do not have enough experience with this problem, or it would get an upvote from me. – Cleanser 21/3, 2016 at 13:25

This RE has the benefit of catching mixed combining characters, with the drawback of never allowing a base that properly does need more than one combining character. – Scharaga 10/7, 2018 at 15:54
