OK, this one took me a while to figure out. I was under the impression that the combining characters used to produce zalgo are limited to these, so I expected the following regex to catch the freaks:
([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]{2,})
and it didn't work...
The catch is that the list on the wiki does not cover the full range of combining characters.
What gave me the hint was "ก้้้้้้้้้้้้้้้้้้้้".charCodeAt(2).toString(16)
= "e49". U+0E49 is not within any of those combining ranges; it falls into the Thai block (U+0E00-U+0E7F).
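The check above can be reproduced in JavaScript, and a Unicode property escape (\p{Mn} with the u flag, available since ES2018) confirms that the character is a nonspacing mark (general category Mn) even though it sits outside the listed ranges. This is only a minimal sketch; the helper name isNonSpacingMark is made up for illustration.

```javascript
// Hypothetical helper: true when the first code point of `ch` has the
// Unicode general category Mn (Nonspacing_Mark).
function isNonSpacingMark(ch) {
  return /^\p{Mn}/u.test(ch);
}

// U+0E01 (the Thai letter ก) followed by 20 stacked copies of U+0E49.
const sample = "\u0E01" + "\u0E49".repeat(20);

console.log(sample.charCodeAt(2).toString(16)); // "e49"
console.log(isNonSpacingMark(sample[2]));       // true
console.log(isNonSpacingMark(sample[0]));       // false, ก is a base letter
```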
In C# these characters fall under UnicodeCategory.NonSpacingMark, and the following script flushes them out:
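using System;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Text;
using NUnit.Framework;

[TestFixture]
public class ZalgoTests // wrapper class name is arbitrary
{
    [Test]
    public void IsZalgo()
    {
        // Categories to list; add more here to widen the search.
        var zalgo = new[] { UnicodeCategory.NonSpacingMark };
        var html = new StringBuilder("<table>");
        for (var i = 0; i <= 0xFFFF; i++) // every BMP code point, including U+FFFF
        {
            var c = (char)i;
            var category = Char.GetUnicodeCategory(c);
            if (!zalgo.Contains(category)) continue;
            // The last column stacks the mark three times on "A" to show whether it piles up.
            html.AppendFormat("<tr><td>{0}</td><td>{1}</td><td>{2}</td><td>A&#{3};&#{3};&#{3};</td></tr>\n", i.ToString("X"), c, category, i);
        }
        html.Append("</table>");
        // WriteAllText overwrites any output from a previous run.
        File.WriteAllText("IsModifyLike.html", html.ToString());
    }
}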
By looking at the generated table you should be able to see which ones stack.
One range that is missing on the wiki is 06D6-06DC; another is 0730-0749.
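If you would rather get the ranges as a list than eyeball a table, the same scan can be sketched in JavaScript. This assumes an engine that supports \p{Mn} (ES2018 or later); the function name nonSpacingMarkRanges is made up, and the exact output depends on the Unicode version the engine ships with.

```javascript
// Collect every BMP code point whose general category is Mn and merge
// consecutive code points into [start, end] pairs.
function nonSpacingMarkRanges() {
  const ranges = [];
  for (let cp = 0; cp <= 0xffff; cp++) {
    if (!/\p{Mn}/u.test(String.fromCodePoint(cp))) continue;
    const last = ranges[ranges.length - 1];
    if (last && last[1] === cp - 1) {
      last[1] = cp; // extends the current run
    } else {
      ranges.push([cp, cp]); // starts a new run
    }
  }
  return ranges;
}

// Print each range in the same hex notation used for the ranges above.
const hex = (n) => n.toString(16).toUpperCase().padStart(4, "0");
for (const [start, end] of nonSpacingMarkRanges()) {
  console.log(start === end ? hex(start) : `${hex(start)}-${hex(end)}`);
}
```

The printed ranges can be pasted into a character class, although that still lists every nonspacing mark rather than only the ones that visibly stack.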
UPDATE:
Here's an updated regex that should fish out all the zalgo, including the ones missed by the 'normal' ranges.
([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F\u0483-\u0486\u05C7\u0610-\u061A\u0656-\u065F\u0670\u06D6-\u06ED\u0711\u0730-\u073F\u0743-\u074A\u0F18-\u0F19\u0F35\u0F37\u0F72-\u0F73\u0F7A-\u0F81\u0F84\u0e00-\u0eff\uFC5E-\uFC62]{2,})
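For what it's worth, here is one way a pattern like this could be used to detect and clean up zalgo in JavaScript. Note the swap: instead of the hand-maintained ranges, this sketch matches on the \p{Mn} category directly, so it is not the same pattern as the one above. The {2,} threshold is carried over from it, and the names hasZalgo and collapseZalgo are made up.

```javascript
// Assumption: "zalgo" means a run of two or more nonspacing marks (Mn),
// mirroring the {2,} quantifier used in the hand-written pattern.
const ZALGO_RUN = /\p{Mn}{2,}/gu;

// Hypothetical helper: does the text contain at least one stacked run?
// A separate non-global regex avoids lastIndex state between calls.
function hasZalgo(text) {
  return /\p{Mn}{2,}/u.test(text);
}

// Hypothetical helper: keep the first mark of every run and drop the rest.
function collapseZalgo(text) {
  return text.replace(ZALGO_RUN, (run) => String.fromCodePoint(run.codePointAt(0)));
}

const zalgo = "\u0E01" + "\u0E49".repeat(20);
console.log(hasZalgo(zalgo));             // true
console.log(collapseZalgo(zalgo).length); // 2
console.log(hasZalgo("plain text"));      // false
```

A threshold of two is strict: some languages legitimately stack two marks on one letter (decomposed Vietnamese, for example), so a higher quantifier may suit real-world text better.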
The hardest bit is identifying them; once you have done that, there is a multitude of solutions, including some good ones above.
Hope this saves you some time.
How can we sanitize this?