OK, this one took me a while to figure out. I was under the impression that the combining characters used to produce zalgo were limited to the ones listed on the wiki, so I expected the following regex to catch the freaks.
And it didn't work...
The catch is that the list on the wiki does not cover the full range of combining characters.
What gave me a hint is "ก้้้้้้้้้้้้้้้้้้้้".charCodeAt(2).toString(16)
= "e49", which is not within the combining ranges on that list; U+0E49 is a tone mark in the Thai block.
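To make the gap concrete, here is a small sketch. The `narrow` pattern is only a stand-in I made up for "a regex built from the usual combining-diacritics block"; it is not the regex from the original attempt. The sample string is one Thai consonant followed by three copies of the U+0E49 tone mark.

```javascript
// Thai consonant KO KAI (U+0E01) followed by three MAI THO tone marks (U+0E49).
const sample = "\u0E01\u0E49\u0E49\u0E49";

// Hypothetical narrow pattern: only the Combining Diacritical Marks block (U+0300-U+036F).
const narrow = /[\u0300-\u036F]/;

// List every code point of the sample in hex.
const codePoints = Array.from(sample, ch => ch.codePointAt(0).toString(16));

console.log(codePoints);          // [ 'e01', 'e49', 'e49', 'e49' ]
console.log(narrow.test(sample)); // false: the Thai marks slip straight through
```

So any pattern that only knows about the "classic" combining blocks will let stacks like this one pass.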
In C# they fall under UnicodeCategory.NonSpacingMark,
and the following script flushes them out:
    // Needs: using System.Globalization; using System.IO; using System.Linq;
    public void IsZalgo()
    {
        var zalgo = new[] { UnicodeCategory.NonSpacingMark };
        File.AppendAllText("IsModifyLike.html", "<table>");
        for (var i = 0; i <= char.MaxValue; i++)
        {
            var c = (char)i;
            if (zalgo.Contains(Char.GetUnicodeCategory(c)))
            {
                // Columns: hex code, the character, its category, and the mark stacked three times on "A".
                File.AppendAllText("IsModifyLike.html", string.Format("<tr><td>{0}</td><td>{1}</td><td>{2}</td><td>A&#{3};&#{3};&#{3};</td></tr>\n", i.ToString("X"), c, Char.GetUnicodeCategory(c), i));
            }
        }
        File.AppendAllText("IsModifyLike.html", "</table>");
    }
By looking at the generated table you should be able to see which ones stack.
One range that is missing from the wiki is 06D6-06DC,
another is 0730-0749.
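If you would rather check ranges like these from JavaScript than eyeball an HTML table, something along these lines should work. It is a rough sketch that assumes an ES2018+ engine, where `\p{Mn}` (the Nonspacing_Mark category, the counterpart of C#'s NonSpacingMark) is available with the `u` flag; `nonspacingRanges` and `isCovered` are helper names I made up for illustration.

```javascript
// Matches exactly one code point in the Nonspacing_Mark category (needs the u flag).
const NONSPACING = /^\p{Mn}$/u;

// Hypothetical helper: collect contiguous [start, end] runs of nonspacing marks in the BMP.
function nonspacingRanges() {
  const ranges = [];
  let start = -1;
  // 0x10000 acts as a sentinel so a run ending at U+FFFF would still be closed.
  for (let cp = 0; cp <= 0x10000; cp++) {
    const isMark = cp <= 0xFFFF && NONSPACING.test(String.fromCodePoint(cp));
    if (isMark && start < 0) start = cp;
    if (!isMark && start >= 0) {
      ranges.push([start, cp - 1]);
      start = -1;
    }
  }
  return ranges;
}

// Hypothetical helper: is [from, to] fully inside one of the detected runs?
function isCovered(ranges, from, to) {
  return ranges.some(([a, b]) => a <= from && to <= b);
}

const ranges = nonspacingRanges();
console.log(isCovered(ranges, 0x06D6, 0x06DC)); // Arabic range mentioned above
console.log(isCovered(ranges, 0x0730, 0x0749)); // Syriac range mentioned above
```

The exact list depends on the Unicode version your engine ships with, so treat the output as a starting point for building a character class rather than a fixed table.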
Here's an updated regex that should fish out all the zalgo, including the ones that slip past the 'normal' range.
The hardest bit is identifying them; once you have done that, there is a multitude of solutions, including some good ones above.
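As one example of such a solution, here is a minimal sketch that caps how many nonspacing marks may follow each other instead of stripping them all, since removing every mark would also mangle legitimate Thai, Arabic or accented text. `limitMarks` and its default of 2 are my own assumptions, not something from the answers above, and it relies on `\p{Mn}` support (ES2018+).

```javascript
// Hypothetical sanitizer: keep at most maxMarks consecutive nonspacing marks.
function limitMarks(text, maxMarks = 2) {
  if (!Number.isInteger(maxMarks) || maxMarks < 0) {
    throw new RangeError("maxMarks must be a non-negative integer");
  }
  // Capture the first maxMarks marks of a run, then drop whatever marks follow them.
  const excess = new RegExp(`(\\p{Mn}{${maxMarks}})\\p{Mn}+`, "gu");
  return text.replace(excess, "$1");
}

const stacked = "\u0E01" + "\u0E49".repeat(20); // one Thai consonant, twenty tone marks
console.log(limitMarks(stacked, 1).length);     // 2: the consonant plus a single mark
console.log(limitMarks("e\u0301") === "e\u0301"); // true: ordinary accents are untouched
```

This only looks at category Mn, matching the NonSpacingMark check above; if enclosing or spacing marks turn out to matter for your input, `\p{M}` is the broader class to try.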
Hope this saves you some time.
How can we sanitize this?