How to protect against diacritics such as Zalgo text
[image of the character in question]

The character pictured above was tweeted a few months ago by Mikko Hyppönen, a computer security expert known for his work on computer viruses and TED talks on computer security. Out of respect for SO, I will only post an image of it, but you get the idea. It's obviously not something you'd want spreading around your website and freaking out visitors.

Upon further inspection, the character appears to be a letter of the Thai alphabet combined with over 87 diacritics (is there even a limit?!). This got me thinking about security, localization, and how one might handle this sort of input. My searching led me to this question on Stack Overflow, and in turn to a blog post from Michael Kaplan on stripping diacritics. In it, he demonstrates how one can decompose a string into its "base" characters (simplified here for the sake of brevity):

using System.Globalization;
using System.Text;

StringBuilder sb = new StringBuilder();
// Decompose into base characters plus combining marks (NFD),
// then keep everything that isn't a combining (non-spacing) mark.
foreach (char c in "façade".Normalize(NormalizationForm.FormD))
{
    if (char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
        sb.Append(c);
}
Response.Write(sb.ToString()); // facade

I can see how this would be useful in some cases, but applied to user input it would strip out ALL diacritics. As Kaplan points out, removing the diacritics in some languages can completely change the meaning of a word. That raises the question: how does one permit reasonable diacritics in user input/output, but exclude extreme cases such as Mikko Hyppönen's über-character?

Elbaelbart answered 15/8, 2012 at 23:51 Comment(3)
Whitelist through a static class/utility class? And it deserves to go on programmers.stackexchange.com. – Rudelson
@MonsterTruck, fair enough, but whitelist what exactly? These are Unicode characters I am talking about. – Elbaelbart
You could set a maximum number of diacritics per base character. Pick a value high enough so that Vietnamese and Greek are still okay, but low enough to reject the insane cases. – Saccharide

is there even a limit?!

Not intrinsically in Unicode. There is the concept of a 'Stream-Safe' format in UAX #15 that sets a limit of 30 combiners... Unicode strings in general are not guaranteed to be Stream-Safe, but this can certainly be taken as a sign that Unicode doesn't intend to standardise new characters that would require a grapheme cluster longer than that.

30 is still an awful lot. The longest known natural-language grapheme cluster is the Tibetan Hakṣhmalawarayaṁ at 1 base plus 8 combiners, so for now it would be reasonable to normalise to NFD and disallow any sequence of more than 8 combiners in a row.

If you only care about common Western European languages, you can probably bring that down to 2; a practical limit lies somewhere between those two.
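A minimal sketch of such a cap in C# (the limit of 8, the helper name, and the decision to count all three combining-mark categories are assumptions on my part, not anything Unicode mandates):

using System.Globalization;
using System.Text;

static class CombinerCap
{
    // Rejects input whose NFD form contains a run of more than
    // maxCombiners combining marks (Mn, Mc, or Me) in a row.
    public static bool IsReasonable(string input, int maxCombiners = 8)
    {
        int run = 0;
        foreach (char c in input.Normalize(NormalizationForm.FormD))
        {
            UnicodeCategory cat = char.GetUnicodeCategory(c);
            bool isMark = cat == UnicodeCategory.NonSpacingMark
                       || cat == UnicodeCategory.SpacingCombiningMark
                       || cat == UnicodeCategory.EnclosingMark;
            run = isMark ? run + 1 : 0;
            if (run > maxCombiners)
                return false;
        }
        return true;
    }
}

IsReasonable("façade") passes, while the character above fails as soon as its run of stacked marks exceeds the cap.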

Whang answered 16/8, 2012 at 8:38 Comment(0)

I think I found a solution, using NormalizationForm.FormC instead of NormalizationForm.FormD. According to MSDN:

[FormC] Indicates that a Unicode string is normalized using full canonical decomposition, followed by the replacement of sequences with their primary composites, if possible.

I take that to mean that it decomposes characters to their base forms, then recomposes them according to a consistent set of rules. I gather this is intended for comparison purposes, but in my case it works perfectly: characters like ü, é, and Ä are decomposed and recomposed accurately, while the bogus combining marks fail to recompose and so remain in decomposed form:

[screenshot of the normalization results]
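A sketch of that check in C# (HasLooseCombiningMarks is a name made up for illustration): after FormC normalization, any combining marks still present are the ones that failed to recompose, so their presence flags suspicious input.

using System.Globalization;
using System.Text;

// After FormC, precomposed characters such as ü, é, and Ä carry no
// combining marks; marks that failed to recompose remain standalone.
static bool HasLooseCombiningMarks(string input)
{
    foreach (char c in input.Normalize(NormalizationForm.FormC))
    {
        switch (char.GetUnicodeCategory(c))
        {
            case UnicodeCategory.NonSpacingMark:
            case UnicodeCategory.SpacingCombiningMark:
            case UnicodeCategory.EnclosingMark:
                return true;
        }
    }
    return false;
}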

Elbaelbart answered 16/8, 2012 at 8:34 Comment(3)
Requiring only composed characters is OK if you want to limit strings to historically commonly used characters - Unicode includes composed characters for all characters that were composed in a legacy encoding, for compatibility. However, new additions to Unicode may only be available in decomposed form. – Whang
Suggest checking for SpacingCombiningMark or EnclosingMark as well as NonSpacingMark, to catch the other combiners. Also, iterating over char walks UTF-16 code units, so you won't be able to check characters outside the Basic Multilingual Plane, for which you'll see only the surrogates. Suggest using a regex to find and replace character classes over the entire string at once (a sketch follows below). – Whang
Thanks for the info! If this only works on historically commonly used characters, then setting a cap of 2-8 combiners sounds like a much better solution! To further your point, this method reduces the Tibetan symbol down to ཧ. Try explaining that to a Tibetan monk! – Elbaelbart
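A sketch of that regex suggestion in C#, reusing the 8-combiner cap from the earlier answer (note that .NET's Regex also works over UTF-16 code units, so combining marks outside the BMP remain a gap here too):

using System.Text;
using System.Text.RegularExpressions;

static class ZalgoFilter
{
    // In any run of combining marks (Mn, Mc, Me), keep the first 8
    // and strip the rest, over the whole string at once.
    static readonly Regex ExcessMarks = new Regex(
        @"([\p{Mn}\p{Mc}\p{Me}]{8})[\p{Mn}\p{Mc}\p{Me}]+",
        RegexOptions.Compiled);

    public static string Trim(string input) =>
        ExcessMarks.Replace(input.Normalize(NormalizationForm.FormD), "$1");
}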

Here's a regex that should fish out all the Zalgo, including combining characters that slip past the 'normal' combining-diacritics range.

([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F\u0483-\u0486\u05C7\u0610-\u061A\u0656-\u065F\u0670\u06D6-\u06ED\u0711\u0730-\u073F\u0743-\u074A\u0F18-\u0F19\u0F35\u0F37\u0F72-\u0F73\u0F7A-\u0F81\u0F84\u0e00-\u0eff\uFC5E-\uFC62]{2,})

The hardest part is identifying them; once you have done that, there's a multitude of solutions.
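For instance, wired into a simple check in C#, with the pattern copied verbatim from above (one caveat: the \u0e00-\u0eff range spans the entire Thai and Lao blocks, so ordinary Thai or Lao text can also match):

using System.Text.RegularExpressions;

// True when the input contains two or more consecutive characters
// from the combining ranges listed in the pattern above.
static bool LooksLikeZalgo(string input) =>
    Regex.IsMatch(input,
        @"([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F" +
        @"\u0483-\u0486\u05C7\u0610-\u061A\u0656-\u065F\u0670\u06D6-\u06ED" +
        @"\u0711\u0730-\u073F\u0743-\u074A\u0F18-\u0F19\u0F35\u0F37\u0F72-\u0F73" +
        @"\u0F7A-\u0F81\u0F84\u0e00-\u0eff\uFC5E-\uFC62]{2,})");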

Hope this saves you some time.

Buffum answered 20/4, 2016 at 11:38 Comment(0)
