How can Z͎̠͗ͣḁ̵͙̑l͖͙̫̲̉̃ͦ̾͊ͬ̀g͔̤̞͓̐̓̒̽o͓̳͇̔ͥ text be prevented?
Asked Answered
S

6

23

I've read about how Zalgo text works, and I'm looking to learn how a chat or forum software could prevent that kind of annoyance. More precisely, what is the complete set of Unicode combining characters that needs to:

a) either be stripped, assuming chat participants are to use only languages that don't require combining marks (i.e. you could write "fiancé" with a combining mark, but you'd be a bit Zalgo'ed yourself if you insisted on doing so); or,

b) be reduced to a maximum of 8 consecutive characters (the maximum encountered in actual languages)? (A rough sketch of this option follows below.)
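For concreteness, option (b) could look roughly like this sketch (plain Python using unicodedata; the mark categories and the cutoff of 8 are just the assumptions stated above, not a vetted linguistic rule):

import unicodedata

MAX_CONSECUTIVE_MARKS = 8

def limit_marks(text):
    # Decompose first so precomposed characters also count toward the run length.
    out, run = [], 0
    for c in unicodedata.normalize('NFD', text):
        if unicodedata.category(c).startswith('M'):  # Mn, Mc or Me
            run += 1
            if run > MAX_CONSECUTIVE_MARKS:
                continue  # drop marks beyond the cutoff
        else:
            run = 0
        out.append(c)
    return unicodedata.normalize('NFC', u''.join(out))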

EDIT: In the meantime I found a completely differently phrased question ("How to protect against... diacritics?"), which is essentially the same as this one. I made its title more explicit so others will find it as well.

Saintly answered 9/3, 2014 at 0:47 Comment(18)
why should a chat or forum software prevent vertical rubbish automatically, when it cannot do the same with horizontal rubbish?Nihil
Y̒̌͛́̓̀͊ͫ͌ͦo͊͂ͤ̊̒̆͊ͪ̋ͯͥ͌ͧ͑̑̂͐͗̏u̇̽̿͋̋́̅̐̄ͮ̿͆̚r͊ͥͣ͂̑ͩ̒̑̋̊̅ ͬ̔̍̾̓ͩ̇͒ͯ͗͐͐ͧ̍͊̚c͋̈́̂̽ͬ͒͊ͣͤ͊̋͛̿͒̚̚oͩͫ͛̂̄̐̽̑ͬ͑̍̃ͯm̉̈́̾ͨ̆̊ͨͪ͌mͫ̾͋ͨͤ̈́͑́͐́eͮ͐̍̌ͬ͛̃̃̿ͪ̌͂n̊͋ͫ͆t̊͊ͪ̌́͆ ̎̔̉ͮ̋̋͐̐ͮ͛̈̆̉̈́ͣ̎̐̏̚i͆̌͆̃̾̽ͥ̎͊́s̑̌̓̆͊́ͦ͆̍̇̌̀̈̓̈́ͪ̚ ̍̀͌ͩͮ́̿́̓̈́̍ͣ̔ṁ̋̑̉ͤi̒̌̿̔ͣ̇͐ͭͫͬ̎͊ͬ͊̓s͗̽ͦ̄͋ͤ͆͊ͬ̈́̂̌ͦ͒̈́̓ͪ̏gͣ͆̃͛ͨͩ̚u͆͆̄ͬ̍ͯiͬͩ̎̑d̍ͩ̐ͫ̍e͗ͪ̀ͥͨ̀͌̒ͦͩͣ̎ͯ͂̔ͤd̆ͭ͆.̑̃͂̆̀̈́̽ͭ̂ͮ̓ If that was not demo enough, here's why: 1) a crap regular comment affects only itself, while a Zalgo one affects others. 2) Because it is possible to automatically filter out Zalgo, while automatically filtering out low-quality comments requires developing general AI.Saintly
possible duplicate of What's up with these Unicode characters?Ardoin
The question does not describe a programming problem. Rather, it asks “what should I set as the goal in programming when the purpose is to avoid annoying me/others with Zalgo?” Besides, it’s a fairly broad question. Which of the world’s 6,000 languages do you intend to consider, and do you think it’s OK to filter out characters in English text written in Normalization Form D? (E.g., “fiancé” can validly be written using a combining mark, though it usually isn’t.)Freya
@JukkaK.Korpela: Yes, it's a programming question because one of the answers is "Use the strip-combining-marks library." Yes, I think it's OK to filter out these characters from English text - nobody writes "fiancé" using them. PS: I'm the one who fought to have your answer to "How does Zalgo text work" recognized as the correct one, so you're welcome.Saintly
Edited the title of the "What's up with these Unicode characters" question to something search queries would actually find, and voted to close my own question. To the two folks who voted to close because "it was unclear what I was asking" - if you don't understand the question, maybe it's not in your field and you should rather choose "Skip"? Plenty of people understood what I was asking.Saintly
@DanDascalescu I vote to keep open, if only for your enlightening demonstrative comment. Finding this kind of thing puts a smile on my face and that's worth more to me than normalizing SO.Acetylide
@iwein: you have restored my faith in StackOverflow, after experiences like these.Saintly
@hivert & whoever else didn't understand the question: as you can see, there are plenty of comments and even answers from people who did understand it. I've further edited the question for clarity and precision.Saintly
I really think this should be reopened. This question can be answered with code and I posted some code as an answer.Lynnet
@nwk: unfortunately, those who get to decide whether to reopen know much less than you do about the topic at hand. They just happen to have more points. That's just how StackOverflow works.Saintly
Code can be posted as answers, but this does not make it a programming question. It is a design question, as it leaves it open what should really be done. The question has now been edited to be somewhat more specific, partly on arbitrary grounds. But if the question is reformulated as a specific question, with a specification for what the program should do, it should be posted as a new question, tagged with the programming language(s) that would be used, and containing code written so far, with an explanation of why it’s unsatisfactory.Freya
@JukkaK.Korpela: so I should go to all the trouble to post a new question, only to have it closed again by you or some others, based on grounds you can always dig from your enormous rule book? No, thanks.Saintly
@JukkaK.Korpela: I don't know whether to agree in this case (I think there's a design and a programming question in this post, the most of the design part being in the first line) but I see your point about design. What would be the right Stack Exchange for this question as it stands and questions like it, Programmers?Lynnet
@nwk: I fail to see how this would be a design question. I'm asking about a character set. I've even added "JavaScript" as a tag to appease Jukka (whose Unicode work I greatly respect, BTW, and have been aware of since 2004), but the point is that I think we're looking for nothing more than a regexp character class.Saintly
@DanDascalescu: I should clarify: what sounds designy to me is the sentence ending with "how can a chat or forum software prevent that kind of annoyance?", not the rest of your question.Lynnet
You cannot prevent Zalgo... Ḧ̛̪̠́̌ͦ̔̄̐̓͗ͭ̒̀͗́̚ͅE̻̪͇͓͓͖͕̖͓̘͚̰̺͔̻̬͙͑͂̑ͫͧ̊̏ͨ͛ͯ̅̋͑ͤͤ̅̒͘͞ͅͅ ̧̢̡̩̥̯̤͚̤͍͓͙̳̞̦̓̓̇ͧ̎̐̓ͤ̀͜ͅC̦̫̗̠̝̅̀ͨ̊̕͝͝ͅŌ̷̝̝̰̞͓͎̫̖͚̲̟̽ͫ́͛̋̍̒ͦ̊̂̈ͤ͆͒͞ͅṂ̴̠̠̜̣̹ͥ̓̇͐̇ͬͣ͆̆̈́̚͡͝Ē̵̳̞̝̙͕ͬ͒ͮ̀͑͊̎͑̔̀̕͜͞Ş̶̡̛̠̠͙̱̣̝͔̻̻̩̬ͮ͑̀̒͂̐̑̋̚͘Abduct
In the context of an HTML page, a simpler solution than trying to filter out certain combining diacritical marks is to use the CSS property overflow: hidden. For example, if I inspect the td.comment-text elements on this page and add that style, they no longer visually overflow onto other comments.Shepp
L
19

Assuming you're very serious about this and want a technical solution, you could do as follows:

  1. Split the incoming text into smaller units (words or sentences);
  2. Render each unit on the server with your font of choice (with a huge line height and lots of space below the baseline where the Zalgo "noise" would go);
  3. Train a machine learning algorithm to judge if it looks too "dark" and "busy";
  4. If the algorithm's confidence is low, defer to human moderators.

This could be fun to implement (a rough sketch follows), but in practice it would likely be better to go to step four straight away.
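For illustration, here's a very rough sketch of steps 2 and 3, using Pillow and a plain "ink density" measure in place of a trained model (the font path, canvas size, and threshold are all assumptions to be tuned):

from PIL import Image, ImageDraw, ImageFont

def looks_too_busy(word, font_path="DejaVuSans.ttf", threshold=0.25):
    # Render the word on a tall white canvas so stacked marks have room to land.
    font = ImageFont.truetype(font_path, 32)
    image = Image.new("L", (32 * max(len(word), 1), 128), 255)
    ImageDraw.Draw(image).text((0, 48), word, font=font, fill=0)
    # Fraction of dark pixels; Zalgo'ed words blacken far more of the canvas.
    pixels = list(image.getdata())
    dark = sum(1 for p in pixels if p < 128)
    return dark / float(len(pixels)) > threshold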

Edit: Here's a more practical, if blunt, solution in Python 2.7. Unicode characters classified as "Mark, nonspacing" and "Mark, enclosing" appear to be the main tools used to create the Zalgo effect. Unlike the idea above, this won't try to determine the "aesthetics" of the text but will simply remove all such characters. (Needless to say, this will trash text in many, many languages. Read on for a better solution.) To filter out more character categories, add them to ZALGO_CHAR_CATEGORIES.

#!/usr/bin/env python
import unicodedata
import codecs

ZALGO_CHAR_CATEGORIES = ['Mn', 'Me']

# Decompose each line to NFD, then drop every character whose Unicode
# category is listed in ZALGO_CHAR_CATEGORIES before printing it back out.
with codecs.open("zalgo", 'r', 'utf-8') as infile:
    for line in infile:
        print ''.join([c for c in unicodedata.normalize('NFD', line) if unicodedata.category(c) not in ZALGO_CHAR_CATEGORIES]),

Example input:

1
H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡
2
H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡
3

Output:

1
How does Zalgo text work?
2
How does Zalgo text work?
3

Finally, if you're looking to detect, rather than unconditionally remove, Zalgo text, you could perform character frequency analysis. The program below does that for each line of the input file. The is_zalgo function calculates a "Zalgo score" for each word of the string it is given (the score is the number of potential Zalgo characters divided by the total number of characters). It then checks whether the third quartile of the word scores exceeds THRESHOLD. With THRESHOLD at 0.5, this amounts to asking whether at least one word in four is more than 50% Zalgo characters. (The value of 0.5 was guessed and may need adjustment for real-world use.) This type of algorithm is probably the best in terms of payoff versus coding effort.

#!/usr/bin/env python
from __future__ import division
import unicodedata
import codecs
import numpy

ZALGO_CHAR_CATEGORIES = ['Mn', 'Me']
THRESHOLD = 0.5
DEBUG = True

def is_zalgo(s):
    word_scores = []
    for word in s.split():
        cats = [unicodedata.category(c) for c in word]
        # Fraction of the word's characters that fall in the banned mark categories.
        score = sum([cats.count(banned) for banned in ZALGO_CHAR_CATEGORIES]) / len(word)
        word_scores.append(score)
    if not word_scores:  # empty or whitespace-only line
        return False
    total_score = numpy.percentile(word_scores, 75)
    if DEBUG:
        print total_score
    return total_score > THRESHOLD

with codecs.open("zalgo", 'r', 'utf-8') as infile:
    for line in infile:
        print is_zalgo(unicodedata.normalize('NFD', line)), "\t", line

Sample output:

0.911483990148
True    Señor, could you or your fiancé explain, H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡

0.333333333333
False   Příliš žluťoučký kůň úpěl ďábelské ódy.  
Lynnet answered 9/3, 2014 at 8:49 Comment(5)
Appreciate the elaborate solution, but I was looking for a simple character range regular expression, or a library like strip-combining-marks.Saintly
I wasn't quite sure how serious you were about looking for a solution (i.e., whether you wanted something that's fun to play with vs. something you could plug into a forum today). I implemented two more practical solutions in Python; it was a fun little bit of research to figure this stuff out. Since this question is on hold right now I can't add my code as a separate answer, so I added it here.Lynnet
I have (professionally) come across international text VALIDLY containing characters belonging to the two character classes you are banning, and please be aware that a word in CJK easily consists of a SINGLE character (and also that in several languages words may NOT be separated by non-word characters).Nihil
@WalterTross: "Banned" is a misnomer in the case of the second code snippet because it doesn't actually ban those marks. I'll change that.Lynnet
@DanDascalescu Given that Regex is one of the ways in which Zalgo texts were generated, I would advise against trying so....https://mcmap.net/q/17499/-regex-match-open-tags-except-xhtml-self-contained-tagsDoughnut
I
13

Make the box overflow: hidden. It doesn't actually disable Zalgo text, but it prevents it from damaging other comments.

.comment {
  /* the overflow: hidden is what prevents one comment's combining marks from affecting its siblings */
  overflow: hidden;
  /* the padding gives space for any legitimate combining marks */
  padding: 0.5em;
  /* the rest are just to visually divide the three comments */
  border: solid 1px #ccc;
  margin-top: -1px;
  margin-bottom: -1px;
}
<div class=comment>The below comment looks awful.</div>
<div class=comment>H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡</div>
<div class=comment>The above comment looks awful.</div>
Idaho answered 7/4, 2017 at 20:10 Comment(2)
Highly practical suggestion. Validation measures such as ''.join((c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn')) are resource intensive and the opposite of subtle.Seeseebeck
I think you mean "awful".Anthelion
A
6

A related question was asked before: https://stackoverflow.com/questions/5073191/how-is-zalgo-text-implemented but it's interesting to go into prevention here.

In terms of preventing this you can choose several strategies:

  1. prevent combining diacritics entirely (and piss off many international users),
  2. filter out combining characters using whitelisting or blacklisting, as sketched below (and piss off a smaller percentage of international users)
  3. allow only a certain number of consecutive combining characters (and piss off an even smaller percentage of users)
  4. have a healthy moderator community (with all the downsides that has, see your question as an example here)
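As a rough sketch of option 2, a whitelist-based filter could keep only an explicit set of allowed combining marks and drop everything else (the whitelist below is an assumption; extend it for the languages you actually expect):

import unicodedata

ALLOWED_MARKS = {
    u'\u0300',  # combining grave accent
    u'\u0301',  # combining acute accent
    u'\u0302',  # combining circumflex accent
    u'\u0303',  # combining tilde
    u'\u0308',  # combining diaeresis
    u'\u0327',  # combining cedilla
}

def whitelist_marks(text):
    decomposed = unicodedata.normalize('NFD', text)
    kept = [c for c in decomposed
            if not unicodedata.category(c).startswith('M') or c in ALLOWED_MARKS]
    return unicodedata.normalize('NFC', u''.join(kept))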
Acetylide answered 9/3, 2014 at 8:24 Comment(3)
"with all the downsides that has, see your question as an example here" - priceless :)Saintly
The smallest unit of text that is usually zalgoed is a line. Rather than the absolute number of combining characters you could look at their density (percentage) in each line.Lynnet
@Lynnet good trick, but I was thinking to disallow successive combining characters (meaning you can only reach a certain height/depth)Acetylide
S
4

You can get rid of Zalgo text in your application using strip-combining-marks by Mathias Bynens.

The module strip-combining-marks is available for browsers (via Bower) and Node.js applications (via npm).

Here is an example of how to use it with npm:

var stripCombiningMarks = require("strip-combining-marks");
var zalgoText = 'U̼̥̻̮͍͖n͠i͏c̯̮o̬̝̠͉̤d͖͟e̫̟̗͟ͅ';
var strippedText = stripCombiningMarks(zalgoText); // "Unicode"
Sport answered 14/3, 2017 at 16:14 Comment(2)
For anyone coming here via Google, be aware that strip-combining-marks will trash some valid emojis. It turns out the blue and white number emojis use combining marks... emojipedia.org/keycap-digit-oneRocher
This could also ruin other valid Unicode characters that use combining marks. Quoth the Unicode FAQ, "...unless a precomposed character is used, it is encoded as U+0301 COMBINING ACUTE ACCENT. Similarly, the U+0308 COMBINING DIAERESIS may be used for diaeresis, trema, umlaut, as well as other, possibly unrelated uses."Odisodium
S
2

Using PHP and the mindset of a demolition worker, you can get rid of the Zalgo with the iconv function. Of course, it will also discard every other character that doesn't fit into ISO-8859-1.

$unZalgoText = iconv("UTF-8", "ISO-8859-1//IGNORE", $zalgoText);
Sharronsharyl answered 13/2, 2018 at 19:21 Comment(0)
H
1

Using RegExp to limit excessive combining marks

function removeExcessiveMarks(string) {
  return string.replaceAll(/([\p{Mc}\p{Me}\p{Mn}]{2})[\p{Mc}\p{Me}\p{Mn}]+/gu, "$1");
}

Result

removeExcessiveMarks("Z͎̠͗ͣḁ̵͙̑l͖͙̫̲̉̃ͦ̾͊ͬ̀g͔̤̞͓̐̓̒̽o͓̳͇̔ͥ"); // Z͗ͣȃ̵l̉̃g̐̓o̔ͥ

Some unit tests

it("Should work with a zalgo text", () => {
  expect(removeExcessiveMarks("Z̸a̸͆l̸͆͐g̸͆͐̓o̸͆͐̓̈́")).toBe("Z̸a̸͆l̸͆g̸͆o̸͆");
});

it("Should work with arabic letter beh", () => {
  expect(removeExcessiveMarks("بٍٍّ")).toBe("بٍّ");
  expect(removeExcessiveMarks("ب\u0651\u0650\u0652\u0650")).toBe("ب\u0651\u0650");
});

it('Should work with "e" combined with 3 accents (total 3 accents)', () => {
  // Combining "e" with Grave Accent (U+0300), Acute Accent (U+0301) and Tilde (U+0303)
  expect(removeExcessiveMarks("e\u0300\u0301\u0303")).toBe("e\u0300\u0301");
});

it('Should work with "è" combined with 2 accents (total 3 accents)', () => {
  // Combining "è" with Acute Accent (U+0301) and Tilde (U+0303)
  expect(removeExcessiveMarks("è\u0301\u0303")).toBe("è\u0301\u0303");
});

Explanation

This regex handles combining characters defined by the "Mark" Unicode general category. These are characters that typically modify the preceding base character:

  • Mn (Non-Spacing): Usually used for accents and diacritics. Example: è (e + U+0300)
  • Mc (Spacing Combining): Usually used for vowel signs that take up horizontal space of their own. Example: Devanagari vowel sign AA in का (क + U+093E)
  • Me (Enclosing): Marks that surround the base character. Example: Combining Enclosing Circle a⃝ (a + U+20DD)

In some languages and writing systems, multiple diacritics are combined to accurately represent sounds and pronunciation. Although Unicode places no limit on how many marks can be stacked on a base character, more than 2 consecutive marks are rarely legitimate in real-world text. Feel free to change the regex quantifier if you need to allow more than 2.

Note that emoji modifiers are kept as they belong to the Sk (Symbol, Modifier) category.

Unicode normalization

For more end-user consistency, I would recommend applying the regex to the decomposed form (NFD) and then recomposing with NFC:

function removeExcessiveMarks(string) {
  return string
    .normalize("NFD") // Decompose
    .replaceAll(/([\p{Mc}\p{Me}\p{Mn}]{2})[\p{Mc}\p{Me}\p{Mn}]+/gu, "$1")
    .normalize("NFC"); // Recompose
}
it('Should work with "e" combined with 3 accents (total 3 accents)', () => {
  // Combining "e" with Grave Accent (U+0300), Acute Accent (U+0301) and Tilde (U+0303)
  expect(removeExcessiveMarks("e\u0300\u0301\u0303")).toBe("è\u0301");
});

it('Should work with "è" combined with 2 accents (total 3 accents)', () => {
  // Combining "è" with Acute Accent (U+0301) and Tilde (U+0303)
  expect(removeExcessiveMarks("è\u0301\u0303")).toBe("è\u0301");
});
Hasson answered 10/7 at 7:39 Comment(0)
