Regular expression preg_quote symbols are not detected
I have a dictionary of swear words in the database, and the following works great

preg_match_all("/\b".$f."(?:ing|er|es|s)?\b/si",$t,$m,PREG_SET_ORDER);

$t is the input text and $f = preg_quote("punk"), where "punk" comes from the database dictionary, so at this point in the loop the expression is as follows:

preg_match_all("/\bpunk(?:ing|er|es|s)?\b/si",$t,$m,PREG_SET_ORDER);

preg_quote escapes special symbols, e.g. # becomes \#, so that the expression stays valid. But when the dictionary word is e.g. "F@CK" or "A$$", the word is not detected in the input string with the above expression. I have both a$$ and f@ck in the dictionary, but they do not work. If I remove preg_quote() on the word, the regular expression is invalid because these symbols are not escaped.

Any suggestions on how I can detect "a$$" ???

Edit:

So I guess the expression that is not working as intended would be eg.

preg_match_all("/\bf\@ck(?:ing|er|es|s)?\b/si",$t,$m,PREG_SET_ORDER);

Which should find f@ck in $t

UPDATE:

This is my usage, simply put: if there are matches in $m, replace them with "****". This whole block is inside a loop over each word in the dictionary; $f is the dictionary word and $t is the input.

$f = preg_quote($f);
preg_match_all("/\b$f(?:ing|er|es|s)?\b/si",$t,$m,PREG_SET_ORDER);
if (count($m) > 0) {
     $t = preg_replace("/(\b$f(?:ing|er|es|s)?\b)/si","*****",$t);
}

UPDATE: Behold, the var_dump:

preg_quote($f) = string(5) "a\$\$"
$t = string(18) "You're such an a$$"
expression = string(29) "/\ba\$\$(?:ing|er|es|s)?\b/si"

UPDATE: This is only happening when words end with a symbol. I tested "a$$hole" and it’s fine, but "a$$" doesn't work.

ANOTHER UPDATE: Try this simplified version, $words being a makeshift dictionary

// single quotes, so PHP does not try to interpolate "$hole" as a variable
$words = array('a$$', 'asshole', 'a$$hole', 'f@ck', 'f#ck', 'f*ck');
$text = 'Input whatever you feel like here eg. a$$';

foreach ($words as $f) {
   $mask = str_repeat("*", strlen($f));   // mask length taken from the unescaped word
   $f = preg_quote($f, "/");
   $text = preg_replace("/\b".$f."(?:ing|er|es|s)?\b/si", $mask, $text);
}

I should expect to see "Input whatever you feel like here eg. ***" as a result.

Juan answered 23/5, 2011 at 11:37 Comment(15)
Can you include how you are using preg_quote() in your example code? - Halette
$f = preg_quote($f); like that :) - Juan
Your code works fine for me. Can you show us the string you're testing it on? Or maybe show us the whole dictionary cycle code; maybe the problem isn't in preg_match_all and preg_quote... - Anglican
Reminds me of the Scunthorpe problem. - Misdeed
@Misdeed Yeah, not worried about incorrectly finding profanity in e.g. "assess"; that's totally the client's problem, I just need it to work :) Besides, the \b makes sure we're talking about full words here. - Juan
Don't forget to set the second param for preg_quote, the delimiter character. In your case, '/'. - Miniature
Thanks @skippychalmers, didn't know about that but it hasn't helped :) - Juan
@Prof83 pants. Hmm. Why are you using preg_match and preg_replace? Can you not just use preg_replace and compare strings before and after to determine if anything was matched? - Miniature
Var_dump out this: "/\b$f(?:ing|er|es|s)?\b/si" after you've preg_quote'd $f. - Miniature
...and var_dump your $t too. - Anglican
@Skippy, because the block of code does a lot more :) and I felt it unnecessary to post it all, so pretend it's just preg_replace. - Juan
This is only happening when words end with a symbol. I tested "a$$hole" and it's fine, but "a$$" doesn't work. - Juan
This is not possible. See my answer for why. Ass for "a$$\b" not working, remember that that is asserting that the dollar sign has a word character following it. - Odilo
@Prof83 var dump the preg_quote output. - Miniature
You'd be better off training a Bayesian filter to categorise postings as "good" or "bad" based on the words and characters in the posting. Then make it so bad postings don't instantly get posted but require a review. Use of unusual Unicode characters would then end up being flagged as likely bad postings. - Result

Cannot Be Done

I'm sorry, but this “problem” is truly impossible to solve. Consider these:

  • ꜰᴜᴄᴋ   is U+A730.1D1C.1D04.1D0B, "\N{LATIN LETTER SMALL CAPITAL F}\N{LATIN LETTER SMALL CAPITAL U}\N{LATIN LETTER SMALL CAPITAL C}\N{LATIN LETTER SMALL CAPITAL K}"
  • ᶠᵘᶜᵏ   is U+1DA0.1D58.1D9C.1D4F, "\N{MODIFIER LETTER SMALL F}\N{MODIFIER LETTER SMALL U}\N{MODIFIER LETTER SMALL C}\N{MODIFIER LETTER SMALL K}"
  • 𝒻𝓊𝒸𝓀   is U+1D4BB.1D4CA.1D4B8.1D4C0, "\N{MATHEMATICAL SCRIPT SMALL F}\N{MATHEMATICAL SCRIPT SMALL U}\N{MATHEMATICAL SCRIPT SMALL C}\N{MATHEMATICAL SCRIPT SMALL K}"
  • 𝖋𝖚𝖈𝖐   is U+1D58B.1D59A.1D588.1D590, "\N{MATHEMATICAL BOLD FRAKTUR SMALL F}\N{MATHEMATICAL BOLD FRAKTUR SMALL U}\N{MATHEMATICAL BOLD FRAKTUR SMALL C}\N{MATHEMATICAL BOLD FRAKTUR SMALL K}"
  • 𝓕 𝒰 𝒞 𝒦   is U+1D4D5.1D4B0.1D49E.1D4A6, "\N{MATHEMATICAL BOLD SCRIPT CAPITAL F}\N{MATHEMATICAL SCRIPT CAPITAL U}\N{MATHEMATICAL SCRIPT CAPITAL C}\N{MATHEMATICAL SCRIPT CAPITAL K}"
  • ⓕ ⓤ ⓒ ⓚ   is U+24D5.24E4.24D2.24DA, "\N{CIRCLED LATIN SMALL LETTER F}\N{CIRCLED LATIN SMALL LETTER U}\N{CIRCLED LATIN SMALL LETTER C}\N{CIRCLED LATIN SMALL LETTER K}"
  • Γ̵𐌵ᏟᏦ   is U+393.335.10335.13DF.13E6, "\N{GREEK CAPITAL LETTER GAMMA}\N{COMBINING SHORT STROKE OVERLAY}\N{GOTHIC LETTER QAIRTHRA}\N{CHEROKEE LETTER TLI}\N{CHEROKEE LETTER TSO}"
  • ƒμɕѤ   is U+192.3BC.255.464, "\N{LATIN SMALL LETTER F WITH HOOK}\N{GREEK SMALL LETTER MU}\N{LATIN SMALL LETTER C WITH CURL}\N{CYRILLIC CAPITAL LETTER IOTIFIED E}"
  • Г̵ЦСК   is U+413.335.426.421.41A, "\N{CYRILLIC CAPITAL LETTER GHE}\N{COMBINING SHORT STROKE OVERLAY}\N{CYRILLIC CAPITAL LETTER TSE}\N{CYRILLIC CAPITAL LETTER ES}\N{CYRILLIC CAPITAL LETTER KA}"
  • ғᵾȼƙ   is U+493.1D7E.23C.199, "\N{CYRILLIC SMALL LETTER GHE WITH STROKE}\N{LATIN SMALL CAPITAL LETTER U WITH STROKE}\N{LATIN SMALL LETTER C WITH STROKE}\N{LATIN SMALL LETTER K WITH HOOK}"
  • ϜυϚΚ   is U+3DC.3C5.3DA.39A, "\N{GREEK LETTER DIGAMMA}\N{GREEK SMALL LETTER UPSILON}\N{GREEK LETTER STIGMA}\N{GREEK CAPITAL LETTER KAPPA}"
  • ЖↃUᆿ   is U+416.2183.55.11BF, "\N{CYRILLIC CAPITAL LETTER ZHE}\N{ROMAN NUMERAL REVERSED ONE HUNDRED}\N{LATIN CAPITAL LETTER U}\N{HANGUL JONGSEONG KHIEUKH}"
  • ʞɔnɟ   is U+29E.254.6E.25F, "\N{LATIN SMALL LETTER TURNED K}\N{LATIN SMALL LETTER OPEN O}\N{LATIN SMALL LETTER N}\N{LATIN SMALL LETTER DOTLESS J WITH STROKE}"
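
To make the point concrete, here is a minimal sketch (not part of the original answer) that runs the question's dictionary pattern against a few of the spellings listed above. The code points are copied from that list; the sample sentence, the variable names, and the use of PHP 7+ "\u{...}" escapes are assumptions made for illustration.

```php
<?php
// Code points copied from the list above (requires PHP 7+ for \u{...}).
$variants = [
    "\u{A730}\u{1D1C}\u{1D04}\u{1D0B}",     // small capitals
    "\u{1DA0}\u{1D58}\u{1D9C}\u{1D4F}",     // modifier letters
    "\u{1D4BB}\u{1D4CA}\u{1D4B8}\u{1D4C0}", // mathematical script
    "\u{24D5}\u{24E4}\u{24D2}\u{24DA}",     // circled letters
    "\u{192}\u{3BC}\u{255}\u{464}",         // mixed look-alikes
];

// Same shape as the question's pattern, with /u so the input is read as UTF-8.
$pattern = '/\bfuck(?:ing|er|es|s)?\b/iu';

$caught = 0;
foreach ($variants as $v) {
    // preg_match returns 1 on a match and 0 otherwise
    $caught += preg_match($pattern, "oh $v off");
}

echo "$caught of ", count($variants), " variants caught\n";
```

None of the five spellings is caught, because a literal dictionary pattern only matches the exact code points it was written with.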

It Gets Worse

And if you think those are easy, just try coping with all of these:

𝓕 00 Ↄ ʞ, F ᵾ ⒞ K, K ⓒ Ц ⒡ , 𝖋 𝖀 K 𝒸, ғ ∞ Ϛ k, f 𝓊 Ꮯ K, ⓕ oo ɔ ⓚ , ɟ ⒰ ¢ K, 𝒻 𝖚 ȼ 𝖐, 𝕱 Ù ȼ ⒦ , f 𝒰 ⒞ ƙ, F 𐌵 ᶜ 𝕶, F ∞ 𝒞 Ж , 𝕱 @ Ꮯ 𝓀, ɟ ᵘ 𝒞 𝕶, F Ц ¢ 𝒦, f oo Ꮯ ʞ, 𝕱 oo ¢ Ж , 𝕱 υ ᶜ Κ , Ϝ ú * ʞ, ꜰ 𝖚 c K, ƒ ᵘ ȼ k, 𝖋 U ȼ 𝕶, Ж ɔ μ ƒ, F ⓤ ⒞ k, ƒ 𝖚 C ƙ, ғ 00 ɔ Ѥ, ƒ U c ᴋ, 𝕱 ∞ Ꮶ ⓒ , ꜰ 𝓊 ᴄ ⒦ , 𝕱 ⒰ Ꮯ Ѥ, ꜰ ᴜ 𝒞 ⒦ , F 𝒰 𝖈 ʞ, f 00 𝖈 𝓀, ғ u С K, f 𐌵 ɔ Κ , f μ Ↄ K, ɟ 𝖚 c ʞ, f 𝖚 Ↄ 𝖐, F μ ¢ 𝓀, ᆿ 𝖀 ᴄ ⒦ , Κ ¢ oo ɟ, ᶠ μ ᶜ Ѥ, ᶠ ⓤ Ꮯ Ж , 𝒦 ⒞ ᵘ F, F @ C ⓚ , Ѥ ᴄ u F, ⒡ ᵾ C k, ƒ μ ᶜ ᴋ, F 𝒰 C 𝓀, f ᵘ ¢ ᵏ, ᆿ 00 𝒸 𝕶, ꜰ υ ȼ K, Ϝ 𝓊 ȼ К , 𝕱 oo ɕ ᴋ, ғ 𝒰 Ꮯ ᴋ, ꜰ n 𝒸 K, ꜰ μ Ϛ К , F ∞ ȼ 𝖐, ⒡ 𐌵 Ↄ Κ , ƒ 𝖚 ⒞ 𝒦, ᶠ U C Ꮶ, ᶠ υ Ↄ ƙ, 𝓕 𝓊 C 𝓀, Ϝ U 𝒸 Ѥ, Ϝ U Ↄ 𝓀, 𝒻 U ⒞ ᵏ, F @ C К , ғ ᴜ 𝖈 ᴋ, ⒡ U 𝒸 К , ɟ U * ᵏ, 𝖋 Ц c Κ , ғ U Ↄ 𝕶, ƒ ⒰ 𝒞 ᵏ, ғ 𝖚 * K, 𝖋 n 𝕮 ⓚ , ᶠ 00 С К , 𝖋 Ц 𝒞 k, ƙ c Ц ᶠ, 𝕱 ⒰ Ѥ 𝖈, ꜰ ǔ ᴄ ⒦ , F 𝒰 Ↄ 𝓀, 𝒦 𝖈 υ ꜰ, 𝖋 𝖚 * ᵏ, 𝖋 00 𝕮 Ж , Κ C 𝖚 𝖋, ᶠ U С K, ꜰ 𝖀 𝖈 Κ , ɟ U ᶜ ⓚ , 𝒻 ∞ ȼ ᴋ, ƒ U К ć, ƒ υ ȼ ᴋ, ⒡ ∞ Ж ɕ, 𝖋 ᵘ 𝖈 ᵏ, F U Ϛ ʞ, ⓕ 𐌵 𝕮 Ж , 𝕱 𝒰 𝓀 Ↄ, Ϝ n * K, 𝓕 oo c ⓚ , ƒ U ¢ ʞ, ƒ u C ʞ, K ¢ μ ⒡ , ɟ ⒰ K ɔ, F U c k, F Ц 𝖈 ⓚ , 𝒻 U ᴋ ɔ, 𝖋 𝖀 Ꮯ 𝒦, 𝒻 𐌵 𝖈 ⓚ , ⓕ 𝖚 C К , ɟ ᵾ * ⒦ , ᶠ ᵘ ⒞ ⒦ , ƒ ⒰ ᴄ ᵏ, ⒡ ⒰ С K, 𝓕 ⒰ * ᴋ, ᆿ ∞ ʞ ɕ, 𝒻 n * Ѥ, Ϝ μ ᴄ 𝒦, k ć ᵘ ƒ, 𝓕 ᵘ ɕ 𝖐, ɟ Ц Ꮶ ᴄ, 𝓕 ᵾ ⒞ ᵏ, ғ ᵘ 𝒸 ᵏ, 𝖋 ᵾ * Ѥ, F 𝖚 Ꮯ K, ғ ⓤ 𝕮 ᴋ, ƒ u ɕ 𝖐, ƙ c ⒰ F, 𝒻 𝒰 ⓒ Κ , K ᶜ Ц 𝕱, ɟ 𝖚 c ⒦ , ƒ @ c Κ , Ϝ Ц ȼ Ḱ, ⒡ ᵘ 𝒞 ⒦ , ɟ ᵾ Ѥ ¢, F 𝖀 Ↄ 𝒦, Ϝ ᴜ 𝖐 𝖈, Ϝ 𝖀 ⒞ 𝖐, 𝕱 U Ꮯ ʞ, ƒ υ Ꮯ ᵏ, F ᵾ Ꮯ Κ , Ϝ ᵘ ⓒ ʞ, 𝓕 ⓤ ᶜ ƙ, ᆿ 𝒰 ⒞ 𝕶, f 𝖀 Ↄ Ѥ, 𝖋 U 𝒞 K, Ϝ ᴜ * 𝓀, ꜰ @ ⓒ ʞ, ƒ u ⓒ 𝒦, f U ⒞ k, 𝕱 00 ᴄ Ѥ, 𝒻 υ С K, F ᴜ ᴄ 𝕶, ⓕ oo Ↄ ⓚ , ⒡ ᵘ ɕ 𝓀, ⓕ υ ᴄ Κ , ᆿ U Ꮯ 𝕶, 𝒻 𝖀 Ꮯ Ꮶ, 𝖋 𐌵 Ć 𝓀, 𝓕 Ц ɕ К , f @ Ↄ ⓚ , ᴋ ᶜ U ꜰ, 𝓕 ᴜ c ⒦ , F ᵘ C 𝒦, 𝒻 00 𝖈 Ꮶ, ꜰ 00 𝖈 К , Ϝ 𝖚 Ϛ ᵏ, F 𐌵 c Ѥ, ⓕ oo Ↄ K, f ᵾ С ᵏ, ⓕ Ц c 𝒦, 𝓕 𐌵 c Ж , ⓕ 𝓊 𝒞 ƙ, ⓚ C n ғ, ɟ U ȼ 𝕶, 𝒻 00 K ȼ, 𝒻 𐌵 ᴄ 𝖐, 𝒻 Ц C 𝓀, 𝖋 Ц ¢ 𝓀, Ϝ ᵘ c k, ⒡ 𐌵 ¢ k, ƒ ⓤ ⓚ Ↄ, 𝒻 𐌵 𝕮 k, ƒ U Ↄ K, 𝓕 𝖀 ᴄ Ꮶ, ᆿ ⓤ 𝕮 ⒦ , Ж ɔ U 𝖋, ƒ υ * ᴋ, ƒ 𝓊 𝒞 k, 𝓕 U С ⒦ , 𝒻 𝖚 C Ж , ƒ μ Ꮯ ƙ, ⓕ n ᴄ ⒦ , ⓕ μ ⓒ Ж , ⒡ 00 ɕ 𝖐, 𝕱 ᴜ ᶜ 𝒦, ᆿ Ù Ж 𝖈, ⒦ ȼ U 𝖋, k C ⓤ ᆿ, Ϝ n ȼ ᵏ, ᴋ ȼ ᵾ ɟ, F 𝖀 ȼ Ѥ, ғ ⒰ ȼ 𝒦, f U Ж ⒞ , F ῠ 𝒸 ᵏ, F u 𝒸 Κ , F 00 ȼ 𝕶, ꜰ μ Ϛ Ꮶ, ᆿ 𝖀 𝒞 K, ⒡ n Ↄ Ж , F @ 𝒞 ƙ, ᶠ ὺ 𝒸 К , 𝒻 U C ᵏ, F U 𝖈 ⒦ , 𝒻 00 Ↄ 
𝕶, ᶠ 𝖚 c К , ғ ⓤ 𝒞 𝒦, 𝓕 ⓤ 𝖈 Κ , 𝒻 U 𝒸 Ж , ⒡ 𝖀 ɔ Ꮶ, ⓚ ɔ 𝓊 f, 𝒻 U C K, F @ C Ѥ, ғ ᴜ С k, ɟ u * ƙ, ⓕ ᵾ ɕ 𝒦, 𝕱 00 ȼ K, 𝒻 υ 𝓀 𝖈, ƒ ⒰ * ʞ, ⓕ U Ↄ Ж , ꜰ U ȼ ƙ, ⒡ u С ⒦ , ꜰ ᴜ 𝕮 Ќ, ᆿ μ 𝒞 ⒦ , ⓕ @ ᴄ К , ᶠ υ ɔ ᵏ, ƙ Ↄ oo ꜰ, F ᴜ 𝕮 𝒦, 𝓕 ⒰ C ᵏ, 𝖋 U 𝒸 ƙ, ƒ ∞ C Ꮶ, 𝒻 ⒰ * K, 𝒻 u Ↄ ᴋ, ᆿ U ⓒ 𝓀, ᆿ U Ꮶ 𝕮, 𝓕 n 𝒦 𝖈, ƒ Ц C ƙ, ⒦ 𝖈 𝒰 ꜰ, K ¢ ᵘ f, 𝕱 ⒰ 𝖈 Ꮶ, 𝓀 ᴄ 00 𝖋, Ϝ U 𝒞 k, 𝕱 u ¢ ⒦ , 𝕱 𝓊 * Ѥ, ƒ 𝖀 С ᴋ, 𝒻 𝖀 C Ꮶ, 𝖋 @ 𝕮 Κ , ʞ С 𝖀 ᶠ, 𝖋 ᵾ Ϛ Ꮶ, ᶠ ⒰ ɔ 𝒦, F Ц ⒞ ʞ, ⒡ ⒰ К ɔ, ɟ υ ¢ 𝕶, Ѥ ȼ U ᆿ, 𝒻 ᴜ Ↄ ʞ, ғ 𝓊 * K, 𝒻 𝒰 ᴄ ʞ, F 𝖀 𝖈 ʞ, 𝒻 @ ȼ 𝒦, 𝒻 ⒰ * 𝖐, 𝒻 ᵾ ȼ 𝒦, F 𐌵 ¢ Ѥ, ꜰ ⓤ ƙ Ϛ, ⓕ 00 c ʞ, 𝕱 00 Ϛ K, 𝖋 υ Ↄ Κ , ꜰ μ ⓒ Ж , 𝒻 ᵘ Ϛ ʞ, Ϝ ᵘ Ↄ ᵏ, ⒡ ᵾ Ꮯ 𝒦, Ϝ ⒰ ȼ Ѥ, ƒ n 𝒞 Ѥ, ᆿ μ ⓒ k, 𝖋 Ц ɕ Κ , ғ μ 𝕮 Ѥ, f ⓤ Ꮯ 𝖐, ᵏ 𝕮 μ ƒ, ᵏ С 𝖚 𝓕, ᆿ ∞ 𝖈 𝒦, ғ ᵘ Ꮯ 𝓀, ƒ μ Ↄ k, f oo K ȼ, ɟ 𝓊 𝕶 С , ꜰ n 𝖈 K, 𝒻 00 𝖈 ᵏ, ᶠ μ ⓒ 𝓀, 𝖐 c ∞ Ϝ, ᆿ Ц Ć ⒦ , 𝕱 ᵘ ᴄ 𝒦, F 00 𝕮 ⓚ , ᶠ @ ȼ К , ...

And that’s not all: there are at least a bazingatillion more where those came from. Do you see now why this fundamentally cannot be done?

Full Disclosure

Because I don't believe in security through obscurity, here's the program that generates all those:

#!/usr/bin/env perl
#
# unifuck - print infinite permutations of fuck in unicode aliases
#
# Tom Christiansen <[email protected]>
# Mon May 23 09:37:27 MDT 2011

use strict;
use warnings;
use charnames ":full";

use Unicode::Normalize;

binmode(STDOUT, ":utf8");

our(@diddle, @fuck, %fuck); # initted down below
while (my($f,$u,$c,$k) = splice(@fuck, 0, 4)) {
    $fuck{F}{$f}++;
    $fuck{U}{$u}++;
    $fuck{C}{$c}++;
    $fuck{K}{$k}++;
} 

my @F = keys %{ $fuck{F} };
my @U = keys %{ $fuck{U} };
my @C = keys %{ $fuck{C} };
my @K = keys %{ $fuck{K} };

while (1) { 
    my $f = $F[rand @F];
    my $u = $U[rand @U];
    my $c = $C[rand @C];
    my $k = $K[rand @K];

    for ($f,$u,$c,$k) {  
        next if length > 1;
        next if /\p{EA=W}/;
        next if /\pM/;
        next if /\p{InEnclosedAlphanumerics}/;
        s/$/$diddle[rand @diddle]/          if rand(100) < 15;
        s/$/\N{COMBINING ENCLOSING KEYCAP}/ if rand(100) <  1;
    }

    if    (             0) {                                       }
    elsif (rand(100) <  5) {     $u        = q(@)                  } 
    elsif (rand(100) <  5) {        $c     = q(*)                  } 
    elsif (rand(100) < 10) {       ($c,$k) = ($k,$c)               } 
    elsif (rand(100) < 15) { ($f,$u,$c,$k) = reverse ($f,$u,$c,$k) }

    print NFC("$f $u $c $k\n");
}

BEGIN {

    # ok to have repeats in each position, since they'll be counted only once
    # per unique strings
    @fuck = (

        "\N{LATIN CAPITAL LETTER F}",
        "\N{LATIN CAPITAL LETTER U}",
        "\N{LATIN CAPITAL LETTER C}",
        "\N{LATIN CAPITAL LETTER K}",

        "\N{LATIN SMALL LETTER F}",
        "\N{LATIN SMALL LETTER U}",
        "\N{LATIN SMALL LETTER C}",
        "\N{LATIN SMALL LETTER K}",

        "\N{LATIN SMALL LETTER F}",
        "\N{INFINITY}",
        "\N{LATIN SMALL LETTER C}",
        "\N{LATIN SMALL LETTER K}",

        "\N{LATIN SMALL LETTER F}",
        "\N{LATIN SMALL LETTER O}\N{LATIN SMALL LETTER O}",
        "\N{LATIN SMALL LETTER C}",
        "\N{KELVIN SIGN}",

        "\N{LATIN SMALL LETTER F}",
        "\N{DIGIT ZERO}\N{DIGIT ZERO}",
        "\N{CENT SIGN}",
        "\N{LATIN CAPITAL LETTER K}",

        "\N{LATIN LETTER SMALL CAPITAL F}",
        "\N{LATIN LETTER SMALL CAPITAL U}",
        "\N{LATIN LETTER SMALL CAPITAL C}",
        "\N{LATIN LETTER SMALL CAPITAL K}",

        "\N{MODIFIER LETTER SMALL F}",
        "\N{MODIFIER LETTER SMALL U}",
        "\N{MODIFIER LETTER SMALL C}",
        "\N{MODIFIER LETTER SMALL K}",

        "\N{MATHEMATICAL SCRIPT SMALL F}",
        "\N{MATHEMATICAL SCRIPT SMALL U}",
        "\N{MATHEMATICAL SCRIPT SMALL C}",
        "\N{MATHEMATICAL SCRIPT SMALL K}",

        "\N{MATHEMATICAL BOLD FRAKTUR CAPITAL F}",
        "\N{MATHEMATICAL BOLD FRAKTUR CAPITAL U}",
        "\N{MATHEMATICAL BOLD FRAKTUR CAPITAL C}",
        "\N{MATHEMATICAL BOLD FRAKTUR CAPITAL K}",

        "\N{MATHEMATICAL BOLD FRAKTUR SMALL F}",
        "\N{MATHEMATICAL BOLD FRAKTUR SMALL U}",
        "\N{MATHEMATICAL BOLD FRAKTUR SMALL C}",
        "\N{MATHEMATICAL BOLD FRAKTUR SMALL K}",

        "\N{MATHEMATICAL BOLD SCRIPT CAPITAL F}",
        "\N{MATHEMATICAL SCRIPT CAPITAL U}",
        "\N{MATHEMATICAL SCRIPT CAPITAL C}",
        "\N{MATHEMATICAL SCRIPT CAPITAL K}",

        "\N{CIRCLED LATIN SMALL LETTER F}",
        "\N{CIRCLED LATIN SMALL LETTER U}",
        "\N{CIRCLED LATIN SMALL LETTER C}",
        "\N{CIRCLED LATIN SMALL LETTER K}",

        "\N{PARENTHESIZED LATIN SMALL LETTER F}",
        "\N{PARENTHESIZED LATIN SMALL LETTER U}",
        "\N{PARENTHESIZED LATIN SMALL LETTER C}",
        "\N{PARENTHESIZED LATIN SMALL LETTER K}",

        "\N{GREEK CAPITAL LETTER GAMMA}\N{COMBINING SHORT STROKE OVERLAY}",
        "\N{GOTHIC LETTER QAIRTHRA}",
        "\N{CHEROKEE LETTER TLI}",
        "\N{CHEROKEE LETTER TSO}",

        "\N{LATIN SMALL LETTER F WITH HOOK}",
        "\N{GREEK SMALL LETTER MU}",
        "\N{LATIN SMALL LETTER C WITH CURL}",
        "\N{CYRILLIC CAPITAL LETTER IOTIFIED E}",

        "\N{CYRILLIC CAPITAL LETTER GHE}\N{COMBINING SHORT STROKE OVERLAY}",
        "\N{CYRILLIC CAPITAL LETTER TSE}",
        "\N{CYRILLIC CAPITAL LETTER ES}",
        "\N{CYRILLIC CAPITAL LETTER KA}",

        "\N{CYRILLIC SMALL LETTER GHE WITH STROKE}",
        "\N{LATIN SMALL CAPITAL LETTER U WITH STROKE}",
        "\N{LATIN SMALL LETTER C WITH STROKE}",
        "\N{LATIN SMALL LETTER K WITH HOOK}",

        "\N{GREEK LETTER DIGAMMA}",
        "\N{GREEK SMALL LETTER UPSILON}",
        "\N{GREEK LETTER STIGMA}",
        "\N{GREEK CAPITAL LETTER KAPPA}",

        "\N{HANGUL JONGSEONG KHIEUKH}",
        "\N{LATIN CAPITAL LETTER U}",
        "\N{ROMAN NUMERAL REVERSED ONE HUNDRED}",
        "\N{CYRILLIC CAPITAL LETTER ZHE}",

        "\N{LATIN SMALL LETTER DOTLESS J WITH STROKE}",
        "\N{LATIN SMALL LETTER N}",
        "\N{LATIN SMALL LETTER OPEN O}",
        "\N{LATIN SMALL LETTER TURNED K}",

        "\N{FULLWIDTH LATIN CAPITAL LETTER F}",
        "\N{FULLWIDTH LATIN CAPITAL LETTER U}",
        "\N{FULLWIDTH LATIN CAPITAL LETTER C}",
        "\N{FULLWIDTH LATIN CAPITAL LETTER K}",

    );

    @diddle = (
        "\N{COMBINING GRAVE ACCENT}",
        "\N{COMBINING ACUTE ACCENT}",
        "\N{COMBINING CIRCUMFLEX ACCENT}",
        "\N{COMBINING TILDE}",
        "\N{COMBINING BREVE}",
        "\N{COMBINING DOT ABOVE}",
        "\N{COMBINING DIAERESIS}",
        "\N{COMBINING CARON}",
        "\N{COMBINING CANDRABINDU}",
        "\N{COMBINING INVERTED BREVE}",
        "\N{COMBINING GRAVE TONE MARK}",
        "\N{COMBINING ACUTE TONE MARK}",
        "\N{COMBINING GREEK PERISPOMENI}",
        "\N{COMBINING FERMATA}",
        "\N{COMBINING SUSPENSION MARK}",
    );

}
Odilo answered 23/5, 2011 at 15:46 Comment(9)
I remember using all sorts of Unicode tricks to get around profanity filters a few years ago, and getting banned anyway. Good times. - Allinclusive
Now, if only SO's gods would read and understand this answer, and stop the stupid censoring. - Coburn
@Coburn well, until then, we can still use our Cyrillic letters when our problemmas really need sоlving. Here, take one: рҏѓґоьꙑӏеҽҿӗӎ; none of р,о,е looks any different from their Latin homoglyphs. - Tigress
I don't give a... You already give all of them. - Osei
Even in ASCII you can substitute "ph" for "f" etc. - Result
Just so it's said, someone who considers profanity to be that harmful (and has the time and patience -- and/or a big enough army of censors -- to examine every one of the million or so possible Unicode code points and billions of combinations of accents) could conceivably build a list of characters that each character "looks like" and/or "sounds like" or "masks". It'd be possible to eliminate every one of the variants listed here. It'd be outrageously tedious, though, and not worth it unless/until certain collections of letters are truly threatening civilization as we know it. - Daedalus
@Daedalus Actually, this is not as hard as you think, given the existence of confusables.txt, confusablesSummary.txt, and confusablesWholeScript.txt from Unicode Technical Report #36: “Unicode Security Considerations”. - Odilo
@tchrist: Those lists do help with very-similar (combinations of )?characters, but that's a tiny subset of the problem a profanity filter would have to deal with. They won't help much with ⓕ⒰cʞ, for example, considering the characters aren't exactly "confusable". (A quick glance fails to find and ʞ, for example, probably because the likelihood of someone legitimately misreading one as the other is slim.) After replacement, you'd end up with ⓕ(u)cʞ, which would likely get past any filter that doesn't either do additional translation or arbitrarily block everything outside of ASCII. - Daedalus
Actually it can be done. I know, because I did it, 7 years before this question was written. Not in PHP of course, and it is more involved than a regex, but it is a real-time algorithm that was 100% effective for the words it was trained to exclude; all alternative phonetic spellings were caught. - Azoic

\b checks for a word boundary. According to http://www.regular-expressions.info/wordboundaries.html:

There are three different positions that qualify as word boundaries:

  • Before the first character in the string, if the first character is a word character.
  • After the last character in the string, if the last character is a word character.
  • Between two characters in the string, where one is a word character and the other is not a word character.

"Word characters" are letters, digits, and underscores, so in the string "a$$", the word boundary occurs after the "a", not after the second "$".

You will probably need to explicitly specify the characters you consider to be "word boundaries" by using a class (e.g., [- '"]).
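
The behaviour described above can be checked directly. The sketch below is an illustration, not the asker's code: the helper name hits() and the sample strings are made up, and the second pattern swaps \b for the lookarounds (?<!\w) and (?!\w), which is one possible way of spelling out the boundary explicitly rather than the only one.

```php
<?php
// Hypothetical helper for this sketch: true when $pattern matches $text.
function hits(string $pattern, string $text): bool
{
    return preg_match($pattern, $text) === 1;
}

$quoted = preg_quote('a$$', '/');   // a\$\$
$suffix = '(?:ing|er|es|s)?';       // suffix list from the question

// \b after "$" only succeeds when a word character follows the "$".
$withBoundary = '/\b' . $quoted . $suffix . '\b/i';

// Lookarounds only assert "no word character on this side" and consume nothing.
$withLookaround = '/(?<!\w)' . $quoted . $suffix . '(?!\w)/i';

foreach (['such an a$$', 'such an a$$!', 'an a$$hole'] as $text) {
    printf("%-14s  \\b: %s  lookaround: %s\n", $text,
        var_export(hits($withBoundary, $text), true),
        var_export(hits($withLookaround, $text), true));
}
```

With these inputs the \b pattern misses "a$$" and "a$$!" but fires inside "a$$hole", because the only boundary after the symbols is the one in front of the "h". The lookaround pattern does the opposite and leaves "a$$hole" to its own dictionary entry.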

Halette answered 23/5, 2011 at 14:5 Comment(3)
I need to provide you a better output result; it's still not working despite how promising your answer sounds. Is there a way to get you the dictionary and class I am working with? - Juan
Add a snipt.org URL to your OP. - Halette
He may improve his patterns, but he'll never solve this problem. It cannot be done: see my answer for why. - Odilo

Now that you've said it fails only at the end of a word, I see the problem. $, @ and other such special characters aren't word characters, so for 'a$$' the last boundary that \b can find is the one right after the 'a', unless the symbols are followed by a letter in the input string. I suggest using [^a-z] to mark the end of the word to fix it.

preg_match_all("/\b".$f."(?:ing|er|es|s)?[^a-z]/si",$t,$m,PREG_SET_ORDER);
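
One caveat with the pattern above: [^a-z] has to match an actual character, so it finds nothing when the word is the very last thing in the input, and in a replacement it also swallows the character that follows the word. A hedged alternative, sketched below, is to assert "no letter here" with lookarounds instead. The censor() helper and the sample inputs are assumptions for illustration, not code from the question or this answer.

```php
<?php
// Hypothetical helper: masks every dictionary word found in $text.
function censor(string $text, array $words): string
{
    foreach ($words as $word) {
        $mask   = str_repeat('*', strlen($word)); // length of the raw word
        $quoted = preg_quote($word, '/');
        // (?<![a-z]) and (?![a-z]) check the neighbours without consuming them
        $pattern = '/(?<![a-z])' . $quoted . '(?:ing|er|es|s)?(?![a-z])/i';
        $text = preg_replace($pattern, $mask, $text);
    }
    return $text;
}

// Single quotes keep PHP from interpolating "$hole" as a variable.
$words = ['a$$', 'a$$hole', 'f@ck', 'f#ck', 'f*ck'];

// The consuming class: no match at end of input, and it eats the next character.
$consuming = '/\ba\$\$(?:ing|er|es|s)?[^a-z]/i';
var_dump(preg_match($consuming, 'you a$$'));               // int(0)
var_dump(preg_replace($consuming, '***', 'you a$$ now'));  // string(10) "you ***now"

echo censor('Input whatever you feel like here eg. a$$', $words), "\n";
echo censor('a$$hole and f@cking', $words), "\n";
```

The mask length comes from the unescaped word, so "a$$" becomes "***" rather than five asterisks. This only addresses the boundary problem and does nothing about look-alike spellings that are not in the dictionary.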
Anglican answered 23/5, 2011 at 11:54 Comment(3)
I need to provide you a better output result; it's still not working despite how promising your answer sounds. Is there a way to get you the dictionary and class I am working with? - Juan
It's easy to give bazingatillions of strings that will sneak past this approach. It is doomed to fail. - Odilo
OK but hang on, you're telling me it's impossible to preg_replace("a$$","***","you a$$"); ??? That doesn't sound right to me. I am not trying to find characters that resemble an "S", I am trying to run off a given set of words in the dictionary. If someone posts "a##hole" and it's not in the dictionary, then we'll add it into the dictionary??? - Juan
