"bad words" filter [closed]
B

9

54

Not very technical, but... I have to implement a bad-words filter in a new site we are developing, so I need a "good" bad-words list to feed my DB with. Any hints or directions? Looking around with Google I found this one, and it's a start, but nothing more.

Yes, I know that this kind of filter is easily evaded... but the client's will is the client's will! :-)

The site will have to filter out both English and Italian words, but for Italian I can ask my colleagues to help me with a community-built list of "parolacce" :-) - an email will do.

Thanks for any help.

Balladry answered 23/8, 2008 at 19:17 Comment(8)
Obscenity filtering... a bad idea or a really intercoursing bad idea?Sale
team it up with a spellchecker, if you get more spelling errors post-censorship, you've messed up somewhere and can deal with itBeatitude
related: programmers.stackexchange.com/questions/143405/…Finnougric
Very few filters can detect the words "Shiτ" and "fucκ", though. Not even StackOverflow.Monomania
To everyone saying that this is pointless and/or stupid, consider that this kind of filtering could still be useful as one part of a larger system. Yes, it's probably a bad idea to find/replace or automatically reject based purely on a blacklist, but a filter could be used, for example, to send user-submitted content for manual approval/moderation. Or perhaps it could be used to warn a user before submission that they may be banned if they post offensive material.Cady
This is great for web-based educational software to flag student responses that "contain profanity", which can then be relayed to the teachers for review. I created an ASCII folding map, in which I hand-mapped all 65,000+ Unicode code points to their closest visual ASCII equivalent if one exists. I then did the same for all permutations of 2-, 3-, and 4-character sequences using a visual similarity engine, to collapse them to their nearest single-character equivalent (e.g. "\/\/" = "W", "|-|" = "H", "|_" = "L"), and then used a hierarchical temporal memory algorithm to recognize them instantly.Tabbitha
After much munging and collection: github.com/alvations/expletives/tree/masterProterozoic
Hi @triynko If you are willing to share code I would be interested. Interesting idea.Incision
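A couple of the comments above suggest folding look-alike characters to ASCII and using the filter to route posts to moderation rather than rejecting them outright. A minimal Python sketch of that idea follows. The `CONFUSABLES` table, the `fold` and `needs_review` names, and the sample blocklists are all made up for illustration, and a real look-alike table would need to be far larger than this one.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny look-alike table (an assumption, not a real dataset).
CONFUSABLES = str.maketrans({"τ": "t", "κ": "k", "@": "a", "$": "s", "0": "o"})


def fold(text: str) -> str:
    """Reduce text to a rough lower-case ASCII skeleton."""
    # NFKD splits accented letters into base letter + combining mark
    # and maps compatibility forms (e.g. full-width letters) to plain ones.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lower-case first so upper-case look-alikes hit the same table entries.
    return stripped.lower().translate(CONFUSABLES)


def needs_review(text: str, blocklist: set) -> list:
    """Return the blocklisted words found as whole words, for a human to review."""
    words = re.findall(r"[a-z]+", fold(text))
    return sorted(set(words) & set(blocklist))


if __name__ == "__main__":
    print(needs_review("What the hec\u03ba!", {"heck"}))
    print(needs_review("a classic car", {"ass"}))
```

Because the function only reports matches, the caller decides what to do with them: queue the post for a moderator, warn the user, or ignore the hit.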
M
39

I didn't see any language specified, but you can use this for PHP. It will generate a RegEx for each inserted word, so that even intentional misspellings (e.g. @ss, i3itch) will also be caught.

<?php

/**
 * @author [email protected]
 **/

if(($_GET['act'] ?? '') == 'do')
 {
    // Each letter maps to a regex fragment matching its common look-alikes.
    // Single characters go in a character class (no spaces, which would
    // otherwise match literally); multi-character substitutes such as
    // "i3", "ph" or "vv" need alternation instead.
    $replace = array(
        'a' => '[aA@]',
        'b' => '(?:[bB]|[Iil]3)',
        'c' => '[cC(kK]',
        'd' => '[dD]',
        'e' => '[eE3]',
        'f' => '(?:[fF]|[pP][hH])',
        'g' => '[gG6]',
        'h' => '[hH]',
        'i' => '[iIl!1]',
        'j' => '[jJ]',
        'k' => '[cC(kK]',
        'l' => '[lL1!i]',
        'm' => '[mM]',
        'n' => '[nN]',
        'o' => '[oO0]',
        'p' => '[pP]',
        'q' => '[qQ9]',
        'r' => '[rR]',
        's' => '[sS$5]',
        't' => '[tT7]',
        'u' => '[uUvV]',
        'v' => '[vVuU]',
        'w' => '(?:[wW]|vv|VV)',
        'x' => '[xX]',
        'y' => '[yY]',
        'z' => '[zZ2]',
    );
    $word = str_split(strtolower($_POST['word'] ?? ''));
    foreach($word as $i => $char)
     {
        // Letters become their look-alike fragment; digits, spaces and
        // anything else are escaped so they match literally.
        $word[$i] = isset($replace[$char]) ? $replace[$char] : preg_quote($char, '/');
     }
    //$word = "/" . implode('', $word) . "/";
    echo htmlspecialchars(implode('', $word));
 }

if(($_GET['act'] ?? '') == 'list')
 {
    // The mysql_* functions were removed in PHP 7; mysqli replaces them.
    $link = mysqli_connect('localhost', 'username', 'password', 'peoples');
    $result = mysqli_query($link, "SELECT word FROM filters");
    while($row = mysqli_fetch_assoc($result))
     {
        echo htmlspecialchars($row['word']) . "<br />";
     }
     echo '<hr>';
 }
?>
<html>
    <head>
        <title>RegEx Generator</title>
    </head>
    <body>
        <form action='badword.php?act=do' method='post'>
            Word: <input type='text' name='word' /><br />
            <input type='submit' value='Generate' />
        </form>
        <a href="badword.php?act=list">List Words</a>
    </body>
</html>
Merca answered 23/8, 2008 at 21:27 Comment(4)
On't-day orget-day ig-pay atin-lay. Urse-cay ords-way are-ar ill-st ite-quay eadable-ray. (former owner of the AOL nick Itshay).Stereo
you mean "On't-day orget-fay"Hyposthenia
This is a great reference, thank you for that. In application, however, I'm not sure how changing "hamburger" to "[h H][a A @][m M][b B I3 l3 i3][u U v V][r R][g G 6][e E 3][r R]" is going to help filter profanity.Angary
@Angary Sometimes users attempt to bypass bad word filters by using other characters instead of the conventional letters; instead of A, one could use @ to say a bad word. However, it may also help by including lower- and upper-case characters. For example, you have a database of bad words, from which you pass this code on, and use it to detect a bad word even if it was misspelled or tweaked through these means.Collude
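To make the comment above concrete, here is a rough Python sketch of how a generated pattern might be applied to user text. The `LOOKALIKES` table and the `word_to_pattern` name are assumptions made up for this example, and the table is only a small sample of possible substitutions.

```python
import re

# Hypothetical look-alike table: one regex fragment per letter.
# Single-character substitutes sit in a character class; multi-character
# ones (like "i3" for "b") need alternation.
LOOKALIKES = {
    "a": "[a@4]",
    "b": "(?:b|[il]3)",
    "e": "[e3]",
    "i": "[i!1l]",
    "o": "[o0]",
    "s": "[s$5]",
    "t": "[t7]",
}


def word_to_pattern(word: str) -> re.Pattern:
    """Compile a case-insensitive, whole-word pattern for one listed word."""
    parts = [LOOKALIKES.get(ch, re.escape(ch)) for ch in word.lower()]
    # Lookarounds are used instead of \b because \b does not fire between
    # a space and a symbol such as "@" or "$".
    return re.compile(r"(?<!\w)" + "".join(parts) + r"(?!\w)", re.IGNORECASE)


if __name__ == "__main__":
    pattern = word_to_pattern("hamburger")
    for sample in ("one h@mburg3r please", "HAMBURGER", "hamburgers"):
        print(sample, "->", bool(pattern.search(sample)))
```

Each word in the database would be compiled once and then run against incoming posts, so the disguised spelling is caught without storing every variant.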
D
60

Beware of clbuttic mistakes.

"Apple made the clbuttic mistake of forcing out their visionary - I mean, look at what NeXT has been up to!"

Hmm. "clbuttic".

Google "clbuttic" - thousands of hits!

There's someone who calls his car 'clbuttic'.

There are "Clbuttic Steam Engine" message boards.

Webster's dictionary - no help.

Hmm. What can this be?

HINT: People who make buttumptions about their regex scripts, will be embarbutted when they repeat this mbuttive mistake.
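For anyone who has not seen how this happens, a small Python sketch reproduces it. The function names `naive_censor` and `collateral_damage` are invented for this illustration; the point is only that blind substring replacement rewrites innocent words.

```python
def naive_censor(text: str, bad: str, replacement: str) -> str:
    """Blind substring replacement: the approach this answer warns against."""
    return text.replace(bad, replacement)


def collateral_damage(text: str, bad: str, replacement: str) -> list:
    """List (original, mangled) pairs for words that are not the bad word itself."""
    damaged = []
    for word in text.split():
        censored = naive_censor(word, bad, replacement)
        is_exact = word.strip(".,!?;:\"'()").lower() == bad
        if censored != word and not is_exact:
            damaged.append((word, censored))
    return damaged


if __name__ == "__main__":
    print(naive_censor("a classic mistake", "ass", "butt"))
    print(collateral_damage("classic assumptions about a massive ass", "ass", "butt"))
```

Running a check like `collateral_damage` over a sample of real posts before deploying a word list is one cheap way to see how many harmless words it would mangle.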

Decathlon answered 23/8, 2008 at 19:30 Comment(0)
O
40

Shutterstock has a GitHub repo with a list of bad words used for filtering.

You can check it out here: https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words

Overglaze answered 9/3, 2012 at 5:28 Comment(2)
It's a bit much though - "Mr Hands" is offensive apparently.Behn
The french DB is bad...Forked
N
7

If anyone needs an API, Google currently provides a bad word indicator.

http://www.wdyl.com/profanity?q=naughtyword

{
response: "false"
}

Update: Google has now removed this service.

Noland answered 3/8, 2012 at 18:52 Comment(2)
Doesn't seem to be active anymore.Uncomfortable
Seeing as that list is down, raw.githubusercontent.com/RobertJGabriel/Google-profanity-words/… is an option.Noland
V
4

I would say to just remove posts as you become aware of them, and block users who are overly explicit with their postings. You can say very offensive things without using any swear words. If you block the word ass (aka donkey), then people will just type a$$ or /\55, or whatever else they need to type to get past the filter.

Volz answered 24/8, 2008 at 1:23 Comment(0)
C
4

+1 on the Clbuttic mistake. I think it is important for "bad word" filters to scan for both leading and trailing spaces (e.g., " ass ") as opposed to just the exact string, so that we won't have words like clbuttic, clbuttes, buttert, buttess, etc.

Croquette answered 30/8, 2008 at 8:21 Comment(2)
And don't block the town of Scunthorpe.Canzonet
Unfortunately, that doesn't get rid of curses at the beginning of a paragraph or near punctuation. If I had a paragraph that consisted of "(badword)!", it would fail your test.Saad
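The objection in the last comment can be checked with a short Python sketch comparing a space-padded test with a regex word-boundary test. Both function names are made up here, and "heck" stands in for a real list entry.

```python
import re


def space_padded_match(text: str, bad: str) -> bool:
    """The check proposed in the answer: the word surrounded by spaces."""
    return f" {bad} " in text


def boundary_match(text: str, bad: str) -> bool:
    """Whole-word check using regex word boundaries instead of literal spaces."""
    pattern = r"\b" + re.escape(bad) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


if __name__ == "__main__":
    for sample in ("(heck)!", "Heck, no", "what the heck is that", "checkheckers"):
        print(repr(sample), space_padded_match(sample, "heck"), boundary_match(sample, "heck"))
```

The word-boundary version still leaves embedded matches alone, which is what keeps place names like Scunthorpe safe, but it also catches the word at the start of a paragraph or next to punctuation.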
U
2

Wikipedia ClueBot has a bad word filter, read its source.

http://en.wikipedia.org/wiki/User:ClueBot/Source#Score_list

Unpretentious answered 2/9, 2010 at 4:29 Comment(0)
T
1

You could always convince the client to have a session of users just constantly posting expletives and make an easy solution to add them to the system. It is a lot of work but it will probably be more representative of the community.

Teenybopper answered 23/8, 2008 at 22:3 Comment(0)
M
-2

In researching this topic I determined that what was needed was more than just a list that does arbitrary replacements. I have built a web service that allows you to identify the level of 'cleanliness' you desire. It also makes an effort to identify false positives - i.e. where a word may be bad in one context but not in others. Take a look at http://filterlanguage.com

Maren answered 2/9, 2010 at 4:23 Comment(1)
The url was unreachable.Disconsider
