What’s a good Python profanity filter library? [closed]
F

6

34

Like https://stackoverflow.com/questions/1521646/best-profanity-filter, but for Python — and I’m looking for libraries I can run and control myself locally, as opposed to web services.

(And whilst it’s always great to hear your fundamental objections of principle to profanity filtering, I’m not specifically looking for them here. I know profanity filtering can’t pick up every hurtful thing being said. I know swearing, in the grand scheme of things, isn’t a particularly big issue. I know you need some human input to deal with issues of content. I’d just like to find a good library, and see what use I can make of it.)

Foreman answered 20/8, 2010 at 14:20 Comment(2)
pip install -U expletives?Iodate
The better_profanity library has a fairly comprehensive list and handles lots of alternate spellings of words where characters (e.g. @, 3, 4) are substituted for letters (a, E, A respectively). I believe that the wordlist is customizable too.Shelley
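The character-substitution handling mentioned in the comment above can be roughly approximated with the standard library alone. This is only a minimal sketch: the substitution table, the placeholder word list, and the function names are all assumptions for illustration, and none of it is taken from better_profanity, whose implementation may well differ.

```python
# Hypothetical substitution table: look-alike characters mapped back to letters.
LEET_TABLE = str.maketrans({"@": "a", "3": "e", "4": "a", "0": "o", "1": "i", "$": "s"})

# Placeholder entries; a real list would be loaded from a file.
BLOCKLIST = {"darn", "heck"}


def normalize(word):
    """Lowercase a word and undo common character substitutions."""
    return word.lower().translate(LEET_TABLE)


def contains_blocked_word(text, blocklist=BLOCKLIST):
    """Return True if any whitespace-separated token normalizes to a blocked word."""
    for token in text.split():
        # Strip surrounding punctuation, but keep characters in the substitution table.
        core = token.strip(".,;:!?\"'()")
        if normalize(core) in blocklist:
            return True
    return False
```

Normalizing before the lookup keeps the word list itself small, at the cost of occasionally rewriting digits that were meant as digits.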
L
45

I didn't find any Python profanity library, so I made one myself.

Parameters


filterlist

A list of regular expressions that match forbidden words. Please do not use \b; it will be inserted depending on inside_words.

Example: [r'bad', r'un\w+']

ignore_case

Default: True

Self-explanatory.

replacements

Default: "$@%-?!"

A string with characters from which the replacements strings will be randomly generated.

Examples: "%&$?!" or "-" etc.

complete

Default: True

Controls if the entire string will be replaced or if the first and last chars will be kept.

inside_words

Default: False

Controls if words are searched inside other words too. Disabling this wraps each pattern in \b word boundaries, so only whole words match.

Module source


(examples at the end)

"""
Module that provides a class that filters profanities

"""

__author__ = "leoluk"
__version__ = '0.0.1'

import random
import re

class ProfanitiesFilter(object):
    def __init__(self, filterlist, ignore_case=True, replacements="$@%-?!", 
                 complete=True, inside_words=False):
        """
        Inits the profanity filter.

        filterlist -- a list of regular expressions that
        matches words that are forbidden
        ignore_case -- ignore capitalization
        replacements -- string with characters to replace the forbidden word
        complete -- completely remove the word or keep the first and last char?
        inside_words -- search inside other words?

        """

        self.badwords = filterlist
        self.ignore_case = ignore_case
        self.replacements = replacements
        self.complete = complete
        self.inside_words = inside_words

    def _make_clean_word(self, length):
        """
        Generates a random replacement string of a given length
        using the chars in self.replacements.

        """
        return ''.join([random.choice(self.replacements) for i in
                  range(length)])

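    def __replacer(self, match):
        value = match.group()
        if self.complete:
            return self._make_clean_word(len(value))
        if len(value) <= 2:
            # Nothing lies between the first and last char, so keep the match as-is
            return value
        return value[0] + self._make_clean_word(len(value) - 2) + value[-1]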

    def clean(self, text):
        """Cleans a string from profanity."""

        regexp_insidewords = {
            True: r'(%s)',
            False: r'\b(%s)\b',
            }

        regexp = (regexp_insidewords[self.inside_words] % 
                  '|'.join(self.badwords))

        r = re.compile(regexp, re.IGNORECASE if self.ignore_case else 0)

        return r.sub(self.__replacer, text)


if __name__ == '__main__':

    # Raw strings keep the regex escapes (\w) away from Python's string parser
    f = ProfanitiesFilter([r'bad', r'un\w+'], replacements="-")
    example = "I am doing bad ungood badlike things."

    print(f.clean(example))
    # Prints "I am doing --- ------ badlike things."

    f.inside_words = True
    print(f.clean(example))
    # Prints "I am doing --- ------ ---like things."

    f.complete = False
    print(f.clean(example))
    # Prints "I am doing b-d u----d b-dlike things."
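Since clean() joins the filterlist entries into a single regular expression, a plain word that happens to contain regex metacharacters will not be matched literally. A minimal sketch of one way around that, assuming a standalone build_pattern function that mirrors the pattern construction in clean() and a hypothetical literal() helper (neither is part of the module above):

```python
import re


def build_pattern(filterlist, ignore_case=True, inside_words=False):
    """Combine filter expressions into one compiled regex, mirroring clean()."""
    joined = "|".join(filterlist)
    template = r"(%s)" if inside_words else r"\b(%s)\b"
    return re.compile(template % joined, re.IGNORECASE if ignore_case else 0)


def literal(word):
    """Hypothetical helper: escape a plain word so it is matched verbatim."""
    return re.escape(word)


# Unescaped, 'x.y' would also match 'xzy'; 'un\w+' is deliberately left as a pattern.
pattern = build_pattern([literal("x.y"), r"un\w+"], inside_words=True)
print(pattern.sub("***", "x.y and xzy and ungood"))
# Prints "*** and xzy and ***"
```

Compiling once and reusing the pattern also avoids rebuilding the regex on every call, which may matter for long word lists.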
Longerich answered 20/8, 2010 at 17:26 Comment(8)
Profanity isn't primarily about words, but usage; most words which can be used as "profanity" have perfectly "clean" uses, and it takes a lot more than a regex to distinguish them. (Never mind, of course, that anything like this will only prompt people to wrk arund it.)Trouper
@Glenn: yes, we know. We know filtering isn’t a complete solution to whatever profanity problem one has. We just want to know what the decent libraries are.Foreman
@Paul: Are you including leoluk in "we"? Any "decent library" is going to need to perform lexical analysis, Bayesian heuristics or the like to discern different uses--not just run a regex. This code is cute, but isn't much more of a real-world solution than the bork below.Trouper
@Glenn: I wouldn’t dream of speaking for the good fellow. And not necessarily — because computers don’t understand English, the library is not going to be able to do the entire job itself, it’s going to need human help. So running a regex may turn out to be the right balance between power and comprehensible code. Hence I say “good library” and “decent library”, not “magical perfect library”.Foreman
@Paul: An approach that only searches for words without attempting to discern the context works fine for a small subset of language, but leaves a huge quantity of language undetectable. If blocking Carlin's list is all you want to do (to check a box on a feature requirement) then that's okay--but I think there's a significant area of practical analysis beyond that which can be done to make it something closer to practically useful. (@Brian's suggestion may be one, but I haven't tried it and they don't offer a public online demo. "Ask us for a demo" is never a promising sign.)Trouper
@Glenn: “If blocking Carlin's list is all you want to do (to check a box on a feature requirement)” — who said I want to do either of those things? You’re right that there is a lot of potential for doing something more useful than just regexing for words, but you haven’t offered any of that in an answer yet.Foreman
I pointed out that this solution is not very practically useful, and I did so because it seemed obvious that you were looking for something more than trivial word matching. I'm starting to regret wasting my time.Trouper
@GlennMaynard you're wrong; if I'm generating passwords or one-time codes or whatever I just want a basic filter. Practical use found.Madeleinemadelena
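The one-time-code use case from the last comment could look roughly like this. It is a sketch under stated assumptions: the blocklist entries are placeholders, and is_clean and generate_code are hypothetical names, not part of any library mentioned on this page.

```python
import secrets
import string

# Placeholder blocklist; substitute whatever word list you actually use.
BLOCKED_SUBSTRINGS = ("darn", "heck")


def is_clean(code, blocked=BLOCKED_SUBSTRINGS):
    """Return True if no blocked word occurs anywhere in the code (case-insensitive)."""
    lowered = code.lower()
    return not any(word in lowered for word in blocked)


def generate_code(length=8, alphabet=string.ascii_uppercase, max_attempts=1000):
    """Draw random codes until one passes the filter."""
    for _ in range(max_attempts):
        code = "".join(secrets.choice(alphabet) for _ in range(length))
        if is_clean(code):
            return code
    raise RuntimeError("no acceptable code found; the blocklist may be too broad")
```

Here a plain substring check is arguably the right tool: random strings have no context to analyse, and a false positive only costs one more draw.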
A
22
arrBad = [
'2g1c',
'2 girls 1 cup',
'acrotomophilia',
'anal',
'anilingus',
'anus',
'arsehole',
'ass',
'asshole',
'assmunch',
'auto erotic',
'autoerotic',
'babeland',
'baby batter',
'ball gag',
'ball gravy',
'ball kicking',
'ball licking',
'ball sack',
'ball sucking',
'bangbros',
'bareback',
'barely legal',
'barenaked',
'bastardo',
'bastinado',
'bbw',
'bdsm',
'beaver cleaver',
'beaver lips',
'bestiality',
'bi curious',
'big black',
'big breasts',
'big knockers',
'big tits',
'bimbos',
'birdlock',
'bitch',
'black cock',
'blonde action',
'blonde on blonde action',
'blow j',
'blow your l',
'blue waffle',
'blumpkin',
'bollocks',
'bondage',
'boner',
'boob',
'boobs',
'booty call',
'brown showers',
'brunette action',
'bukkake',
'bulldyke',
'bullet vibe',
'bung hole',
'bunghole',
'busty',
'butt',
'buttcheeks',
'butthole',
'camel toe',
'camgirl',
'camslut',
'camwhore',
'carpet muncher',
'carpetmuncher',
'chocolate rosebuds',
'circlejerk',
'cleveland steamer',
'clit',
'clitoris',
'clover clamps',
'clusterfuck',
'cock',
'cocks',
'coprolagnia',
'coprophilia',
'cornhole',
'cum',
'cumming',
'cunnilingus',
'cunt',
'darkie',
'date rape',
'daterape',
'deep throat',
'deepthroat',
'dick',
'dildo',
'dirty pillows',
'dirty sanchez',
'dog style',
'doggie style',
'doggiestyle',
'doggy style',
'doggystyle',
'dolcett',
'domination',
'dominatrix',
'dommes',
'donkey punch',
'double dong',
'double penetration',
'dp action',
'eat my ass',
'ecchi',
'ejaculation',
'erotic',
'erotism',
'escort',
'ethical slut',
'eunuch',
'faggot',
'fecal',
'felch',
'fellatio',
'feltch',
'female squirting',
'femdom',
'figging',
'fingering',
'fisting',
'foot fetish',
'footjob',
'frotting',
'fuck',
'fucking',
'fuck buttons',
'fudge packer',
'fudgepacker',
'futanari',
'g-spot',
'gang bang',
'gay sex',
'genitals',
'giant cock',
'girl on',
'girl on top',
'girls gone wild',
'goatcx',
'goatse',
'gokkun',
'golden shower',
'goo girl',
'goodpoop',
'goregasm',
'grope',
'group sex',
'guro',
'hand job',
'handjob',
'hard core',
'hardcore',
'hentai',
'homoerotic',
'honkey',
'hooker',
'hot chick',
'how to kill',
'how to murder',
'huge fat',
'humping',
'incest',
'intercourse',
'jack off',
'jail bait',
'jailbait',
'jerk off',
'jigaboo',
'jiggaboo',
'jiggerboo',
'jizz',
'juggs',
'kike',
'kinbaku',
'kinkster',
'kinky',
'knobbing',
'leather restraint',
'leather straight jacket',
'lemon party',
'lolita',
'lovemaking',
'make me come',
'male squirting',
'masturbate',
'menage a trois',
'milf',
'missionary position',
'motherfucker',
'mound of venus',
'mr hands',
'muff diver',
'muffdiving',
'nambla',
'nawashi',
'negro',
'neonazi',
'nig nog',
'nigga',
'nigger',
'nimphomania',
'nipple',
'nipples',
'nsfw images',
'nude',
'nudity',
'nympho',
'nymphomania',
'octopussy',
'omorashi',
'one cup two girls',
'one guy one jar',
'orgasm',
'orgy',
'paedophile',
'panties',
'panty',
'pedobear',
'pedophile',
'pegging',
'penis',
'phone sex',
'piece of shit',
'piss pig',
'pissing',
'pisspig',
'playboy',
'pleasure chest',
'pole smoker',
'ponyplay',
'poof',
'poop chute',
'poopchute',
'porn',
'porno',
'pornography',
'prince albert piercing',
'pthc',
'pubes',
'pussy',
'queaf',
'raghead',
'raging boner',
'rape',
'raping',
'rapist',
'rectum',
'reverse cowgirl',
'rimjob',
'rimming',
'rosy palm',
'rosy palm and her 5 sisters',
'rusty trombone',
's&m',
'sadism',
'scat',
'schlong',
'scissoring',
'semen',
'sex',
'sexo',
'sexy',
'shaved beaver',
'shaved pussy',
'shemale',
'shibari',
'shit',
'shota',
'shrimping',
'slanteye',
'slut',
'smut',
'snatch',
'snowballing',
'sodomize',
'sodomy',
'spic',
'spooge',
'spread legs',
'strap on',
'strapon',
'strappado',
'strip club',
'style doggy',
'suck',
'sucks',
'suicide girls',
'sultry women',
'swastika',
'swinger',
'tainted love',
'taste my',
'tea bagging',
'threesome',
'throating',
'tied up',
'tight white',
'tit',
'tits',
'titties',
'titty',
'tongue in a',
'topless',
'tosser',
'towelhead',
'tranny',
'tribadism',
'tub girl',
'tubgirl',
'tushy',
'twat',
'twink',
'twinkie',
'two girls one cup',
'undressing',
'upskirt',
'urethra play',
'urophilia',
'vagina',
'venus mound',
'vibrator',
'violet blue',
'violet wand',
'vorarephilia',
'voyeur',
'vulva',
'wank',
'wet dream',
'wetback',
'white power',
'women rapping',
'wrapping men',
'wrinkled starfish',
'xx',
'xxx',
'yaoi',
'yellow showers',
'yiffy',
'zoophilia']

def profanityFilter(text):
    # Split on whitespace, so only single-word entries of arrBad can match
    brokenStr1 = text.split()
    badWordMask = '!@#$%!@#$%^~!@%^~@#$%!@#$%^~!'
    for word in brokenStr1:
        if word in arrBad:
            print(word + ' <--Bad word!')
            # str.replace rewrites every occurrence, including inside longer words
            text = text.replace(word, badWordMask[:len(word)])

    return text

print(profanityFilter("this thing sucks sucks sucks fucking stuff"))

You can add to or remove from the bad words list, arrBad, as you please.
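The function above splits the text on whitespace, so multi-word entries in the list never match, and str.replace also rewrites a flagged word where it sits inside a longer word. One possible variation is to compile the list into a single regular expression. This is a sketch only: the entries are placeholders standing in for arrBad, and compile_entries and mask_entries are hypothetical names.

```python
import re

# Placeholder entries standing in for the real list; note the multi-word phrase.
bad_entries = ["heck", "darn it", "darn"]


def compile_entries(entries):
    """Build one case-insensitive regex matching any entry as a whole word or phrase."""
    # Longest first, so 'darn it' is tried before 'darn'.
    ordered = sorted(entries, key=len, reverse=True)
    body = "|".join(re.escape(e) for e in ordered)
    # Lookarounds instead of \b, so entries that start or end with punctuation still match.
    return re.compile(r"(?<!\w)(?:%s)(?!\w)" % body, re.IGNORECASE)


def mask_entries(text, pattern, mask_char="*"):
    """Replace each match with mask characters of the same length."""
    return pattern.sub(lambda m: mask_char * len(m.group()), text)


pattern = compile_entries(bad_entries)
print(mask_entries("Oh darn it, what the heck is a heckler?", pattern))
# Prints "Oh *******, what the **** is a heckler?"
```

Phrases with irregular spacing or substituted characters would still slip through; those need normalization before matching.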

Alt answered 17/7, 2013 at 17:7 Comment(4)
There are some phrases in here that are gems that I actually had to look up because I hadn't heard them before.Polley
Regular expressions are helpful for a bad words list. A co-worker wrote a Perl-based bad words list for a commenting system that looked for things like substituted characters in bad words.Plauen
Is there a library which recognizes phrases as obscene? For example, general filters available now won't take '2 girls 1 cup1' as profane. I tried this but even if I add custom phrases as strings, it doesn't workVineland
Why not mention the source of the bad word list you used here?Treillage
R
5

WebPurify is a Profanity Filter Library for Python

Rubenstein answered 29/10, 2012 at 16:9 Comment(2)
I recommend WebPurify as well. You can find the Python extension on this page: webpurify.com/documentation/additional/extensionsLuca
WebPurify is an online service which can supposedly find profanities in images and videos, but like any service, it's going to add latency. Other libraries process locally against a list of profane words, which is always going to be quicker than making API calls.Lohrman
E
4

You could probably combine http://spambayes.sourceforge.net/ and http://www.cs.cmu.edu/~biglou/resources/bad-words.txt.

Endurance answered 7/10, 2010 at 2:40 Comment(1)
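That is a good list. Pulled simply in Python 3:

def obscenities():
    from urllib.request import urlopen
    # Full URL with scheme, as given in the answer; urlopen rejects a bare host name
    resp = urlopen("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt")
    # Decode the bytes instead of splitting the repr of a bytes object
    badwords = resp.read().decode("utf-8", errors="replace").split("\n")
    return badwords

Komarek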
K
1

Profanity? What the f***'s that? ;-)

It will still take a couple of years before a computer will really be able to recognize swearing and cursing and it is my sincere hope that people will have understood by then that profanity is human and not "dangerous."

Instead of a dumb filter, have a smart human moderator who can balance the tone of discussion as appropriate. A moderator who can detect abuse like:

"If you were my husband, I'd poison your tea." - "If you were my wife, I'd drink it."

(that was from Winston Churchill, btw.)

Kenn answered 20/8, 2010 at 14:39 Comment(7)
Exactly. Profanity filters are pointless, at least until natural language parsers are much better.Clericals
@delnan: I guess because I asked what a good profanity filter library was, not whether I should use one at all. Suggestions like this can be better as comments, although they can be valid as answers too.Foreman
@Aaron: yeah, I’m not planning to have the machine deal with profanity on its own. But rather than making a human being look at every damn thing on the site, it’d be nice if the machine could offer suggestions of what’s worth taking a look at. (That’s not a criticism of your answer, as I didn’t provide any explanation of what I was going to use the filter for.)Foreman
@Aaron: oh, and I reckon it’ll be a lot longer than a couple of years before computers reliably understand English. And that the subset of people who care about the swears will not have gone away.Foreman
I downvote this - I categorically disagree with the concept of profanity being neutral.Newtonnext
@Paul: Calm down, you're missing my point. My point is that abuse is not only profanity. People have much worse ways to mob each other than simple curse words. This is what kills the soul of a community, not s&c.Kenn
Personally, as a moderator, I'd let that one through on account of sheer quality.Succinctorium
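The use case raised in the comments above, where the machine only suggests what a human should look at, can be sketched in a few lines. The watchlist entries are placeholders and review_queue is a hypothetical name; nothing here decides anything on its own, it only orders posts for a moderator.

```python
import re

# Placeholder word list; any list discussed on this page could be swapped in.
WATCHLIST = ["heck", "darn"]
WATCH_RE = re.compile(r"\b(?:%s)\b" % "|".join(map(re.escape, WATCHLIST)), re.IGNORECASE)


def review_queue(posts):
    """Return (index, hits) pairs for posts a moderator should look at, most hits first."""
    flagged = []
    for index, text in enumerate(posts):
        hits = WATCH_RE.findall(text)
        if hits:
            flagged.append((index, [h.lower() for h in hits]))
    # The machine only orders the queue; the decision stays with the human.
    flagged.sort(key=lambda item: len(item[1]), reverse=True)
    return flagged
```

Returning the matched words alongside the index lets the moderator see at a glance why a post was queued.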
T
0

It's possible for users to work around this, of course, but it should do a fairly thorough job of removing profanity:

import re
def remove_profanity(s):
    def repl(word):
        m = re.match(r"(\w+)(.*)", word)
        if not m:
            return word
        word = "Bork" if m.group(1)[0].isupper() else "bork"
        word += m.group(2)
        return word
    return " ".join([repl(w) for w in s.split(" ")])

print(remove_profanity("You just come along with me and have a good time. The Galaxy's a fun place. You'll need to have this fish in your ear."))
Trouper answered 7/10, 2010 at 2:34 Comment(0)
