This is quite a difficult problem to solve, and you need to determine whether regular expressions will work for you and how you will handle embedding (when a profanity is joined to a dictionary word, like frackface except with the real F-word).
Many regular expression engines limit how long a pattern can be, and this usually prevents you from using a single regex for all your words. Executing many separate regular expressions against each string can be very slow, depending on the performance you need and how big your blacklist gets. We initially implemented CleanSpeak as a regular expression system, but it didn't scale and we rewrote it using a different mechanism.
You also need to consider phrases, punctuation, spaces, leet-speak and other languages. All of these make regular expressions less appealing as a solution. Here are some examples using the word hello (assume it is profanity for this exercise):
- h e l l o
- h.e.l.l.o
- h_e_l_l_o
- |-|ello
- h3llo
- "hello there" (a phrase whose individual words might not be profane but which is profane as a whole)
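To make the variants above concrete, here is a minimal sketch of a normalization pass that maps several of them back to the plain word before any matching happens. It is only an illustration and not how CleanSpeak works; `LEET_MAP`, `SPACED_LETTERS`, and `normalize` are hypothetical names, and the substitution table is a tiny assumed sample.

```python
import re

# Hypothetical, deliberately tiny substitution table; a real one would be far larger.
LEET_MAP = {"|-|": "h", "3": "e", "1": "l", "0": "o", "@": "a", "$": "s"}

# A run of single letters separated by one space, dot, underscore, or hyphen.
SPACED_LETTERS = re.compile(r"(?<![a-z])[a-z](?:[ ._-][a-z])+(?![a-z])")


def normalize(text: str) -> str:
    """Lowercase, undo a few leet substitutions, and rejoin spelled-out letters."""
    text = text.lower()
    # Replace longer tokens first so "|-|" is handled before single characters.
    for token in sorted(LEET_MAP, key=len, reverse=True):
        text = text.replace(token, LEET_MAP[token])
    # Drop the separators inside runs such as "h e l l o" or "h.e.l.l.o".
    return SPACED_LETTERS.sub(lambda m: re.sub(r"[ ._-]", "", m.group()), text)


for sample in ["h e l l o", "h.e.l.l.o", "h_e_l_l_o", "|-|ello", "h3llo"]:
    print(repr(sample), "->", repr(normalize(sample)))
```

Even this small sketch has obvious gaps: it rewrites every digit in the table (so an ordinary "3" becomes "e"), and it does nothing for phrases or other languages.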
You also need to handle edge cases where two or more adjacent dictionary (whitelist) words spell out a profanity across their boundary. Some examples that contain the s-word:
- bash it
- ssh it's quiet time
These are obviously not profanity, but most homegrown and many commercial solutions have problems with these cases.
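The sketch below shows why these cases trip up simple filters, again using hello as the stand-in profanity, so "shell out" plays the role of the examples above. `naive_match` and `token_aligned_match` are hypothetical helpers written for this illustration. The second one is just one possible way to respect word boundaries, not the mechanism CleanSpeak uses.

```python
import re

BLACKLIST = {"hello"}  # stand-in profanity for this exercise


def naive_match(text: str) -> bool:
    """Strip everything but letters, then look for a blacklisted substring."""
    squashed = re.sub(r"[^a-z]", "", text.lower())
    return any(word in squashed for word in BLACKLIST)


def token_aligned_match(text: str) -> bool:
    """Flag only runs of whole tokens that join to exactly a blacklisted word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    longest = max(len(word) for word in BLACKLIST)
    for start in range(len(tokens)):
        joined = ""
        for token in tokens[start:]:
            joined += token
            if len(joined) > longest:
                break  # no blacklisted word can be this long
            if joined in BLACKLIST:
                return True
    return False


for sample in ["shell out", "Othello", "h e l l o", "he llo there"]:
    print(repr(sample), naive_match(sample), token_aligned_match(sample))
```

The naive version flags "shell out" and "Othello" because the letters of hello appear once spaces are removed. The token-aligned version avoids both false positives, but by design it also misses true embedding (the frackface case), which would need a dictionary-aware step on top.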
We have spent the last 3 years perfecting the filter used by CleanSpeak to ensure it handles all of these cases, and we continue to tweak it and make it better. We also spent 8 months tuning our system for performance, and it can handle about 5,000 messages per second. This is not to say you can't build something usable, but be prepared to handle a lot of issues that might come up, and to create a system that doesn't use regular expressions.