What's the best way to parse a string for "bad" words in C#?
A

3

9

I'm thinking of something like:

foreach (var word in paragraph.Split(' ')) {
  if (badWordArray.Contains(word)) {
    // do something about it
  }
}

but I'm sure there's a better way.

Thanks in advance!

UPDATE: I'm not looking to remove obscenities automatically. For my web app, I want to be notified if a word I deem "bad" is used. Then I'll review it myself to make sure it's legit. An auto-flagging system of sorts.

Apostatize answered 9/7, 2010 at 3:20 Comment(2)
I went ahead and edited my solution in response to your update. Let me know if that answers your question. - Chloride
possible duplicate of How do you implement a good profanity filter? - Thready
C
16

While your way works, it may be a bit time-consuming. There is a wonderful response here for a previous SO question. Though that question is about PHP rather than C#, I think it can be ported easily.

Edit to add sample code:

public string FilterWords(string inputWords) {
    Regex wordFilter = new Regex("(puppies|kittens|dolphins|crabs)");
    return wordFilter.Replace(inputWords, "<3");
}

That should work for you, more or less.

Edit to answer OP clarification:

I'm not looking to remove obscenities automatically... for my web app, I want to be notified if a word I deem "bad" is used.

Much like the replacement above, you can check whether anything matches like so:

public bool HasBadWords(string inputWords) {
    Regex wordFilter = new Regex("(puppies|kittens|dolphins|crabs)");
    return wordFilter.IsMatch(inputWords);
}

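It will return true if the string you pass to it contains any of the words in the list. Because the pattern has no \b word boundaries, a listed word also matches when it appears inside a longer word.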
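Since the goal is flagging for manual review, it may help to know which words matched rather than just getting a yes or no. Here is a minimal sketch, not a definitive implementation. The class and method names (BadWordFlagger, FindBadWords) are hypothetical, the word list is the same placeholder list as above, and the \b word boundaries and case-insensitive matching are assumptions about the desired behavior. Drop the \b if you also want to catch a listed word inside a longer one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: reports which listed words appear in the input.
public static class BadWordFlagger
{
    // Placeholder list from the samples above; \b restricts matches to whole words.
    private static readonly Regex WordFilter = new Regex(
        @"\b(puppies|kittens|dolphins|crabs)\b",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns each distinct flagged word found in the input, lowercased.
    public static List<string> FindBadWords(string input)
    {
        return WordFilter.Matches(input)
            .Cast<Match>()
            .Select(m => m.Value.ToLowerInvariant())
            .Distinct()
            .ToList();
    }
}
```

An empty list means nothing was flagged. A non-empty list can be stored alongside the post so the reviewer sees why it was flagged.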

Chloride answered 9/7, 2010 at 3:25 Comment(4)
If you're going to do this, don't forget the \b. It's a clbuttic mistake. - Exceed
Haha well done. The word boundary is important for sure, but if you want to filter for things like redkittens or crabsapples, this would do it. - Chloride
Thank you, I think a combination of your answer and Detmar's is what I'll end up doing. Much appreciated. - Apostatize
I take it the regex way is more efficient than the looping way and only needs 1 pass? - Dangerous
A
4

At my job we put some automatic bad word filtering into our software (it's kind of shocking to be browsing the source and suddenly run across the array containing several pages of obscenity).

One tip is to pre-process the user input before testing it against your list, in case someone is trying to sneak something past you. So, by way of preprocessing, we:

  • uppercase everything in the input
  • remove most non-alphanumerics (that is, just splice out any spaces, or punctuation, etc.)
  • and then, assuming someone is trying to pass off digits for letters, do something like this: replace zero with O, 9 with G, 5 with S, etc. (get creative)

And then get some friends to try to break it. It's fun.
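The preprocessing steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from that job: the names InputNormalizer, Normalize and ContainsAny are hypothetical, and only the 0, 9 and 5 substitutions come from the list above; the other digit mappings are guesses added for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper: normalizes input before checking it against a word list.
public static class InputNormalizer
{
    // 0, 9 and 5 come from the list above; the remaining pairs are illustrative guesses.
    private static readonly Dictionary<char, char> DigitToLetter = new Dictionary<char, char>
    {
        { '0', 'O' }, { '9', 'G' }, { '5', 'S' },
        { '1', 'I' }, { '3', 'E' }, { '4', 'A' }, { '7', 'T' }
    };

    public static string Normalize(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char raw in input.ToUpperInvariant())       // step 1: uppercase everything
        {
            if (!char.IsLetterOrDigit(raw)) continue;        // step 2: splice out spaces, punctuation, etc.
            char mapped;
            sb.Append(DigitToLetter.TryGetValue(raw, out mapped) ? mapped : raw); // step 3: digits to letters
        }
        return sb.ToString();
    }

    // Substring check, because word boundaries are gone after normalization.
    // The words passed in are assumed to be uppercase already.
    public static bool ContainsAny(string input, IEnumerable<string> upperCaseBadWords)
    {
        string normalized = Normalize(input);
        return upperCaseBadWords.Any(w => normalized.Contains(w));
    }
}
```

Because spaces are removed, a listed word can appear by accident across two innocent words, so this approach suits flagging for human review better than automatic replacement.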

Andrel answered 9/7, 2010 at 5:3 Comment(2)
I like this... simple and effective for my purposes. Thanks. - Apostatize
Not only that, asking your friends to break it is both good QA and a good night :) - Innate
E
2

You could consider using a hash-based collection such as HashSet<T> or Dictionary<TKey, TValue> instead of the array. A hash lookup (HashSet<T>.Contains() or Dictionary.ContainsKey()) takes roughly constant time, whereas Contains() on an array scans every element. This matters most if you have a large list of profanities (not sure how many there are! :))
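A minimal sketch of the hash-based lookup idea, using HashSet<string> from System.Collections.Generic. The names BadWordLookup and FindFlagged are hypothetical, the words are the placeholder list from the regex answer, and the separator list and case-insensitive comparer are assumptions you would adjust to your input.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: splits a paragraph into words and looks each one up in a hash set.
public static class BadWordLookup
{
    // Placeholder words; OrdinalIgnoreCase makes lookups case-insensitive.
    private static readonly HashSet<string> BadWords = new HashSet<string>(
        new[] { "puppies", "kittens", "dolphins", "crabs" },
        StringComparer.OrdinalIgnoreCase);

    // Assumed separators: whitespace plus common punctuation.
    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', ',', '.', '!', '?', ';', ':' };

    // Returns the flagged words in the order they appear, with their original casing.
    public static List<string> FindFlagged(string paragraph)
    {
        var flagged = new List<string>();
        foreach (string word in paragraph.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
        {
            if (BadWords.Contains(word)) flagged.Add(word); // hash lookup, no array scan
        }
        return flagged;
    }
}
```

This keeps the loop from the question but replaces the array scan with a set lookup, so the cost per word no longer grows with the size of the list.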

Eloquent answered 9/7, 2010 at 3:30 Comment(0)
