What is the best way to filter spam with JavaScript?

I have recently been inspired to write spam filters in JavaScript, Greasemonkey-style, for several websites I use that are prone to spam (especially in comments). In considering how to go about this, I realize I have several options, each with pros and cons. My goal for this question is to expand on the list I have created and, hopefully, determine the best way to do client-side spam filtering with JavaScript.

As for what makes a spam filter the "best", I would say these are the criteria:

  • Most accurate
  • Least vulnerable to attacks
  • Fastest
  • Most transparent

Also, please note that I am trying to filter content that already exists on websites that aren't mine, using Greasemonkey Userscripts. In other words, I can't prevent spam; I can only filter it.

Here is my attempt, so far, to compile a list of the various methods along with their shortcomings and benefits:


Rule-based filters:

What it does: "Grades" a message by assigning a point value to different criteria (e.g. all uppercase, all non-alphanumeric, etc.). Depending on the score, the message is discarded or kept.

Benefits:

  • Easy to implement
  • Mostly transparent

Shortcomings:

  • Transparent: it's usually easy to reverse-engineer the code to discover the rules, and thereby craft messages that won't be picked up
  • Hard to balance point values (false positives)
  • Can be slow; multiple rules have to be run against each message, often using regular expressions
  • In a client-side environment, server interaction or user interaction is required to update the rules
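
To make this option concrete, here is roughly what I have in mind; a minimal sketch where the rules, weights, and threshold are placeholders for illustration only:

    // Minimal rule-based scorer; rules, weights, and threshold are illustrative only.
    const RULES = [
      { name: 'mostly uppercase', weight: 2, test: t => /[A-Z]/.test(t) && !/[a-z]/.test(t) },
      { name: 'contains link',    weight: 3, test: t => /https?:\/\//i.test(t) },
      { name: 'repeated chars',   weight: 1, test: t => /(.)\1{5,}/.test(t) }
    ];

    const SPAM_THRESHOLD = 4; // balancing this is where the false positives come from

    function scoreMessage(text) {
      return RULES.reduce((score, rule) => score + (rule.test(text) ? rule.weight : 0), 0);
    }

    function isSpam(text) {
      return scoreMessage(text) >= SPAM_THRESHOLD;
    }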

Bayesian filtering:

What it does: Analyzes word frequency (or trigram frequency) and compares it against the data it has been trained with.

Benefits:

  • No need to craft rules
  • Fast (relatively)
  • Tougher to reverse engineer

Shortcomings:

  • Requires training to be effective
  • Trained data must still be accessible to JavaScript, usually in the form of human-readable JSON, XML, or a flat file
  • Data set can get pretty large
  • Poorly designed filters are easy to confuse with a good helping of common words to lower the spamacity rating
  • Words that haven't been seen before can't be accurately classified, sometimes resulting in misclassification of the entire message
  • In a client-side environment, server interaction or user interaction is required to update the rules
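
For reference, the core of the word-frequency approach is small. Here is a rough sketch, assuming the trained counts have already been shipped to the script as JSON; every number below is a placeholder:

    // Naive Bayes word classifier. `trainedData` would be shipped with the
    // script or fetched from a server; the counts below are placeholders.
    const trainedData = {
      spamWordCounts: { viagra: 50, free: 40, click: 35 },
      hamWordCounts:  { meeting: 30, thanks: 25, project: 20 },
      spamMessages: 100,
      hamMessages: 400
    };

    function tokenize(text) {
      return text.toLowerCase().match(/[a-z']+/g) || [];
    }

    function spamProbability(text, data) {
      const priorSpam = data.spamMessages / (data.spamMessages + data.hamMessages);
      const spamTotal = Object.values(data.spamWordCounts).reduce((a, b) => a + b, 0);
      const hamTotal  = Object.values(data.hamWordCounts).reduce((a, b) => a + b, 0);

      // Work in log space, with crude add-one smoothing so unseen words don't
      // zero out the whole message (one of the weaknesses listed above).
      let logSpam = Math.log(priorSpam);
      let logHam  = Math.log(1 - priorSpam);
      for (const word of tokenize(text)) {
        logSpam += Math.log(((data.spamWordCounts[word] || 0) + 1) / (spamTotal + 1));
        logHam  += Math.log(((data.hamWordCounts[word]  || 0) + 1) / (hamTotal  + 1));
      }
      return 1 / (1 + Math.exp(logHam - logSpam)); // back to a 0..1 probability
    }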

Bayesian filtering- server-side:

What it does: Applies Bayesian filtering server side by submitting each message to a remote server for analysis.

Benefits:

  • All the benefits of regular Bayesian filtering
  • Training data is not revealed to users/reverse engineers

Shortcomings:

  • Heavy traffic
  • Still vulnerable to uncommon words
  • Still vulnerable to adding common words to decrease spamacity
  • The service itself may be abused
  • To train the classifier, it may be desirable to allow users to submit spam samples for training. Attackers may abuse this service
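
On the client, this option boils down to one cross-origin request per message (or per batch). A sketch, assuming a hypothetical /classify endpoint on a server I control:

    // Submit a comment to a hypothetical /classify endpoint on my own server.
    // GM_xmlhttpRequest is used because it is exempt from the same-origin policy.
    function classifyRemotely(text, callback) {
      GM_xmlhttpRequest({
        method: 'POST',
        url: 'https://example.com/classify',            // hypothetical endpoint
        headers: { 'Content-Type': 'application/json' },
        data: JSON.stringify({ message: text }),
        onload: response => {
          const result = JSON.parse(response.responseText); // e.g. { spam: true }
          callback(result.spam);
        },
        onerror: () => callback(false) // on failure, fail open and keep the comment
      });
    }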

Blacklisting:

What it does: Applies a set of criteria to a message or some attribute of it. If one or more (or a specific number of) criteria match, the message is rejected. A lot like rule-based filtering, so see its description for details.
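
As an example of the simplest useful blacklist, checking link destinations against a domain list might look like this (the listed domains are made up):

    // Reject a message if any link in it points at a blacklisted domain.
    const BLACKLISTED_DOMAINS = ['spam-example.ru', 'pills-example.cn']; // made up

    function containsBlacklistedLink(text) {
      const urls = text.match(/https?:\/\/[^\s"'<>]+/gi) || [];
      return urls.some(url => {
        try {
          const host = new URL(url).hostname;
          return BLACKLISTED_DOMAINS.some(d => host === d || host.endsWith('.' + d));
        } catch (e) {
          return false; // ignore unparsable URLs
        }
      });
    }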

CAPTCHAs, and the like:

Not feasible for this type of application. I am trying to apply these methods to sites that already exist. Greasemonkey will be used to do this; I can't start requiring CAPTCHAs in places where they didn't exist before someone installed my script.


Can anyone help me fill in the blanks? Thank you!

Niobic, asked 6/10/2010 (3 comments):
So your goal is to add spam checking in people's browsers, for sites which don't have enough protection built in, so that you can dynamically remove the comments from the page? Interesting, although I'm not sure how many sites would benefit from it. – Dogtired
Really what I am trying to do is create a platform that allows just that. That comes first. Then I'll apply it to different sites. Facebook is one of my primary goals, as most of the comments are spam. – Brillatsavarin
Spam filters are trivial to add on the backend. If the administrators for the site in question are too lazy to do that, you're better off not using the site to begin with. Why reward a crappy site by doing their work for them? It's strange you mention Facebook, as I've never seen spam there. You might just want to unfriend the spammers... – Finish

There is no "best" way, especially for all users or all situations.

Keep it simple:

  1. Have the GM script initially hide all comments that contain links and maybe universally bad words (F*ck, Presbyterian, etc.). ;)
  2. Then the script contacts your server and lets the server judge each comment by X criteria (more on that below).
  3. Show or hide comments based on the server response. In the event of a timeout, show or hide based on a user preference setting ("What to do when the filter server is down? Show/hide comments with links").
  4. That's it for the GM script (a rough sketch follows this list); the rest is handled by the server.
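
Roughly, the script side could look like the sketch below; the comment selector and server URL are placeholders you'd adapt per site:

    // Rough outline of the GM script side; selector and URL are placeholders.
    const COMMENT_SELECTOR = '.comment';                // placeholder
    const FILTER_SERVER = 'https://example.com/check';  // placeholder

    const comments = Array.from(document.querySelectorAll(COMMENT_SELECTOR));

    // 1. Provisionally hide anything containing a link.
    comments.forEach(c => {
      if (c.querySelector('a')) c.style.display = 'none';
    });

    // 2. Ask the server to judge every comment in one batch.
    GM_xmlhttpRequest({
      method: 'POST',
      url: FILTER_SERVER,
      headers: { 'Content-Type': 'application/json' },
      data: JSON.stringify({ texts: comments.map(c => c.textContent) }),
      onload: response => {
        // 3. Show or hide based on the server's verdicts, e.g. [true, false, ...]
        const verdicts = JSON.parse(response.responseText);
        comments.forEach((c, i) => { c.style.display = verdicts[i] ? 'none' : ''; });
      },
      onerror: () => {
        // Server down: fall back to the user's preference for comments with links.
        const showOnFailure = GM_getValue('showLinkCommentsOnFailure', true);
        comments.forEach(c => {
          if (c.querySelector('a')) c.style.display = showOnFailure ? '' : 'none';
        });
      }
    });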

As for the actual server/filtering criteria...
Most important: do not assume that you can guess what a user will want filtered! This will vary wildly from person to person, and even from mood to mood.

Set up the server to use a combination of bad words, bad link destinations (.ru and .cn domains, for example), and public spam-filtering services.
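
Server side, even a crude combination check goes a long way. A sketch (Node-style; the word list, TLD list, and the external-service step are placeholders for whatever you actually configure):

    // Server-side sketch (Node.js). Word list and TLD list are placeholders.
    const BAD_WORDS = ['viagra', 'casino'];   // example entries only
    const BAD_TLDS  = ['.ru', '.cn'];         // as suggested above

    function looksLikeSpam(text) {
      const lower = text.toLowerCase();
      if (BAD_WORDS.some(w => lower.includes(w))) return true;

      const links = text.match(/https?:\/\/[^\s]+/gi) || [];
      if (links.some(url => BAD_TLDS.some(tld => new URL(url).hostname.endsWith(tld)))) {
        return true;
      }

      // A real setup would also consult a public spam-filtering service here
      // (Akismet is one well-known example), via whatever API it provides.
      return false;
    }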

The most important thing is to offer users some way to choose, and ideally adjust, what is applied for them.
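
In the userscript, persisting per-user choices is straightforward with GM_getValue/GM_setValue; for example (the key name is made up):

    // Persist a per-user filter preference (key name is made up).
    const hideLinkComments = GM_getValue('hideLinkComments', true);

    // One low-effort way to let the user adjust it: a Greasemonkey menu command.
    GM_registerMenuCommand('Toggle hiding of comments with links', () => {
      GM_setValue('hideLinkComments', !GM_getValue('hideLinkComments', true));
      location.reload();
    });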

Paraffinic, answered 6/10/2010 (1 comment):
"There are no bad words" -- George CarlinEmergent
