I'll actually post a full answer to this, since I think it's worth making it obvious that you can use n-gram models as classifiers (in much the same way as you can use any probability model of your features as one).
Generative classifiers approximate the posterior of interest, p(class | test doc), as:
p(c|t) ∝ p(c) p(t|c)
where p(c) is the prior probability of c and p(t|c) is the likelihood. Classification picks the arg-max over all c. An n-gram language model, just like Naive Bayes or LDA or whatever generative model you like, can be construed as a probability model p(t|c) if you estimate a separate model for each class. As such, it can provide all the information required to do classification.
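To make that concrete, here's a minimal sketch in Python of a per-class bigram model used this way, with add-one smoothing. The class and method names are mine, invented for illustration, not any standard library's API:

```python
import math
from collections import Counter, defaultdict

class BigramClassifier:
    """Per-class bigram language model used as a generative classifier.

    A sketch of the idea above, not a reference implementation:
    estimate one bigram model per class, then classify by
    arg-max over c of log p(c) + log p(t|c).
    """

    def __init__(self):
        self.class_counts = Counter()         # documents per class, for the prior p(c)
        self.bigrams = defaultdict(Counter)   # class -> counts of (w1, w2) pairs
        self.contexts = defaultdict(Counter)  # class -> counts of contexts w1
        self.vocab = set()

    def train(self, docs):
        """docs: iterable of (token_list, class_label) pairs."""
        for tokens, c in docs:
            self.class_counts[c] += 1
            padded = ["<s>"] + tokens
            self.vocab.update(padded)
            for w1, w2 in zip(padded, padded[1:]):
                self.bigrams[c][(w1, w2)] += 1
                self.contexts[c][w1] += 1

    def log_likelihood(self, tokens, c):
        """log p(t|c) under class c's bigram model, add-one smoothed."""
        V = len(self.vocab)
        padded = ["<s>"] + tokens
        return sum(
            math.log((self.bigrams[c][(w1, w2)] + 1) / (self.contexts[c][w1] + V))
            for w1, w2 in zip(padded, padded[1:])
        )

    def classify(self, tokens):
        """Pick the arg-max of log p(c) + log p(t|c) over all classes."""
        n = sum(self.class_counts.values())
        return max(
            self.class_counts,
            key=lambda c: math.log(self.class_counts[c] / n)
                          + self.log_likelihood(tokens, c),
        )
```

Used on a toy corpus:

```python
clf = BigramClassifier()
clf.train([
    ("the cat sat on the mat".split(), "pets"),
    ("buy cheap pills online now".split(), "spam"),
])
print(clf.classify("the cat slept".split()))  # -> "pets"
```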
The question is whether the model is any use, of course. The major issue is that n-gram models tend to be built over billions of words of text, whereas classifiers are often trained on a few thousand. You can do complicated stuff like putting joint priors on the parameters of all the classes' models, or clamping hyperparameters to be equal (what these parameters are depends on how you do smoothing)... but it's still tricky.
An alternative is to build an n-gram model of characters (including spaces/punctuation if they turn out to be useful). This can be estimated much more reliably (26^3 parameters for a tri-gram model instead of ~20000^3), and can be very useful for author identification/genre classification/other forms of classification that have stylistic elements.
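Again as a hedged sketch (function names are mine), the character-level version just counts overlapping character trigrams per class, smoothing over the observed alphabet:

```python
import math
from collections import Counter, defaultdict

def train_char_trigram_models(docs):
    """docs: iterable of (text, class_label) pairs. Returns per-class
    trigram counts, bigram-context counts, and the observed alphabet
    (spaces/punctuation kept, as suggested above)."""
    tri = defaultdict(Counter)  # class -> counts of 3-character strings
    bi = defaultdict(Counter)   # class -> counts of 2-character contexts
    alphabet = set()
    for text, c in docs:
        alphabet.update(text)
        for i in range(len(text) - 2):
            tri[c][text[i:i + 3]] += 1
            bi[c][text[i:i + 2]] += 1
    return tri, bi, alphabet

def char_log_likelihood(text, c, tri, bi, alphabet):
    """log p(text | c) under class c's character trigram model,
    add-one smoothed over the observed alphabet."""
    A = len(alphabet)
    return sum(
        math.log((tri[c][text[i:i + 3]] + 1) / (bi[c][text[i:i + 2]] + A))
        for i in range(len(text) - 2)
    )
```

Classification then works exactly as before: add a log prior per class and take the arg-max. With only |alphabet|^3 parameters per class, the counts are dense enough to estimate from classifier-sized training sets.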