text categorization classifiers
Does anybody know of good open-source text-categorization models? I know about Stanford Classifier, Weka, Mallet, etc. but all of them require training.

I need to classify news articles into Sports/Politics/Health/Gaming/etc. Is there any pre-trained models out there?

Alchemy, OpenCalais, etc. are not options. I need open-source tools (preferably in Java).

Overwork asked 7/3, 2013 at 15:16 Comment(0)

Having a pre-trained model assumes that the corpus used to train it comes from the exact same domain as the documents you are trying to classify. Generally this is not going to give you the results you want, because you don't have the original corpus. Machine learning is not static: when you train a classifier, you need to update the model as new features/information become available.

Take, for example, classifying news articles into the domains of Sports/Politics/Health/Gaming/etc., as you want to do.

First, what language? Are we talking about English only? How was the original corpus labeled? And the biggest unknown is the etc. category.

Training your own classifier is really easy. If you are classifying text, MALLET is the best choice. You can be up and running in less than 10 minutes, and you can add MALLET to your own application in under an hour.

If you want to classify news articles, there are a lot of open-source corpora you can use as a base to start training. I would start with Reuters-21578 or RCV1.
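To make the train-then-classify loop concrete, here is a toy multinomial Naive Bayes in plain Java. This is only an illustration of the idea; MALLET's own trainers (and Weka's) are the real, properly tuned implementations you would use in practice:

```java
import java.util.*;

// Toy multinomial Naive Bayes over bag-of-words features.
// Illustrative only: MALLET/Weka provide real, tuned implementations.
public class ToyNaiveBayes {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocab = new HashSet<>();
    private int totalDocs = 0;

    public void train(String label, String text) {
        totalDocs++;
        docCounts.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String w : text.toLowerCase().split("\\s+")) {
            counts.merge(w, 1, Integer::sum);
            vocab.add(w);
        }
    }

    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            // log prior + sum of log likelihoods with add-one smoothing
            double score = Math.log(docCounts.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int totalWords = counts.values().stream().mapToInt(Integer::intValue).sum();
            for (String w : text.toLowerCase().split("\\s+")) {
                int c = counts.getOrDefault(w, 0);
                score += Math.log((c + 1.0) / (totalWords + vocab.size()));
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }

    public static void main(String[] args) {
        ToyNaiveBayes nb = new ToyNaiveBayes();
        nb.train("sports", "the team won the match in overtime");
        nb.train("politics", "the senate passed the budget bill");
        System.out.println(nb.classify("match result overtime win")); // prints "sports"
    }
}
```

With a real corpus like Reuters-21578 you would feed thousands of labeled documents through the same loop instead of two toy sentences.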

Alita answered 12/3, 2013 at 22:58 Comment(2)
Thanks a lot, Shane, for your answer. I will definitely look into the data sets you mentioned! But yes, I am working only on English data and general domains of news articles (similar to those classified by Alchemy and OpenCalais). I will give MALLET a shot. – Overwork
Great, let me know if you have any problems! – Alita

There are a lot of classifiers out there, depending on your needs. First, I think you may want to narrow down what you want to do with the classifiers.

Training is part of the classification process, so I don't think you will find many pre-trained classifiers out there. Besides, training is almost always the easy part of classification.

That being said, there are actually a lot of resources you can look at. I can't pretend to take credit for this, but here is one example:

Weka - a collection of machine learning algorithms for data mining. It is one of the most popular text classification frameworks. It contains implementations of a wide variety of algorithms, including Naive Bayes and Support Vector Machines (SVM, listed under SMO) [Note: other commonly used non-Java SVM implementations are SVM-Light, LibSVM, and SVMTorch]. A related project is Kea (Keyphrase Extraction Algorithm), an algorithm for extracting keyphrases from text documents.

Apache Lucene Mahout - an incubator project to create highly scalable, distributed implementations of common machine learning algorithms on top of the Hadoop map-reduce framework.

Source: http://www.searchenginecaffe.com/2007/03/java-open-source-text-mining-and.html

Johanson answered 14/3, 2013 at 7:38 Comment(1)
Thanks, Hearty, for your answer. – Overwork

What you mean by classification is very important.

Classification is a supervised task, which requires a pre-labeled corpus. Starting from the labeled corpus, you create a model using one of several methods and approaches, and finally you can classify an unlabeled test corpus with that model. If this is the case, you can use a multi-class classifier, which is generally built by combining binary classifiers (for example, as a binary tree of binary decisions). The state-of-the-art approach for this kind of task is the SVM, a branch of machine learning. Two of the best SVM classifiers are LibSVM and SVMlight; both are open source, easy to use, and include multi-class classification tools. Finally, you have to do a literature survey to understand what else you need to obtain good results, because those classifiers are not enough by themselves. You have to manipulate/pre-process your corpus to extract the information-bearing parts (e.g. unigrams) and exclude the noisy parts. In general, you most probably have a long way to go, but NLP is a very interesting topic and worthwhile to work on.
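LibSVM and SVMlight handle the multi-class combination internally, but to make the idea concrete, here is a minimal plain-Java sketch of the one-vs-rest scheme: one binary scorer per class, and the class whose scorer is most confident wins. The scorer interface and labels here are hypothetical, just for illustration:

```java
import java.util.*;
import java.util.function.ToDoubleFunction;

// Sketch of one-vs-rest multi-class classification built from binary scorers.
// Each scorer returns a confidence that the input belongs to its own class;
// the class with the highest score wins. Real SVM packages do this internally.
public class OneVsRest {
    private final Map<String, ToDoubleFunction<double[]>> scorers = new LinkedHashMap<>();

    public void addClass(String label, ToDoubleFunction<double[]> scorer) {
        scorers.put(label, scorer);
    }

    public String classify(double[] features) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, ToDoubleFunction<double[]>> e : scorers.entrySet()) {
            double s = e.getValue().applyAsDouble(features);
            if (s > bestScore) { bestScore = s; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        OneVsRest ovr = new OneVsRest();
        // Stand-in linear decision functions; a real system would train one SVM per class.
        ovr.addClass("sports", x -> x[0]);
        ovr.addClass("politics", x -> x[1]);
        System.out.println(ovr.classify(new double[]{0.2, 0.9})); // prints "politics"
    }
}
```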

However, if what you mean by classification is clustering, then the problem is more complicated. Clustering is an unsupervised task, which means you give the program no information about which example belongs to which group/topic/class. There is also academic work on hybrid semi-supervised approaches, but they diverge a bit from the real purpose of the clustering problem. The pre-processing you need while manipulating your corpus is similar in nature to what you have to do for classification, so I will not mention it again. To do clustering, there are several approaches you can follow. First, you can use the LDA (Latent Dirichlet Allocation) method to reduce the dimensionality (the number of dimensions of your feature space) of your corpus, which will contribute to efficiency and to the information gained from features. Alongside or after LDA, you can use hierarchical clustering or other methods such as k-means to cluster your unlabeled corpus. You can use Gensim or scikit-learn as open-source tools for clustering; both are powerful, well-documented, and easy to use.
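The k-means loop mentioned above (assign each point to its nearest center, then re-estimate the centers, repeat) can be sketched in a few lines. This toy version works on 1-D points just to show the mechanics; real document clustering runs on high-dimensional feature vectors through tools like Gensim or scikit-learn:

```java
import java.util.*;

// Toy 1-D k-means illustrating the assign/re-estimate loop.
// Real corpora use high-dimensional vectors and library implementations.
public class ToyKMeans {
    public static double[] cluster(double[] points, double[] initialCenters, int iterations) {
        double[] centers = initialCenters.clone();
        for (int it = 0; it < iterations; it++) {
            double[] sums = new double[centers.length];
            int[] counts = new int[centers.length];
            // Assignment step: each point goes to its nearest center.
            for (double p : points) {
                int nearest = 0;
                for (int c = 1; c < centers.length; c++)
                    if (Math.abs(p - centers[c]) < Math.abs(p - centers[nearest])) nearest = c;
                sums[nearest] += p;
                counts[nearest]++;
            }
            // Update step: each center moves to the mean of its assigned points.
            for (int c = 0; c < centers.length; c++)
                if (counts[c] > 0) centers[c] = sums[c] / counts[c];
        }
        return centers;
    }

    public static void main(String[] args) {
        double[] centers = cluster(new double[]{1, 2, 3, 10, 11, 12}, new double[]{0, 5}, 10);
        System.out.println(Arrays.toString(centers)); // prints [2.0, 11.0]
    }
}
```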

In all cases, do a lot of academic reading and try to understand the theory behind these tasks and problems. That way, you can come up with innovative and efficient solutions for what you are specifically dealing with, because problems in NLP are generally corpus-dependent and you are generally on your own with your specific problem. It is very difficult to find generic, ready-to-use solutions, and I do not recommend relying on them either.

I may have over-answered your question; sorry for the irrelevant parts.

Good luck =)

Selfmoving answered 5/4, 2013 at 15:16 Comment(2)
Great answer! Thanks a lot. I am well aware of how classification works. I was looking for a supervised approach, but with pre-trained models. – Overwork
The model is the primary outcome of your work in classification; everything else is in service of creating a good model that fits your needs. In that sense, trying to find a ready-to-use model is beside the point and most probably impossible. This is mainly because the task you are trying to achieve, the corpus you are working on, the efficiency you need, and all other aspects will be unique to you and your case; it is thus a matter of pure luck to find a model that satisfies your goals. My advice is to get your hands dirty as soon as possible. Good luck =) – Selfmoving

There is a long list of pre-trained models for OpenNLP:

http://opennlp.sourceforge.net/models-1.5/

Justis answered 7/3, 2013 at 15:48 Comment(2)
Thanks a lot, but none of them do text categorization. – Overwork
Not sure if this would suit your need: cwiki.apache.org/MAHOUT/bayesian.html; also quoting an example that comes with their source: cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html – Ardene