Classify data using Apache Mahout
C

2

11

I am trying to solve a simple classification problem.

The Problem:
I have a set of text documents and I have to categorize them based on their content.

Solution using Mahout:
I understand that I have to convert the input to a sequence file in order to generate the model, and I was able to do this. Now, how do I categorize my test data? The 20News example only tests the model for correctness, but I want to do the actual classification.
I am not sure whether I need to write code or can use existing classes to classify the test set.

Choctaw answered 9/11, 2010 at 19:29 Comment(0)
J
3

I hate to plug my own work, but we put an entire section about classification into Mahout in Action. It covers theory, code examples, case study practice, and even an entire server farm implementation.

You can get the pre-release version at http://www.manning.com/owen/

June answered 31/3, 2011 at 18:0 Comment(2)
IMO, the sections on classification in the book could be improved. They are wordy, unclear and often non sequitur. There could be more Java coding examples and fewer bash shell examples. The classification section would be better if it were written more like the introductory chapters: show the format for classification files, how to read them in, how to load them into your classifier, and, once it is trained, how to use the classifier to classify a new sample. - Sower
I wish Mahout had more and better documentation. People who are experts at machine learning have a difficult time understanding the structure of the processing pipeline and the code architecture. Even the javadocs use inappropriate terminology (setGramSize should be setNGramSize); small semantic differences make a HUGE difference in understanding concepts and code. - Sower
T
3

I am having a similar problem.

Running

bin/mahout org.apache.mahout.classifier.Classify --path <PATH TO MODEL> --classify <PATH TO TEXT FILE TO BE CLASSIFIED> --encoding UTF-8 --analyzer org.apache.mahout.vectorizer.DefaultAnalyzer --defaultCat unknown --gramSize 1 --classifierType bayes --dataSource hdfs

will classify a text file based on the model.

This might get you a bit further forward, but I'm guessing that, like me, you want to classify a whole load of documents and you want the output in a useful format.

You might have to write a bit of Java to do this. Someone has an example that looks like it will do what I want at https://bitbucket.org/jaganadhg/blog/src/tip/bck9/java/src/org/bc/kl/ClassifierDemo.java
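
One way to get from the single-file command above to a whole directory of documents, without touching Mahout's Java API, is to wrap that command in a small driver. The sketch below is only that: `BatchClassify`, `buildCommand` and `classify` are hypothetical names, the flags are copied verbatim from the command above, and it assumes each document is a separate file that the command can read. It has not been checked against a particular Mahout release.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Runs the Mahout Classify command quoted above once per file in a directory. */
class BatchClassify {

    /** Builds the argument list for one document, reusing only the flags quoted above. */
    static List<String> buildCommand(String mahoutScript, String modelPath, Path document) {
        return new ArrayList<>(Arrays.asList(
            mahoutScript, "org.apache.mahout.classifier.Classify",
            "--path", modelPath,
            "--classify", document.toString(),
            "--encoding", "UTF-8",
            "--analyzer", "org.apache.mahout.vectorizer.DefaultAnalyzer",
            "--defaultCat", "unknown",
            "--gramSize", "1",
            "--classifierType", "bayes",
            "--dataSource", "hdfs"));
    }

    /** Runs the command for one document and returns the process exit code. */
    static int classify(String mahoutScript, String modelPath, Path document)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(mahoutScript, modelPath, document));
        pb.inheritIO(); // Mahout's own output goes straight to this console
        return pb.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.err.println("usage: BatchClassify <mahout script> <model path> <document dir>");
            return;
        }
        // Assumption: every regular file in the directory is one document to classify.
        try (DirectoryStream<Path> docs = Files.newDirectoryStream(Paths.get(args[2]))) {
            for (Path doc : docs) {
                if (!Files.isRegularFile(doc)) {
                    continue;
                }
                System.out.println("== " + doc.getFileName());
                int exit = classify(args[0], args[1], doc);
                if (exit != 0) {
                    System.err.println("mahout exited with " + exit + " for " + doc);
                }
            }
        }
    }
}
```

It could be run as, for example, `java BatchClassify bin/mahout <PATH TO MODEL> <DIRECTORY OF TEXT FILES>`. Starting one Mahout process per document is slow for large sets, and the sketch makes no attempt to parse Mahout's output into a more useful format, so calling the classifier classes directly from Java is probably still the better route for anything beyond a quick experiment.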

Tula answered 25/2, 2011 at 8:35 Comment(0)
