I did some questions about text-mining a week ago, but I was a bit confused and still, but now I know wgat I want to do.
The situation: I have a lot of download pages with HTML content. Some of then can bean be a text from a blog, for example. They are not structured and came from different sites.
What I want to do: I will split all the words with whitespace and I want to classify each one or a group of ones in some pre-defined itens like names, numbers, phone, email, url, date, money, temperature, etc.
What I know: I know the concepts/heard about about Natural Language Processing, Named Entity Reconigzer, POSTagging, NayveBayesian, HMM, training and a lot of things to do classification, etc., but there is some different NLP libraries with differents classifiers and ways to do this and I don't know what use or what do.
WHAT I NEED: I need some code example from a classifier, NLP, whatever, that can classify each word from a text separetely, and not a entire text. Something like this:
//This is pseudo-code for what I want, and not a implementation
classifier.trainFromFile("file-with-train-words.txt");
words = text.split(" ");
for(String word: words){
classifiedWord = classifier.classify(word);
System.out.println(classifiedWord.getType());
}
Somebody can help me? I'm confused with various APIs, classifiers and algorithms.