I am getting familiar with NLTK and text categorization through Jacob Perkins's book "Python Text Processing with NLTK 2.0 Cookbook".
Each document in my corpus is a single paragraph of text, so each document occupies one line of a file rather than a separate file. There are about 2 million such paragraphs/lines, and therefore about 2 million machine learning instances.
Each line in my file is a paragraph of text (a combination of domain title, description, and keywords) that undergoes feature extraction (tokenization, etc.) to become an instance for a machine learning algorithm.
I have two such files, one with all the positives and one with all the negatives.
How can I load them into a CategorizedCorpusReader? Is it possible?
I tried other solutions before, such as scikit-learn, and finally picked NLTK hoping for an easier starting point that would still produce a result.
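Since each line is already one complete document, one option is to skip the corpus reader entirely and stream the two files yourself, emitting one labeled instance per line. Below is a minimal sketch of that idea; the file names (`positives.txt`, `negatives.txt`) are placeholders, and the whitespace `split()` stands in for whatever NLTK tokenizer you actually use:

```python
import os
import tempfile

def load_instances(path, label):
    """Read a line-per-document file and yield (tokens, label) pairs.

    split() is a stand-in for a real tokenizer (e.g. NLTK's
    word_tokenize); swap it in once you choose one.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield (line.split(), label)

# Demonstration with two tiny temporary files standing in for the
# real 2-million-line positive/negative corpora.
tmpdir = tempfile.mkdtemp()
pos_path = os.path.join(tmpdir, "positives.txt")  # hypothetical name
neg_path = os.path.join(tmpdir, "negatives.txt")  # hypothetical name
with open(pos_path, "w", encoding="utf-8") as f:
    f.write("great domain cheap hosting\nbest keywords ever\n")
with open(neg_path, "w", encoding="utf-8") as f:
    f.write("spam site bad description\n")

instances = (list(load_instances(pos_path, "pos"))
             + list(load_instances(neg_path, "neg")))
print(len(instances))   # 3
print(instances[0])     # (['great', 'domain', 'cheap', 'hosting'], 'pos')
```

Because the loaders are generators, the full 2 million lines never need to sit in memory at once. If you do want a proper NLTK reader instead, CategorizedPlaintextCorpusReader accepts a `para_block_reader` argument; passing `nltk.corpus.reader.util.read_line_block` should make it treat each line as one paragraph/document, with `cat_map` mapping each file to its category.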
`corpus.sents()[0]`, `corpus.paras()[0]`, `corpus.words()[0]`
because these methods will not give you the first document. – Saviour