Testing the NLTK classifier on a specific file

The following code runs the Naive Bayes movie review classifier and generates a list of the most informative features.

Note: the **movie_reviews** corpus ships with NLTK.

import string
from itertools import chain

import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

stop = stopwords.words('english')

documents = [([w for w in movie_reviews.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in movie_reviews.fileids()]


word_features = FreqDist(chain(*[i for i,j in documents]))
# most_common() keeps the intended by-count ordering and works on Python 3,
# where dict keys() can no longer be sliced.
word_features = [w for w, _ in word_features.most_common(100)]

numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag  in documents[numtrain:]]

classifier = NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)

Link to the code from alvas.

How can I test the classifier on a specific file?

Please let me know if my question is ambiguous or wrong.

Leroi answered 27/3, 2015 at 13:34 Comment(0)

First, read these answers carefully; they contain parts of the answer you need and also briefly explain what the classifier does and how it works in NLTK:


Testing classifier on annotated data

Now to answer your question. We assume that your question is a follow-up to this question: Using my own corpus instead of movie_reviews corpus for Classification in NLTK

If your test text is structured the same way as the movie_reviews corpus, then you can simply read the test data as you would the training data:

Just in case the explanation of the code is unclear, here's a walkthrough:

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

traindir = '/home/alvas/my_movie_reviews'
mr = CategorizedPlaintextCorpusReader(traindir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')

The lines above read a directory my_movie_reviews with the following structure:

\my_movie_reviews
    \pos
        123.txt
        234.txt
    \neg
        456.txt
        789.txt
    README
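
You can sanity-check that the reader picked up this layout; the expected output below uses the hypothetical file names from the listing above:

# Quick sanity check of the reader.
print(mr.categories())          # ['neg', 'pos']
print(mr.fileids('pos')[:2])    # e.g. ['pos/123.txt', 'pos/234.txt']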

Then the next line extracts the documents together with their pos/neg tags, which come from the directory structure.

documents = [([w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

Here's the explanation for the above line:

# This extracts the pos/neg tag from each file id.
labels = [i.split('/')[0] for i in mr.fileids()]
# Reads the words from the corpus through the CategorizedPlaintextCorpusReader object.
words = [w for w in mr.words(i)]
# Removes the stopwords.
words = [w for w in mr.words(i) if w.lower() not in stop]
# Removes the punctuation.
words = [w for w in mr.words(i) if w not in string.punctuation]
# Removes the stopwords and punctuation.
words = [w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation]
# Removes the stopwords and punctuation, and pairs each word list with its pos/neg label.
documents = [([w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

The SAME process should be applied when you read the test data!!!

Now to the feature processing:

The following lines extract the top 100 features for the classifier:

# Extracts the word features and puts them into a FreqDist
# object, which records the no. of times each unique word occurs.
word_features = FreqDist(chain(*[i for i,j in documents]))
# Cuts the FreqDist down to the top 100 words in terms of their counts.
# (most_common() works on Python 3, where keys() can't be sliced.)
word_features = [w for w, _ in word_features.most_common(100)]
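
If you're unsure what most_common() returns, here's a tiny illustration with made-up tokens (not from the corpus):

fd = FreqDist(['film', 'good', 'film', 'bad', 'film', 'good'])
print(fd.most_common(2))                  # [('film', 3), ('good', 2)]
print([w for w, _ in fd.most_common(2)])  # ['film', 'good']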

Next, process the documents into a classifiable format:

# Computes the 90/10 split point between training and testing data.
numtrain = int(len(documents) * 90 / 100)
# Processes the documents into feature sets for training.
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
# Processes the documents into feature sets for testing.
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]

Now to explain that long list comprehension for train_set and test_set:

# Take the first `numtrain` no. of documents
# as training documents
train_docs = documents[:numtrain]
# Takes the rest of the documents as test documents.
test_docs = documents[numtrain:]
# These extract the feature sets for the classifier
# please look at the full explanation on https://mcmap.net/q/555044/-nltk-naivebayesclassifier-training-for-sentiment-analysis/
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag  in train_docs]
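
If the dictionary comprehension is hard to read, it is equivalent to this explicit helper; document_features is a name introduced here purely for illustration:

# Equivalent, spelled out as a helper function.
# `document_features` is an illustrative name, not part of the original code.
def document_features(tokens):
    # True if the feature word occurs in this document, else False.
    return {w: (w in tokens) for w in word_features}

train_set = [(document_features(tokens), tag) for tokens, tag in train_docs]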

You need to run the SAME feature extraction over the test documents too!!!

So here's how you can read the test data:

stop = stopwords.words('english')

# Reads the training data.
traindir = '/home/alvas/my_movie_reviews'
mr = CategorizedPlaintextCorpusReader(traindir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')

# Converts training data into tuples of [(words,label), ...]
documents = [([w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

# Now do the same for the testing data.
testdir = '/home/alvas/test_reviews'
mr_test = CategorizedPlaintextCorpusReader(testdir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
# Converts testing data into tuples of [(words,label), ...]
test_documents = [([w for w in mr_test.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr_test.fileids()]

Then continue with the processing steps described above, and simply do this to get the label for the test document as @yvespeirsman answered:

#### FOR TRAINING DATA ####
import string
from itertools import chain

from nltk.corpus import stopwords
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier

stop = stopwords.words('english')

# Reads the training data.
traindir = '/home/alvas/my_movie_reviews'
mr = CategorizedPlaintextCorpusReader(traindir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')

# Converts training data into tuples of [(words,label), ...]
documents = [([w for w in mr.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]
# Extract training features.
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = [w for w, _ in word_features.most_common(100)]
# Assuming that you're training on the full data set,
# since your test set is separate.
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents]

#### TRAINS THE CLASSIFIER ####
# Train the classifier.
classifier = NaiveBayesClassifier.train(train_set)

#### FOR TESTING DATA ####
# Now do the same reading and processing for the testing data.
testdir = '/home/alvas/test_reviews'
mr_test = CategorizedPlaintextCorpusReader(testdir, r'(?!\.).*\.txt', cat_pattern=r'(neg|pos)/.*', encoding='ascii')
# Converts testing data into tuples of [(words,label), ...]
test_documents = [([w for w in mr_test.words(i) if w.lower() not in stop and w not in string.punctuation], i.split('/')[0]) for i in mr_test.fileids()]
# Converts the test documents into feature sets.
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in test_documents]

#### Evaluate the classifier ####
for doc, gold_label in test_set:
    tagged_label = classifier.classify(doc)
    if tagged_label == gold_label:
        print("Woohoo, correct")
    else:
        print("Boohoo, wrong")

If the above code and explanation make no sense to you, then you MUST read this tutorial before proceeding: http://www.nltk.org/howto/classify.html


Now let's say you have no annotations in your test data, i.e. your test files are not in a movie_reviews-style directory structure but are just plain text files:

\test_movie_reviews
    1.txt
    2.txt

Then there's no point in reading it into a categorized corpus; you can simply read and tag the documents, e.g.:

import os
from nltk import word_tokenize

for infile in os.listdir('test_movie_reviews'):
    for line in open(os.path.join('test_movie_reviews', infile)):
        featurized_doc = {w: (w in word_tokenize(line.lower())) for w in word_features}
        tagged_label = classifier.classify(featurized_doc)

BUT you CANNOT evaluate the results without annotations, so there is no gold label to compare the predicted tag against in an if-else. Also, you need to tokenize the text yourself (as in the snippet above) if you're not using the CategorizedPlaintextCorpusReader.

If you just want to tag a plaintext file test.txt:

import string
from itertools import chain
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk import word_tokenize

stop = stopwords.words('english')

# Extracts the documents.
documents = [([w for w in movie_reviews.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in movie_reviews.fileids()]
# Extract the features.
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = [w for w, _ in word_features.most_common(100)]
# Converts documents to features.
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents]
# Train the classifier.
classifier = NaiveBayesClassifier.train(train_set)

# Tag the test file.
with open('test.txt', 'r') as fin:
    for test_sentence in fin:
        # Tokenize the line.
        doc = word_tokenize(test_sentence.lower())
        featurized_doc = {i:(i in doc) for i in word_features}
        tagged_label = classifier.classify(featurized_doc)
        print(tagged_label)
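
If you also want the classifier's confidence rather than just the label, prob_classify() returns a probability distribution; a small sketch using the variables from the block above:

# Inspects the classifier's confidence for one featurized document.
dist = classifier.prob_classify(featurized_doc)
print(dist.max())                          # same label as classifier.classify(...)
print(dist.prob('pos'), dist.prob('neg'))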

Once again, please don't just copy and paste the solution; try to understand why and how it works.

Thermion answered 29/3, 2015 at 11:10 Comment(8)
Thank you for your complete explanation, and I'll try to understand it. But I often encounter a wrong result. I mean it should be pos but the program shows neg, and I don't know the reason. – Leroi
There are many reasons and it's not perfect, maybe (i) the data is insufficient, (ii) the features are not good enough, (iii) classifier choice, etc. Do take this course coursera.org/course/ml for more info. And if you can, I strongly encourage you to attend lxmls.it.pt/2015 – Thermion
I am confused. First I take a file from nltk/movie_reviews/neg/cv081.txt. Then I test the file with your program that prints Woohoo, correct or Boohoo, wrong. When I put the file in /home/neg/cv081.txt for testing, I get Boohoo, wrong as output! When I put the file in /home/pos/cv081.txt, I get Woohoo, correct as output! Then I test the same file with the print(tagged_label) program and it gives me many negs. I don't know exactly how that program works; it gives me many neg even for a pos file! How can I evaluate these neg and pos outputs? – Leroi
You evaluate the output by finding out how often it is correct. Classifiers learn which features to pay attention to, and how to combine them in making their decision. There's no logical rule; it's all statistics and weights. Your file cv081.txt comes out as pos with your feature set -- what else is there to understand? – Porshaport
Go through the machine learning course at the Coursera link and you will understand why and how the classifier works. I started out using them as black boxes, and once you understand how they produce the annotations, it's easier to code and appreciate their elegance. – Thermion
The first case is when you have annotated data to test on; the second is when you have none. If you need us to validate the code's output, can you post the full dataset somewhere so that we can test it (when we're free)? – Thermion
Sorry, I deleted my previous comment. It said: the first case (Woohoo, correct or wrong) delivers pos but the second case (print(tagged_label)) gives neg as output. Both programs have the same feature set and the same classifier... I use the same dataset as in movie_reviews. BTW, what do you mean by annotated data? Sorry for this simple question. – Leroi
There's nothing wrong with the code, especially if it's just printing out the labels... if you're testing on files from /movie_reviews/neg, of course all instances will be neg... PLEASE READ nltk.org/book/ch06.html to understand how classification works... it explains why you get male/female labels, and it's similar to pos/neg. – Thermion

You can test on one file with classifier.classify(). This method takes as its input a dictionary with the features as its keys, and True or False as their values, depending on whether the feature occurs in the document or not. It outputs the most probable label for the file, according to the classifier. You can then compare this label with the correct label for the file to see whether the classification is correct.
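
For example, to classify a raw string rather than an item from test_set, you would build such a feature dictionary yourself. A minimal sketch; the sentence is made up, and word_features and classifier are assumed to come from the code in the question:

from nltk import word_tokenize

# Featurizes a raw sentence (made-up example text) and classifies it.
tokens = word_tokenize("This movie was surprisingly good".lower())
features = {w: (w in tokens) for w in word_features}
print(classifier.classify(features))   # prints 'pos' or 'neg'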

In your training and test sets, the feature dictionary is always the first item in each tuple, and the label is the second item.

Thus, you can classify the first document in the test set like so:

(my_document, my_label) = test_set[0]
if classifier.classify(my_document) == my_label:
    print("correct!")
else:
    print("incorrect!")
Garonne answered 29/3, 2015 at 3:19 Comment(4)
Could you please show me a complete example, and if possible make your example follow the example in my question? I'm so new to Python. Could you please tell me why you write 0 in test_set[0]? – Leroi
This is a complete example: if you paste the code immediately after the code in your question, it will work. The 0 simply takes the first document in your test set (the first item in a list has index 0). – Garonne
Thank you so much. Is there a way to write the name_of_file instead of 0 in test_set[0]? I don't know which file test_set points to exactly, since we have the two folders pos and neg and every folder has its files. I ask this because the most informative word was bad (the result of my example in the question). The first file has more than a hundred occurrences of the word bad, but the program shows incorrect in the output. Where is my mistake? – Leroi
First, test_set doesn't contain the filenames, so if you want to use those to identify a file, one way would be to read the file directly and pass it to the classifier as the feature dictionary I described above. Second, your current classifier uses binary features. It simply checks whether a word occurs in a document or not, but ignores the frequency with which the word occurs. That's probably why it misclassifies a file with many occurrences of bad. – Garonne
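
If you want the classifier to see word frequencies rather than mere presence, you could swap in count-valued features. A minimal sketch; count_features is a name introduced here for illustration, and note that NLTK's NaiveBayesClassifier treats each distinct count as a discrete feature value, so bucketing the counts (e.g. 0, 1, 'many') is a common refinement:

from collections import Counter

# Count-based features instead of binary ones.
# `count_features` is an illustrative helper, not part of the answers above.
def count_features(tokens):
    counts = Counter(tokens)
    return {w: counts[w] for w in word_features}

train_set = [(count_features(tokens), tag) for tokens, tag in documents]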
