from __future__ import division
import urllib
import json
from math import log

def hits(word1, word2=""):
    query = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=%s"
    if word2 == "":
        # single-term query; URL-encode it so spaces are legal in the URL
        results = urllib.urlopen(query % urllib.quote_plus(word1))
    else:
        # NOTE: the original expression  query % word1+" "+"AROUND(10)"+" "+word2
        # formatted only word1 into the URL (% binds tighter than +) and then
        # appended the rest, with raw spaces, after the q= parameter.
        results = urllib.urlopen(query % urllib.quote_plus(
            word1 + " AROUND(10) " + word2))
    json_res = json.loads(results.read())
    google_hits = int(json_res['responseData']['cursor']['estimatedResultCount'])
    return google_hits

def so(phrase):
    num = hits(phrase, "excellent")   # hits of phrase near "excellent"
    den = hits(phrase, "poor")        # hits of phrase near "poor"
    ratio = num / den
    sop = log(ratio)
    return sop

print so("ugly product")
I need this code to calculate the Pointwise Mutual Information (PMI), which can be used to classify reviews as positive or negative. Basically, I am using the technique specified by Turney (2002): http://acl.ldc.upenn.edu/P/P02/P02-1053.pdf as an example of an unsupervised classification method for sentiment analysis.
As explained in the paper, the semantic orientation of a phrase is negative if the phrase is more strongly associated with the word "poor" and positive if it is more strongly associated with the word "excellent".
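Note that my code above only takes log(hits(phrase NEAR "excellent") / hits(phrase NEAR "poor")), whereas Turney's full SO formula also normalizes by the standalone counts of "excellent" and "poor". Here is a minimal sketch of the full formula computed from raw hit counts (the counts below are made up for illustration; only the formula comes from the paper):

```python
from math import log

def so_from_counts(hits_near_excellent, hits_near_poor,
                   hits_excellent, hits_poor):
    """Turney (2002) semantic orientation:
    SO(phrase) = log2( (hits(phrase NEAR "excellent") * hits("poor"))
                     / (hits(phrase NEAR "poor") * hits("excellent")) )
    """
    return log((hits_near_excellent * hits_poor) /
               (hits_near_poor * hits_excellent), 2)

# hypothetical counts, for illustration only
print(so_from_counts(1500, 3000, 2.0e8, 1.0e8))  # -2.0: leans toward "poor"
```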
The code above calculates the SO of a phrase. I use Google to obtain the number of hits and compute the SO from that (since AltaVista is no longer available).
The values computed are very erratic and don't follow any particular pattern. For example, SO("ugly product") turns out to be 2.85462098541 while SO("beautiful product") is 1.71395061117, whereas the former is expected to be negative and the latter positive.
Is there something wrong with the code? Is there an easier way to calculate the SO of a phrase (using PMI) with some Python library, say NLTK? I tried NLTK but was not able to find any explicit method that computes the PMI.
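For reference, PMI itself is straightforward to estimate from counts, which is what I understand the search-engine hits are standing in for. A minimal sketch over a toy token list (pure Python, no NLTK; the window size and the toy sentence are my own assumptions, not anything from the paper):

```python
from __future__ import division
from math import log

def pmi(tokens, x, y, window=2):
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), where p(x, y) is
    estimated from co-occurrences of y within `window` tokens after x."""
    n = len(tokens)
    px = tokens.count(x) / n
    py = tokens.count(y) / n
    co = sum(1 for i, t in enumerate(tokens)
             if t == x and y in tokens[i + 1:i + 1 + window])
    pxy = co / n
    return log(pxy / (px * py), 2)

tokens = "the food was excellent but the service was poor".split()
print(pmi(tokens, "was", "excellent", window=1))  # ~ 2.17
```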