I'm looking to get the similarity between a single word and each word in a sentence using NLTK.
NLTK can get the similarity between two specific words as shown below. This method requires that a specific reference to the word is given, in this case it is 'dog.n.01' where dog is a noun and we want to use the first (01) NLTK definition.
dog = wordnet.synset('dog.n.01')
cat = wordnet.synset('cat.n.01')
print dog.path_similarity(cat)
>> 0.2
The problem is that I need to get the part of speech information from each word in the sentence. The NLTK package has the ability to get the parts of speech for each word in a sentence as shown below. However, these speech parts ('NN', 'VB', 'PRP'...) don't match up with the format that the synset takes as a parameter.
text = word_tokenize("They refuse to permit us to obtain the refuse permit")
pos_tag(text)
>> [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
Is is possible to get the synset formatted data from pos_tag() results in NLTK? By synset formatted I mean the format like dog.n.01
lesk
example, but your code seems to give better results, I wonder why that is (just curious): nltk.org/howto/wsd.html – Somite