NLTK convert tokenized sentence to synset format
I'm looking to get the similarity between a single word and each word in a sentence using NLTK.

NLTK can get the similarity between two specific words, as shown below. This method requires a specific reference to the word; in this case it is 'dog.n.01', where dog is a noun and we want to use the first (01) WordNet definition of the word.

dog = wordnet.synset('dog.n.01')
cat = wordnet.synset('cat.n.01')
print(dog.path_similarity(cat))
>> 0.2

The problem is that I need the part-of-speech information for each word in the sentence. NLTK can tag the parts of speech of each word in a sentence, as shown below. However, these tags ('NN', 'VB', 'PRP', ...) don't match the format that wordnet.synset takes as a parameter.

text = word_tokenize("They refuse to permit us to obtain the refuse permit")
pos_tag(text)
>> [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]

Is it possible to get the synset-formatted data from the pos_tag() results in NLTK? By synset formatted, I mean a format like dog.n.01.

Rubellite answered 21/12, 2014 at 16:58 Comment(0)

You can use a simple conversion function:

from nltk.corpus import wordnet as wn

def penn_to_wn(tag):
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    return None

After tagging a sentence, you can tie each word in the sentence to a synset using this function. Here's an example:

from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize

sentence = "I am going to buy some gifts"
tagged = pos_tag(word_tokenize(sentence))

synsets = []
lemmatzr = WordNetLemmatizer()

for token in tagged:
    wn_tag = penn_to_wn(token[1])
    if not wn_tag:
        continue

    # Lemmatize with the WordNet POS so that e.g. "going" becomes "go"
    lemma = lemmatzr.lemmatize(token[0], pos=wn_tag)

    candidates = wn.synsets(lemma, pos=wn_tag)
    if candidates:
        # Take the first synset: WordNet orders synsets by frequency,
        # so this is the most common sense of the word
        synsets.append(candidates[0])

print(synsets)

Result: [Synset('be.v.01'), Synset('travel.v.01'), Synset('buy.v.01'), Synset('gift.n.01')]
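
To get back to the original question (similarity between a single word and each word in the sentence), those synsets can then be compared against a reference synset. A rough sketch, reusing the synsets list from above and 'dog.n.01' purely as an example reference word:

dog = wn.synset('dog.n.01')

for syn in synsets:
    # path_similarity may return None when no path connects the two
    # synsets (e.g. when they have different parts of speech)
    print(syn, dog.path_similarity(syn))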

Justajustemilieu answered 23/12, 2014 at 11:59 Comment(2)
I found this lesk example, but your code seems to give better results, I wonder why that is (just curious): nltk.org/howto/wsd.html – Somite
Lesk doesn't really work that well. The only reason this code may work better is that it takes the first synset. In WordNet the synsets are ordered by frequency; in other words, the first synset is the most probable one if we don't take the context into account. – Justajustemilieu

You can use wordnet.synsets (note the plural), which accepts a pos argument and returns a list of matching synsets:

wordnet.synsets('dog', pos=wordnet.NOUN)

You'll still need to translate the tags produced by pos_tag into those supported by wordnet.synsets -- unfortunately, I don't know of a pre-built dictionary that does this (unless I'm missing the existence of such a correspondence table), so you'll need to build your own; you can do that once and pickle it for subsequent reloading. A minimal sketch of such a table follows.
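
For instance, something along these lines (a sketch only; the tag list is the standard Penn Treebank set, and the file name penn_to_wn.pickle is just an example):

import pickle
from nltk.corpus import wordnet

# Penn Treebank tags mapped to the four WordNet POS constants
PENN_TO_WN = {
    'JJ': wordnet.ADJ, 'JJR': wordnet.ADJ, 'JJS': wordnet.ADJ,
    'NN': wordnet.NOUN, 'NNS': wordnet.NOUN, 'NNP': wordnet.NOUN, 'NNPS': wordnet.NOUN,
    'RB': wordnet.ADV, 'RBR': wordnet.ADV, 'RBS': wordnet.ADV,
    'VB': wordnet.VERB, 'VBD': wordnet.VERB, 'VBG': wordnet.VERB,
    'VBN': wordnet.VERB, 'VBP': wordnet.VERB, 'VBZ': wordnet.VERB,
}

# Build and save it once...
with open('penn_to_wn.pickle', 'wb') as f:
    pickle.dump(PENN_TO_WN, f)

# ...then reload it in later sessions
with open('penn_to_wn.pickle', 'rb') as f:
    PENN_TO_WN = pickle.load(f)

print(PENN_TO_WN.get('NN'))   # 'n', i.e. wordnet.NOUN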

See http://www.nltk.org/book/ch05.html, section 1, on how to get help about a specific tagset -- e.g. nltk.help.upenn_tagset('N.*') will confirm that the UPenn tagset (which I believe is the default one used by pos_tag) uses 'N' followed by other characters to identify variants of what WordNet treats as a wordnet.NOUN.

I have not tried http://www.nltk.org/_modules/nltk/tag/mapping.html but it might be just what you require -- give it a try!
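
If that module does work, the usage might look roughly like this (untested sketch: it assumes nltk.tag.map_tag with the 'en-ptb' to 'universal' mapping, which may require downloading the universal_tagset resource, and the universal tags still need one extra hop to reach the WordNet constants):

from nltk.corpus import wordnet as wn
from nltk.tag import map_tag

# Penn Treebank tag -> simplified "universal" tag ('VERB', 'NOUN', ...)
universal = map_tag('en-ptb', 'universal', 'VBP')

# Universal tag -> WordNet POS constant, where one exists
UNIVERSAL_TO_WN = {'NOUN': wn.NOUN, 'VERB': wn.VERB, 'ADJ': wn.ADJ, 'ADV': wn.ADV}
print(UNIVERSAL_TO_WN.get(universal))   # 'v'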

Naphthalene answered 21/12, 2014 at 17:27 Comment(0)
