NLTK Thinks that Imperatives are Nouns

5

13

I'm using NLTK's pos_tag on recipes. The problem is that it tags words in the imperative mood as nouns. Shouldn't they be verbs? For example:

With the input:

combine 1 1/2 cups floud, 3/4 cup sugar, salt and baking powder

The output is:

[('combine', 'NN'), ('1', 'CD'), ('1/2', 'CD'), ('cups', 'NNS'), ('floud', 'VBD'), (',', ','), ('3/4', 'CD'), ('cup', 'NN'), ('sugar', 'NN'), (',', ','), ('salt', 'NN'), ('and', 'CC'), ('baking', 'VBG'), ('powder', 'NN')]

Here's the code I'm using for this:

    import nltk

    def part_of_speech(self, input_sentence):
        # Split the sentence into tokens, then tag each token.
        tokens = nltk.word_tokenize(input_sentence)
        return nltk.pos_tag(tokens)

Shouldn't 'combine' be tagged as some sort of verb? Is this the fault of NLTK, or am I doing something wrong?

Lowbred answered 23/2, 2012 at 2:0 Comment(2)
Can you show us what code you're currently using to generate the above?Blessing
While I don't know anything about nltk, in an unusual coincidence, this week the annual NFL Scouting Combine is being held in Indianapolis, in which "combine" is used as a noun.Clothilde
13

What you're seeing is a very common problem in traditional statistical natural language processing (NLP). In short, the data you are using the tagger on doesn't look like the data it was trained on. NLTK doesn't document the details, but as far as I know the default tagger is trained on Wall Street Journal articles, the Brown Corpus, or some combination of the two. These corpora contain very few imperatives, so when you give it data with imperatives it doesn't do the right thing.

A good long-term solution would be to correct the tags for a large corpus of recipes and train on the corrected data; that removes the mismatch between the training and test data. This is, however, a huge amount of work. Ideally, a corpus with a lot of imperatives would already exist; my research group has looked into this and we have not found a suitable one, although we are in the process of producing one.

A much simpler solution, which I've been using on a recent project that required imperatives to be understood correctly, is to list the imperatives you care about and force the tags for those words to be correct.

So in the example below, I made a dictionary saying that "combine" should be treated as a verb, and then used a list comprehension to change the tags.

tagged_words = [('combine', 'NN'), ('1', 'CD'), ('1/2', 'CD'), ('cups', 'NNS'), ('flour', 'VBD')]
force_tags = {'combine': 'VB'}
new_tagged_words = [(word, force_tags.get(word, tag)) for word, tag in tagged_words]

new_tagged_words now contains the original tags, except where force_tags has an entry for the word.

>>> new_tagged_words
[('combine', 'VB'), ('1', 'CD'), ('1/2', 'CD'), ('cups', 'NNS'), ('flour', 'VBD')]

This solution does require you to list the words you want to force to verbs. This is far from ideal, but there isn't a better general solution.
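One limitation of a plain word-to-tag dictionary is that it also overrides the word where it really is a noun ("the combine broke down"). A minimal sketch of a narrower variant, which overrides only the first token of a sentence and matches case-insensitively, is shown here. The function name force_first_word_tag is made up for illustration and is not part of NLTK.

```python
def force_first_word_tag(tagged_words, force_tags):
    """Override the tag of the first token only, matching case-insensitively.

    tagged_words: list of (word, tag) pairs, e.g. the output of nltk.pos_tag.
    force_tags:   dict mapping lowercase words to the tag to force.
    """
    if not tagged_words:
        return []
    first_word, first_tag = tagged_words[0]
    # Fall back to the tagger's own tag when the word is not in force_tags.
    new_first = (first_word, force_tags.get(first_word.lower(), first_tag))
    return [new_first] + list(tagged_words[1:])


tagged_words = [('Combine', 'NN'), ('the', 'DT'), ('combine', 'NN'), ('parts', 'NNS')]
force_tags = {'combine': 'VB'}
print(force_first_word_tag(tagged_words, force_tags))
```

Only the sentence-initial 'Combine' is retagged; the second occurrence keeps whatever tag the tagger gave it. This assumes one imperative per input and that it starts the sentence, which will not hold for every recipe line.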

Dincolo answered 5/3, 2012 at 19:32 Comment(4)
I see. So does this mean that POS is simply a string match? Or am I oversimplifying it?Lowbred
Taggers generally take at least two kinds of information into account: positional information between the tags (e.g., nouns follow determiners like 'the'), and information about the possible tags each word can take (e.g., how often 'steer' is a verb or a noun). In this case, it's mostly the positional information that's the problem, because in the training data sentences almost never start with verbs.Dincolo
For information, I have the same problem when I analyse German texts with the MatePosTagger.Algiers
I also noticed that capitalization at the start of an imperative sentence tends to make it think the first word is a proper noun. For example: "Go north" -> [('Go', 'NNP'), ('north', 'RB')], whereas "go north" -> [('go', 'VB'), ('north', 'JJ')]. "Pick up the map" -> [('Pick', 'NNP'), ('up', 'RP'), ('the', 'DT'), ('map', 'NN')], ... etc.Koel
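The capitalization effect in the last comment suggests one more cheap workaround: lowercase the first token before tagging, then restore the original spelling in the result. The sketch below is an illustration under that assumption, not something from the answer itself. The helper name is hypothetical, and the tagger is passed in as a parameter so that any function with the same shape as nltk.pos_tag can be used.

```python
def tag_with_lowercased_first(tokens, tagger):
    """Tag tokens with the first one lowercased, then restore its spelling.

    tokens: list of word strings.
    tagger: callable taking a token list and returning (word, tag) pairs,
            for example nltk.pos_tag.
    """
    if not tokens:
        return []
    lowered = [tokens[0].lower()] + list(tokens[1:])
    tagged = tagger(lowered)
    # Keep the tag computed for the lowercased word, but report the
    # original capitalized form to the caller.
    return [(tokens[0], tagged[0][1])] + list(tagged[1:])

# Usage with NLTK (requires the tokenizer and tagger models to be downloaded):
#   from nltk import pos_tag, word_tokenize
#   tag_with_lowercased_first(word_tokenize("Go north"), pos_tag)
```

The trade-off is that a sentence that really does start with a proper noun loses its capitalization cue, so this is only reasonable on text known to be mostly imperative.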
4

Training on imperative corpora would be the best option. But if you don't have the time or don't think the effort is worth it, here is a simple solution (more of a hack): put a pronoun like 'they' before every sentence that you are sure is imperative. NLTK's default tagger then does a fine job.

Deni answered 20/12, 2012 at 19:46 Comment(0)
1

'combine' may be tagged as a noun because it can be one, as in "combine harvester". My guess is that you should tune the tagger for your use case or change the corpus it is trained on.

Blessing answered 23/2, 2012 at 2:4 Comment(2)
How do you go about doing that? I'm a complete noob when it comes to NLTK.Lowbred
There are two great NLTK Python books; I have both. If this is going to be a big thing for you, get them lol. Otherwise raise your bounty and I might code up an example for you.Blessing
1

Try the Stanford POS tagger.

I've had better luck with it. It has been trained with more imperative sentences compared to the default NLTK tagger.

Also dockerized at cuzzo/stanford-pos-tagger.

e.g.

Follow us on Instagram
VB PRP IN NN
Hexamerous answered 1/5, 2016 at 23:49 Comment(2)
How do you swap out the NLTK tagger for the Stanford one?Saint
I tried it and it does indeed get imperatives right, e.g. "Go north" -> [('Go', 'VB'), ('north', 'RB')], not [('Go', 'NNP'), ('north', 'RB')]. However, I found it about 1500 times slower than the built-in NLTK pos tagger!Koel
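On the question of swapping taggers: NLTK has shipped a wrapper class, nltk.tag.StanfordPOSTagger, that calls the Java tagger. The following is a minimal sketch, assuming Java is installed and the Stanford tagger distribution has been downloaded separately. The function name and both file paths are placeholders, and newer NLTK releases mark this wrapper as deprecated in favor of a CoreNLP server interface, so check the version you have.

```python
def stanford_pos_tag(tokens, model_path, jar_path):
    """Tag tokens with the Stanford tagger through NLTK's wrapper.

    model_path and jar_path are placeholders for wherever the Stanford
    POS tagger files were unpacked; Java must be available on the PATH.
    """
    # Imported here so that defining the function does not require NLTK.
    from nltk.tag import StanfordPOSTagger

    tagger = StanfordPOSTagger(model_path, path_to_jar=jar_path)
    return tagger.tag(list(tokens))

# Hypothetical usage (paths are assumptions, adjust to your installation):
#   stanford_pos_tag(
#       ["Go", "north"],
#       "stanford-postagger/models/english-bidirectional-distsim.tagger",
#       "stanford-postagger/stanford-postagger.jar",
#   )
```

As far as I know the wrapper launches a Java process for each tagging call, which would account for a good part of the slowdown reported above; tagging many sentences in one call with the wrapper's tag_sents method should reduce that overhead.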
0
>>> from nltk import pos_tag, word_tokenize
>>> def imperative_pos_tag(sent):
...     return pos_tag(['He']+sent)[1:]
... 
>>> sent1 = 'combine 1 1/2 cups floud, 3/4 cup sugar, salt and baking powder'

>>> imperative_pos_tag(word_tokenize(sent1))
[('combine', 'VBD'), ('1', 'CD'), ('1/2', 'CD'), ('cups', 'NNS'), ('floud', 'VBD'), (',', ','), ('3/4', 'CD'), ('cup', 'NN'), ('sugar', 'NN'), (',', ','), ('salt', 'NN'), ('and', 'CC'), ('baking', 'VBG'), ('powder', 'NN')]

Also, take a look at "Python NLTK pos_tag not returning the correct part-of-speech tag" and "NLTK identifies verb as Noun in Imperatives".

Drusy answered 26/8, 2015 at 11:47 Comment(0)