Count verbs, nouns, and other parts of speech with Python's NLTK

I have multiple texts and I would like to create profiles of them based on their usage of various parts of speech, like nouns and verbs. Basically, I need to count how many times each part of speech is used.

I have tagged the text but am not sure how to go further:

import nltk

tokens = nltk.word_tokenize(text.lower())
text = nltk.Text(tokens)
tags = nltk.pos_tag(text)

How can I save the counts for each part of speech into a variable?

Impalpable answered 20/5, 2012 at 15:41 Comment(1)
Have you come across collections.Counter? – Gerianne

The pos_tag method gives you back a list of (token, tag) pairs:

tagged = [('the', 'DT'), ('dog', 'NN'), ('sees', 'VB'), ('the', 'DT'), ('cat', 'NN')] 

If you are using Python 2.7 or later, then you can do it simply with:

>>> from collections import Counter
>>> counts = Counter(tag for word,tag in tagged)
>>> counts
Counter({'DT': 2, 'NN': 2, 'VB': 1})
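Since Counter is a dict subclass, the counts are already "saved into a variable" in the sense the question asks: you can look up individual tags directly, and missing tags return 0 rather than raising KeyError. A small sketch using the same hard-coded tagged list as above:

```python
from collections import Counter

# The same (token, tag) pairs as in the example above
tagged = [('the', 'DT'), ('dog', 'NN'), ('sees', 'VB'),
          ('the', 'DT'), ('cat', 'NN')]

counts = Counter(tag for word, tag in tagged)

# Direct lookup; a tag that never occurred gives 0, not a KeyError
print(counts['NN'])  # 2
print(counts['JJ'])  # 0

# most_common() ranks tags from most to least frequent
print(counts.most_common())
```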

To normalize the counts (giving you the proportion of each) do:

>>> total = sum(counts.values())
>>> dict((tag, float(count)/total) for tag, count in counts.items())
{'DT': 0.4, 'VB': 0.2, 'NN': 0.4}
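To get a breakdown like "40% nouns, 20% verbs", you can collapse the fine-grained Penn Treebank tags into coarse categories before counting. A sketch on modern Python, assuming nouns are the tags starting with 'NN' and verbs the tags starting with 'VB', and using a hard-coded tagged list in place of real nltk.pos_tag output:

```python
from collections import Counter

# Hypothetical pos_tag output; in practice this comes from nltk.pos_tag
tagged = [('the', 'DT'), ('dog', 'NN'), ('sees', 'VBZ'),
          ('the', 'DT'), ('cat', 'NN')]

def coarse(tag):
    # NN, NNS, NNP, NNPS -> noun; VB, VBD, VBG, VBN, VBP, VBZ -> verb
    if tag.startswith('NN'):
        return 'noun'
    if tag.startswith('VB'):
        return 'verb'
    return 'other'

counts = Counter(coarse(tag) for word, tag in tagged)
total = sum(counts.values())
percentages = {cat: 100.0 * n / total for cat, n in counts.items()}
print(percentages)  # {'other': 40.0, 'noun': 40.0, 'verb': 20.0}
```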

Note that in versions of Python before 2.7, Counter is not available, so you'll have to do the counting yourself with a defaultdict:

>>> from collections import defaultdict
>>> counts = defaultdict(int)
>>> for word, tag in tagged:
...  counts[tag] += 1

>>> counts
defaultdict(<type 'int'>, {'DT': 2, 'VB': 1, 'NN': 2})
Rugen answered 20/5, 2012 at 15:49 Comment(4)
That's absolutely amazing, thank you. I am using Python 2.7. Is there a way that I can now figure out what proportion of the tagged text uses each part of speech? For instance, by dividing the number of nouns by the total tags and multiplying by 100 (to get a percent)... but doing that for everything? So get results like: 23% nouns, 14% verbs and so on? – Impalpable
@Zach, I've added something about normalizing the counts for you. – Rugen
@dgh, thanks, it works great. One last question: do you know which tag set is used by nltk.pos_tag()? e.g. Brown, Penn Treebank, etc.? – Impalpable
nltk.pos_tag() uses the Penn Treebank tagset. – Refreshment
