NLTK - Get and Simplify List of Tags
Asked Answered

I'm using the Brown Corpus. I want some way to print out all the possible tags and their names (not just the tag abbreviations). There are also quite a few tags; is there a way to 'simplify' them? By simplify I mean combining two extremely similar tags into one and re-tagging the affected words with the merged tag.

Herder answered 11/6, 2015 at 20:50 Comment(0)

This has been discussed previously in:

The POS tags output by nltk.pos_tag come from the Penn Treebank tagset, https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html; see What are all possible pos tags of NLTK?

There are several approaches, but the simplest might be to use only the first two characters of each POS tag as the main tag set, because the first two characters of a tag represent the broad POS classes in the Penn Treebank tagset.

For instance, NNS means plural noun and NNP means proper noun; the NN tag subsumes both by representing the generic noun.

Here's a code example:

>>> from nltk.corpus import brown
>>> from collections import defaultdict

>>> x = defaultdict(list)
>>> for word,pos in brown.tagged_words()[1:100]:
...     x[pos].append(word)
... 
>>> x
defaultdict(<type 'list'>, {u'DTI': [u'any'], u'BEN': [u'been'], u'VBD': [u'said', u'produced', u'took', u'said'], u'NP$': [u"Atlanta's"], u'NN-TL': [u'County', u'Jury', u'City', u'Committee', u'City', u'Court', u'Judge', u'Mayor-nominate'], u'VBN': [u'conducted', u'charged', u'won'], u"''": [u"''", u"''", u"''"], u'WDT': [u'which', u'which', u'which'], u'JJ': [u'recent', u'over-all', u'possible', u'hard-fought'], u'VBZ': [u'deserves'], u'NN': [u'investigation', u'primary', u'election', u'evidence', u'place', u'jury', u'term-end', u'charge', u'election', u'praise', u'manner', u'election', u'term', u'jury', u'primary'], u',': [u',', u','], u'.': [u'.', u'.'], u'TO': [u'to'], u'NP': [u'September-October', u'Durwood', u'Pye', u'Ivan'], u'BEDZ': [u'was', u'was'], u'NR': [u'Friday'], u'NNS': [u'irregularities', u'presentments', u'thanks', u'reports', u'irregularities'], u'``': [u'``', u'``', u'``'], u'CC': [u'and'], u'RBR': [u'further'], u'AT': [u'an', u'no', u'The', u'the', u'the', u'the', u'the', u'the', u'the', u'The', u'the'], u'IN': [u'of', u'in', u'of', u'of', u'for', u'in', u'by', u'of', u'in', u'by'], u'CS': [u'that', u'that'], u'NP-TL': [u'Fulton', u'Atlanta', u'Fulton'], u'HVD': [u'had', u'had'], u'IN-TL': [u'of'], u'VB': [u'investigate'], u'JJ-TL': [u'Grand', u'Executive', u'Superior']})
>>> len(x)
29

The shortened version looks like this:

>>> x = defaultdict(list)
>>> for word,pos in brown.tagged_words()[1:100]:
...     x[pos[:2]].append(word)
... 
>>> x
defaultdict(<type 'list'>, {u'BE': [u'was', u'been', u'was'], u'VB': [u'said', u'produced', u'took', u'said', u'deserves', u'conducted', u'charged', u'investigate', u'won'], u'WD': [u'which', u'which', u'which'], u'RB': [u'further'], u'NN': [u'County', u'Jury', u'investigation', u'primary', u'election', u'evidence', u'irregularities', u'place', u'jury', u'term-end', u'presentments', u'City', u'Committee', u'charge', u'election', u'praise', u'thanks', u'City', u'manner', u'election', u'term', u'jury', u'Court', u'Judge', u'reports', u'irregularities', u'primary', u'Mayor-nominate'], u'TO': [u'to'], u'CC': [u'and'], u'HV': [u'had', u'had'], u'``': [u'``', u'``', u'``'], u',': [u',', u','], u'.': [u'.', u'.'], u"''": [u"''", u"''", u"''"], u'CS': [u'that', u'that'], u'AT': [u'an', u'no', u'The', u'the', u'the', u'the', u'the', u'the', u'the', u'The', u'the'], u'JJ': [u'Grand', u'recent', u'Executive', u'over-all', u'Superior', u'possible', u'hard-fought'], u'IN': [u'of', u'in', u'of', u'of', u'of', u'for', u'in', u'by', u'of', u'in', u'by'], u'NP': [u'Fulton', u"Atlanta's", u'Atlanta', u'September-October', u'Fulton', u'Durwood', u'Pye', u'Ivan'], u'NR': [u'Friday'], u'DT': [u'any']})
>>> len(x)
19
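
If you only need per-tag counts rather than the word lists, a collections.Counter over the same slice gives them directly. This is just a quick sketch; the single count shown corresponds to the size of the NN list in the output above.

>>> from collections import Counter
>>> Counter(pos[:2] for word, pos in brown.tagged_words()[1:100]).most_common(1)
[(u'NN', 28)]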

Another solution is to use the universal POS tags; see http://www.nltk.org/book/ch05.html

>>> x = defaultdict(list)
>>> for word,pos in brown.tagged_words(tagset='universal')[1:100]:
...     x[pos].append(word)
... 
>>> x
defaultdict(<type 'list'>, {u'ADV': [u'further'], u'NOUN': [u'Fulton', u'County', u'Jury', u'Friday', u'investigation', u"Atlanta's", u'primary', u'election', u'evidence', u'irregularities', u'place', u'jury', u'term-end', u'presentments', u'City', u'Committee', u'charge', u'election', u'praise', u'thanks', u'City', u'Atlanta', u'manner', u'election', u'September-October', u'term', u'jury', u'Fulton', u'Court', u'Judge', u'Durwood', u'Pye', u'reports', u'irregularities', u'primary', u'Mayor-nominate', u'Ivan'], u'ADP': [u'of', u'that', u'in', u'that', u'of', u'of', u'of', u'for', u'in', u'by', u'of', u'in', u'by'], u'DET': [u'an', u'no', u'any', u'The', u'the', u'which', u'the', u'the', u'the', u'the', u'which', u'the', u'The', u'the', u'which'], u'.': [u'``', u"''", u'.', u',', u',', u'``', u"''", u'.', u'``', u"''"], u'PRT': [u'to'], u'VERB': [u'said', u'produced', u'took', u'said', u'had', u'deserves', u'was', u'conducted', u'had', u'been', u'charged', u'investigate', u'was', u'won'], u'CONJ': [u'and'], u'ADJ': [u'Grand', u'recent', u'Executive', u'over-all', u'Superior', u'possible', u'hard-fought']})
>>> len(x)
9
Womanish answered 12/6, 2015 at 0:59 Comment(2)
I printed out the total number of 'tags' in the Brown corpus, and it gave me 472. I'm assuming this is because there are a number of tag combinations, such as NP+HVZ-NC? Does that mean the word in question is possibly either of the two tags? When shortening these tags with pos[:2] it obviously cuts off tags such as "FW-AT+NP-TL" to just FW. That's an absurdly specific tag anyhow. I was really looking for a happy medium between the Universal and Brown tagsets. Using [:2] I get a total of 49, which is a great number, I think, even though there are some weird tags like ")-". – Herder
Using a defaultdict with set is much more efficient here. – Effectuate
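
Following up on the comment above, here is a minimal sketch of the defaultdict(set) variant: it keeps only the distinct words per simplified tag, which is cheaper when the slice gets large, and the number of keys is unchanged.

>>> from collections import defaultdict
>>> from nltk.corpus import brown
>>> x = defaultdict(set)
>>> for word, pos in brown.tagged_words()[1:100]:
...     x[pos[:2]].add(word)
... 
>>> len(x)
19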

Many of the tagsets in the NLTK's corpora come with predefined mappings to a simplified, "universal" tagset. In addition to being more convenient for many purposes, the simplified tagset provides a degree of compatibility between the different corpora that can be remapped to the universal tagset.

For the Brown corpus, you can simply fetch the tagged words or sentences like this:

brown.tagged_words(tagset="universal")

For example:

>>> print(brown.tagged_words(tagset="universal")[:10])
[('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'), ('Jury', 'NOUN'),
('said', 'VERB'), ('Friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'), 
('of', 'ADP')]

To see the definitions of the original, complex tags in the Brown corpus, use nltk.help.brown_tagset() (the nltk.help module is also mentioned in this answer, linked by Alvas). You can get the whole list by calling it without arguments, or pass an argument (a regexp) to get only the matching tag(s). The results include a brief definition and examples.

>>> nltk.help.brown_tagset("DT.*")
DT: determiner/pronoun, singular
    this each another that 'nother
DT$: determiner/pronoun, singular, genitive
    another's
DT+BEZ: determiner/pronoun + verb 'to be', present tense, 3rd person singular
    that's
DT+MD: determiner/pronoun + modal auxillary
    that'll this'll
...
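
If you want the Brown-to-universal mapping for a single tag, rather than re-reading the whole corpus with tagset="universal", nltk.tag.map_tag can look it up directly. This is a small sketch; it assumes the universal_tagset mapping data has been downloaded, e.g. via nltk.download('universal_tagset').

>>> from nltk.tag import map_tag
>>> print(map_tag('brown', 'universal', 'NNS'))   # plural noun collapses to the generic NOUN
NOUN
>>> print(map_tag('brown', 'universal', 'VBD'))   # past-tense verb collapses to VERB
VERB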
Hephaestus answered 12/6, 2015 at 12:19 Comment(0)
