How do I get a set of grammar rules from Penn Treebank using python & NLTK?
I'm fairly new to NLTK and Python. I've been creating sentence parses using the toy grammars given in the examples, but I'd like to know whether it's possible to use a grammar learned from a portion of the Penn Treebank, say, as opposed to just writing my own or using the toy grammars. (I'm using Python 2.7 on a Mac.) Many thanks

Florella answered 14/8, 2011 at 13:13 Comment(0)
If you want a grammar that precisely captures the Penn Treebank sample that comes with NLTK, you can do the following, assuming you've downloaded the Treebank data for NLTK (see the comments below):

import nltk
from nltk.corpus import treebank
from nltk.grammar import ContextFreeGrammar, Nonterminal

# Collect every distinct CFG production used in the parsed Treebank sample
tbank_productions = set(production for sent in treebank.parsed_sents()
                        for production in sent.productions())
tbank_grammar = ContextFreeGrammar(Nonterminal('S'), list(tbank_productions))
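(On NLTK 3 and later, `ContextFreeGrammar` was renamed `CFG`. As a rough sketch of the same idea on a current NLTK, the toy tree below stands in for one entry of `treebank.parsed_sents()` so the snippet runs without the corpus data:)

```python
from nltk import Tree
from nltk.grammar import CFG, Nonterminal

# A tiny stand-in for one tree from treebank.parsed_sents()
sent = Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBD barked)))")

# Same idea as above: collect the tree's CFG productions into a grammar
productions = set(sent.productions())
grammar = CFG(Nonterminal('S'), list(productions))
print(grammar)
```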

However, this will probably not give you anything useful. Since NLTK only supports parsing with grammars in which all the terminals are specified, you will only be able to parse sentences containing words that appear in the Treebank sample.
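You can see that limitation directly with `check_coverage`, which raises an error for any token the grammar has no terminal for (the toy one-sentence grammar here stands in for the Treebank-derived one):

```python
from nltk import Tree
from nltk.grammar import CFG, Nonterminal

tree = Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBD barked)))")
grammar = CFG(Nonterminal('S'), tree.productions())

grammar.check_coverage(['the', 'dog', 'barked'])  # fine: all words are known terminals
try:
    grammar.check_coverage(['the', 'cat', 'barked'])  # 'cat' never occurred in training
    covered = True
except ValueError:
    covered = False
```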

Also, because of the flat structure of many phrases in the Treebank, this grammar will generalize very poorly to sentences that weren't in the training data. This is why NLP applications that have tried to parse the Treebank have not taken the approach of learning CFG rules from it. The closest technique to that would be Rens Bod's Data-Oriented Parsing approach, but it is much more sophisticated.

Finally, parsing with this grammar will be so unbelievably slow as to be useless. So if you want to see the approach in action on the grammar from a single sentence, just to prove that it works, try the following code (after the imports above):

# Build a grammar from the productions of a single parsed sentence
mini_grammar = ContextFreeGrammar(Nonterminal('S'),
                                  treebank.parsed_sents()[0].productions())
parser = nltk.parse.EarleyChartParser(mini_grammar)
print parser.parse(treebank.sents()[0])
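(For reference, on NLTK 3 the same demo would use `CFG`, and `parser.parse(...)` returns an iterator of trees rather than printing one. A self-contained version, with a hand-built one-sentence grammar standing in for the corpus data:)

```python
import nltk
from nltk import Tree
from nltk.grammar import CFG, Nonterminal

# One parsed sentence stands in for treebank.parsed_sents()[0]
gold = Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBD barked)))")
mini_grammar = CFG(Nonterminal('S'), gold.productions())

parser = nltk.parse.EarleyChartParser(mini_grammar)
trees = list(parser.parse(['the', 'dog', 'barked']))  # iterator on NLTK 3
for tree in trees:
    print(tree)
```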
Geibel answered 14/9, 2011 at 14:43 Comment(2)
I'm unable to run your second code snippet. It gives me the following error: Resource 'corpora/treebank/combined' not found. – Overscore
The most likely cause is that you didn't install the Treebank data when you installed NLTK. See the NLTK Data instructions. Basically, at a Python interpreter you'll need to import nltk, call nltk.download(), click the "Corpora" tab in the window that comes up, select "treebank", click "Download", and close the window when you're done. – Geibel
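(The same download can also be done non-interactively, without the GUI; this assumes network access on the first run:)

```python
import nltk
nltk.download('treebank')  # fetches corpora/treebank directly, no download window
```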
It is possible to train a chunker on the treebank_chunk or conll2000 corpora. You don't get a grammar out of it, but you do get a picklable object that can parse phrase chunks. See How to Train a NLTK Chunker, Chunk Extraction with NLTK, and NLTK Classified Based Chunker Accuracy.
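As a minimal sketch of that idea: the `UnigramChunker` class below follows the pattern from the NLTK book (learn an IOB chunk tag for each POS tag), and the one-sentence hand-built training set is a stand-in for `conll2000.chunked_sents('train.txt')`, which you'd use in practice:

```python
import nltk
from nltk import Tree

class UnigramChunker(nltk.ChunkParserI):
    """Chunker that learns an IOB chunk tag for each POS tag from chunked trees."""
    def __init__(self, train_sents):
        # Convert each chunk tree to (pos, iob) pairs for tagger training
        train_data = [[(pos, iob) for _, pos, iob in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, tagged_sent):
        pos_tags = [pos for _, pos in tagged_sent]
        iob_tags = [iob for _, iob in self.tagger.tag(pos_tags)]
        conll = [(word, pos, iob) for (word, pos), iob in zip(tagged_sent, iob_tags)]
        return nltk.chunk.conlltags2tree(conll)

# Tiny hand-built chunked sentence standing in for the conll2000 training data
train = [Tree('S', [Tree('NP', [('the', 'DT'), ('dog', 'NN')]), ('barked', 'VBD')])]
chunker = UnigramChunker(train)
result = chunker.parse([('the', 'DT'), ('cat', 'NN'), ('ran', 'VBD')])
print(result)
```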

Susiesuslik answered 14/8, 2011 at 17:1 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.