Firstly, let's take a look at the POS tags that NLTK gives:
>>> from nltk import pos_tag
>>> sent = 'The pizza was awesome and brilliant'.split()
>>> pos_tag(sent)
[('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')]
>>> sent = 'The pizza was good but pasta was bad'.split()
>>> pos_tag(sent)
[('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ'), ('but', 'CC'), ('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')]
(Note: The above are the outputs from NLTK v3.1 pos_tag
, older version might differ)
What you want to capture is essentially:
- NN VBD JJ CC JJ
- NN VBD JJ
So let's catch them with these patterns:
>>> from nltk import RegexpParser
>>> sent1 = ['The', 'pizza', 'was', 'awesome', 'and', 'brilliant']
>>> sent2 = ['The', 'pizza', 'was', 'good', 'but', 'pasta', 'was', 'bad']
>>> patterns = """
... P: {<NN><VBD><JJ><CC><JJ>}
... {<NN><VBD><JJ>}
... """
>>> PChunker = RegexpParser(patterns)
>>> PChunker.parse(pos_tag(sent1))
Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')])])
>>> PChunker.parse(pos_tag(sent2))
Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ')]), ('but', 'CC'), Tree('P', [('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')])])
So that's "cheating" by hardcoding!!!
Let's go back to the POS patterns:
- NN VBD JJ CC JJ
- NN VBD JJ
Can be simplified to:
So you can use the optional operators in the regex, e.g.:
>>> patterns = """
... P: {<NN><VBD><JJ>(<CC><JJ>)?}
... """
>>> PChunker = RegexpParser(patterns)
>>> PChunker.parse(pos_tag(sent1))
Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')])])
>>> PChunker.parse(pos_tag(sent2))
Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ')]), ('but', 'CC'), Tree('P', [('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')])])
Most probably you're using the old tagger, that's why your patterns are different but I guess you see how you could capture the phrases you need using the example above.
The steps are:
- First, check what is the POS patterns using the
pos_tag
- Then generalize patterns and simplify them
- Then put them into the
RegexpParser
pizza was good
a noun phrase, neither is it a verb phrase since you dropped the determiner. It's more like some phrase structure that you want to extract. – Cherubini