I have this simple example of chunking in nltk.
My data:
data = 'The little yellow dog will then walk to the Starbucks, where he will introduce them to Michael.'
...pre-processing ...
data_tok = nltk.word_tokenize(data) #tokenisation
data_pos = nltk.pos_tag(data_tok) #POS tagging
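At this point `data_pos` is just a list of `(token, tag)` pairs, which is what the chunk grammar below operates on. For example (exact tags depend on the tagger, but roughly):

print(data_pos[:4])
# [('The', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN')]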
CHUNKING:
cfg_1 = "CUSTOMCHUNK: {<VB><.*>*?<NNP>}" #should return `walk to the Starbucks`, etc.
chunker = nltk.RegexpParser(cfg_1)
data_chunked = chunker.parse(data_pos)
This returns (among other things) `(CUSTOMCHUNK walk/VB to/TO the/DT Starbucks/NNP)`, so it did what I wanted it to do.
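If you only need the chunk strings, they can be pulled back out of the tree with NLTK's standard `subtrees()` traversal; a minimal sketch:

# keep only the CUSTOMCHUNK subtrees and join their (word, tag) leaves back into strings
for subtree in data_chunked.subtrees(filter=lambda t: t.label() == 'CUSTOMCHUNK'):
    print(' '.join(word for word, tag in subtree.leaves()))
# e.g. walk to the Starbucks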
Now my question: I want to switch to spaCy for my projects. How would I do this in spaCy?
I got as far as tagging it (the coarser `.pos_` attribute will do for me):
import spacy

parser = spacy.load('en_core_web_sm')  # load the small English pipeline
parsed_sent = parser(u'The little yellow dog will then walk to the Starbucks, where')
def print_coarse_pos(token):
    print(token, token.pos_)

for sentence in parsed_sent.sents:
    for token in sentence:
        print_coarse_pos(token)
... which prints the tokens and their coarse tags:
The DET
little ADJ
yellow ADJ
dog NOUN
will VERB
then ADV
walk VERB
...
How could I extract chunks with my own grammar?
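The closest analogue I've found so far is spaCy's rule-based `Matcher`, which works on per-token attribute patterns rather than a regex grammar string. A minimal sketch of what I mean (the pattern name `CUSTOMCHUNK` is mine, and `{'OP': '*'}` is not non-greedy like nltk's `*?`, so it can produce overlapping matches):

from spacy.matcher import Matcher

matcher = Matcher(parser.vocab)
# VERB, then any run of tokens, then a proper noun -- roughly <VB><.*>*?<NNP>
pattern = [{'POS': 'VERB'}, {'OP': '*'}, {'POS': 'PROPN'}]
matcher.add('CUSTOMCHUNK', [pattern])

doc = parser(u'The little yellow dog will then walk to the Starbucks, where he will introduce them to Michael.')
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)

But this doesn't feel like a direct equivalent of RegexpParser's grammar strings, so is there a more idiomatic way?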