Chunking with rule-based grammar in spaCy

I have this simple example of chunking in nltk.

My data:

data = 'The little yellow dog will then walk to the Starbucks, where he will introduce them to Michael.'

...pre-processing ...

import nltk

data_tok = nltk.word_tokenize(data)  # tokenisation
data_pos = nltk.pos_tag(data_tok)    # POS tagging

CHUNKING:

cfg_1 = "CUSTOMCHUNK: {<VB><.*>*?<NNP>}"  # should return `walk to the Starbucks`, etc.
chunker = nltk.RegexpParser(cfg_1)
data_chunked = chunker.parse(data_pos)

This returns (among other stuff): (CUSTOMCHUNK walk/VB to/TO the/DT Starbucks/NNP), so it did what I wanted it to do.

Now my question: I want to switch to spaCy for my projects. How would I do this in spaCy?

I got as far as tagging it (the coarse .pos_ attribute will do for me):

from spacy.en import English

parser = English()
parsed_sent = parser(u'The little yellow dog will then walk to the Starbucks, where')

def print_coarse_pos(token):
    print(token, token.pos_)

for sentence in parsed_sent.sents:
    for token in sentence:
        print_coarse_pos(token)

... which prints each token with its coarse tag: The DET, little ADJ, yellow ADJ, dog NOUN, will VERB, then ADV, walk VERB, ...

How could I extract chunks with my own grammar?

Brietta asked 18/4, 2016 at 15:50 Comment(1)
Maybe simply taking the POS tags as a string, creating the regex grammars we need, and parsing that could help solve the problem. To get back the right words, we need the reverse mapping from POS tag to word. (Paly)

Copied verbatim from https://github.com/spacy-io/spaCy/issues/342

There are a few ways to go about this. The closest functionality to that RegexpParser class is spaCy's Matcher. But for syntactic chunking, I would typically use the dependency parse. For instance, for NP chunking you have the doc.noun_chunks iterator:

doc = nlp(text)
for np in doc.noun_chunks:
    print(np.text)
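
Since the Matcher is named above as the closest equivalent to RegexpParser but not shown, here is a minimal sketch using the current spaCy API (spacy.load, the v3 Matcher.add signature, and the en_core_web_sm model are assumptions that postdate this answer):

import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

# Roughly <VB><.*>*?<NNP>: a verb, then any tokens, ending at a proper noun.
pattern = [{'POS': 'VERB'}, {'OP': '*'}, {'POS': 'PROPN'}]
matcher.add('CUSTOMCHUNK', [pattern], greedy='LONGEST')

doc = nlp('The little yellow dog will then walk to the Starbucks, '
          'where he will introduce them to Michael.')
spans = [doc[start:end] for _, start, end in matcher(doc)]
for span in filter_spans(spans):  # drop overlapping spans
    print(span.text)

Note that the Matcher has no non-greedy operator like the *? in the NLTK grammar, so greedy='LONGEST' plus filter_spans is used here to keep one match per region; the spans may come out longer than NLTK's lazy matches.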

The basic way that this works is something like this:

def iter_chunks(doc):
    for token in doc:
        if is_head_of_chunk(token):
            chunk_start = token.left_edge.i
            chunk_end = token.right_edge.i + 1
            yield doc[chunk_start : chunk_end]

You can define the hypothetical is_head_of_chunk function however you like. You can play around with the dependency parse visualizer to see the syntactic annotation scheme and figure out what labels to use: http://spacy.io/demos/displacy
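
To make that concrete, here is one possible is_head_of_chunk, an illustrative sketch of my own rather than part of the quoted answer, loosely mirroring the <VB><.*>*?<NNP> grammar from the question:

def is_head_of_chunk(token):
    # Hypothetical predicate: treat a verb as a chunk head when its
    # syntactic subtree ends in a proper noun (e.g. 'walk ... Starbucks').
    return token.pos_ == 'VERB' and token.right_edge.pos_ == 'PROPN'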

Dardanelles answered 27/4, 2016 at 5:12 Comment(2)
@Brietta I know you talked to Matthew Honnibal, but I figured his reply should be posted here as well. (Dardanelles)
Same comment as in the quoted post, and I have the same question: how would an example is_head_of_chunk function be defined? I think the regular expression for a grammar chunk is the hardest part. (Shamanism)
