Chunking some text with Stanford CoreNLP

I'm using Stanford CoreNLP, and I use this line to load the annotators that process my text:

props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");

Is there a module that I can load to chunk the text?

Or is there an alternative way to use Stanford CoreNLP to chunk some text?

Thank you

Riki answered 28/11, 2011 at 17:35 Comment(2)
By "chunking", do you mean picking out things like base NP chunks and verb groups? Or do you mean dividing a large text into segments, such as related groupings of text like individual blog comments? - Sclerosed
I have exactly the same question; in my case I mean extracting noun phrases, for example. - Printmaking

I think the parser output can be used to obtain NP chunks. Take a look at the context-free representation on the Stanford Parser website which provides example output.

Histogen answered 13/11, 2012 at 1:20 Comment(0)

For chunking alongside Stanford NLP, you can use one of the following standalone packages:

  • YamCha: SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
  • Mark Greenwood's Noun Phrase Chunker: A Java reimplementation of Ramshaw and Marcus (1995).
  • fnTBL: A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.

Source: http://www-nlp.stanford.edu/links/statnlp.html#NPchunk

Rasputin answered 23/4, 2013 at 2:7 Comment(1)
These are just packages that do NP chunking. For example, Mark Greenwood's Noun Phrase Chunker provides a GATE wrapper, but no wrapper for working with the Stanford NLP parse tree. I think one can at least do regex-based chunking: a custom chunk annotator could be added to the pipeline, say one that applies TokensRegex to the POS tags and is placed after "parse". The parse tree could then get one more node, "NNP", under which the chunked tokens sit. Hopefully someone has already done that for CoreNLP. - Ryswick
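The regex-over-POS-tags idea from the comment above can be sketched in Python. Note that this uses NLTK's `RegexpParser`, not CoreNLP's TokensRegex, so it only illustrates the technique. The grammar and the helper name `regex_np_chunks` are assumptions made for illustration, and the POS tags are supplied by hand here (in practice they would come from a tagger such as the CoreNLP "pos" annotator).

```python
from nltk.chunk import RegexpParser

# Hypothetical NP pattern: optional determiner, any adjectives, one or more nouns.
NP_GRAMMAR = "NP: {<DT>?<JJ.*>*<NN.*>+}"


def regex_np_chunks(tagged_tokens, grammar=NP_GRAMMAR):
    """Return the text of each NP chunk found in a list of (word, tag) pairs."""
    chunker = RegexpParser(grammar)
    tree = chunker.parse(tagged_tokens)
    return [
        " ".join(word for word, _tag in subtree.leaves())
        for subtree in tree.subtrees()
        if subtree.label() == "NP"
    ]


if __name__ == "__main__":
    # Hand-tagged tokens, assumed for the example.
    tagged = [
        ("a", "DT"), ("political", "JJ"), ("philosophy", "NN"),
        ("that", "WDT"), ("advocates", "VBZ"),
        ("voluntary", "JJ"), ("institutions", "NNS"),
    ]
    print(regex_np_chunks(tagged))
```

Unlike a constituency parse, this yields flat, non-nested chunks, which is usually what "base NP chunking" refers to.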

What you need is the output of constituency parsing in CoreNLP, which gives you the chunk information, e.g. verb phrases (VPs), noun phrases (NPs), etc. To the best of my knowledge, though, there is no method in CoreNLP that returns a list of chunks directly. This means you have to parse the output of the constituency parser yourself to extract the chunks.

For example, this is the output of the CoreNLP constituency parser for a sample sentence:

(ROOT (S ("" "") (NP (NNP Anarchism)) (VP (VBZ is) (NP (NP (DT a) (JJ political) (NN philosophy)) (SBAR (WHNP (WDT that)) (S (VP (VBZ advocates) (NP (NP (JJ self-governed) (NNS societies)) (VP (VBN based) (PP (IN on) (NP (JJ voluntary) (, ,) (JJ cooperative) (NNS institutions))))))))) (, ,) (S (VP (VBG rejecting) (NP (JJ unjust) (NN hierarchy))))) (. .)))

As you can see, the string contains NP and VP tags; you now have to extract the actual text of the chunks by parsing this string. Let me know if you find a method that returns the list of chunks directly.
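As a minimal sketch of that extraction step, the bracketed string can be read back into a tree with NLTK's `Tree.fromstring` and the subtrees filtered by label. This is one possible approach in Python rather than a CoreNLP feature; the helper name `phrases_from_bracketed` and the shortened sample parse are assumptions made for illustration.

```python
from nltk.tree import Tree


def phrases_from_bracketed(parse_str, labels):
    """Return the text of every constituent whose label is in `labels`."""
    tree = Tree.fromstring(parse_str)
    return [
        " ".join(subtree.leaves())
        for subtree in tree.subtrees()
        if subtree.label() in labels
    ]


if __name__ == "__main__":
    # Shortened, hand-written parse in the same bracketed format (assumed sample).
    parse_str = (
        "(ROOT (S (NP (NNP Anarchism)) "
        "(VP (VBZ is) (NP (DT a) (JJ political) (NN philosophy))) (. .)))"
    )
    print("NPs:", phrases_from_bracketed(parse_str, {"NP"}))
    print("VPs:", phrases_from_bracketed(parse_str, {"VP"}))
```

Because constituents nest, a full parse returns overlapping phrases (an NP inside a larger NP, for instance), so some filtering may be needed if only the innermost chunks are wanted.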

Protero answered 12/5, 2019 at 3:15 Comment(0)

Expanding on the answer from Pedram, the code below can be used:

from nltk.parse.corenlp import CoreNLPParser
nlp = CoreNLPParser('http://localhost:9000')  # Assuming CoreNLP server is running locally at port 9000


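def extract_phrase(tree, labels):
    # Collect the text of every subtree whose label is in `labels`.
    phrases = []
    for subtree in tree.subtrees():
        if subtree.label() in labels:
            phrases.append(' '.join(subtree.leaves()))
    return phrases


def get_chunks(sentence):
    # raw_parse returns an iterator of parse trees; take the first one.
    tree = next(nlp.raw_parse(sentence))
    # 'CC' is a POS tag (coordinating conjunction), so it adds bare conjunctions to the list.
    nps = extract_phrase(tree, ['NP', 'CC'])
    vps = extract_phrase(tree, ['VP'])
    return tree, nps, vps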


if __name__ == '__main__':
    dialog = [
        "Anarchism is a political philosophy that advocates self-governed societies based on voluntary cooperative institutions rejecting unjust hierarchy"
    ]
    for sentence in dialog:
        trees, nps, vps = get_chunks(sentence)
        print("\n\n")
        print("Sentence: ", sentence)
        print("Tree:\n", trees)
        print("Noun Phrases: ", nps)
        print("Verb Phrases: ", vps)

"""
Sentence:  Anarchism is a political philosophy that advocates self-governed societies based on voluntary cooperative institutions rejecting unjust hierarchy
Tree:
 (ROOT
  (S
    (NP (NN Anarchism))
    (VP
      (VBZ is)
      (NP
        (NP (DT a) (JJ political) (NN philosophy))
        (SBAR
          (WHNP (WDT that))
          (S
            (VP
              (VBZ advocates)
              (NP
                (ADJP (NN self) (HYPH -) (VBN governed))
                (NNS societies))
              (PP
                (VBN based)
                (PP
                  (IN on)
                  (NP
                    (NP
                      (JJ voluntary)
                      (JJ cooperative)
                      (NNS institutions))
                    (VP
                      (VBG rejecting)
                      (NP (JJ unjust) (NN hierarchy)))))))))))))
Noun Phrases:  ['Anarchism', 'a political philosophy that advocates self - governed societies based on voluntary cooperative institutions rejecting unjust hierarchy', 'a political philosophy', 'self - governed societies', 'voluntary cooperative institutions rejecting unjust hierarchy', 'voluntary cooperative institutions', 'unjust hierarchy']
Verb Phrases:  ['is a political philosophy that advocates self - governed societies based on voluntary cooperative institutions rejecting unjust hierarchy', 'advocates self - governed societies based on voluntary cooperative institutions rejecting unjust hierarchy', 'rejecting unjust hierarchy']

"""
Lunt answered 9/6, 2021 at 13:6 Comment(0)
