SyntaxNet creating tree to root verb
S

3

7

I am new to Python and the world of NLP. The recent announcement of Google's SyntaxNet intrigued me. However, I am having a lot of trouble understanding the documentation for both SyntaxNet and related tools (NLTK, etc.).

My goal: given an input such as "Wilbur kicked the ball", I would like to extract the root verb ("kicked") and the object it pertains to ("the ball").

I stumbled across "spacy.io" and this visualization seems to encapsulate what I am trying to accomplish: POS tag a string, and load it into some sort of tree structure so that I can start at the root verb and traverse the sentence.

I played around with syntaxnet/demo.sh and, as suggested in this thread, commented out the last couple of lines to get CoNLL output.

I then loaded this output in a Python script (kludged together myself, probably not correct):

import nltk
from nltk.corpus import ConllCorpusReader
columntypes = ['ignore', 'words', 'ignore', 'ignore', 'pos']
corp = ConllCorpusReader('/Users/dgourlay/development/nlp','input.conll', columntypes)

I see that I have access to corp.tagged_words(), but not to any relationships between the words. Now I am stuck! How can I load this corpus into a tree-like structure?

Any help is much appreciated!

Subinfeudate answered 17/5, 2016 at 8:33 Comment(1)
It seems to me that you have missed the parsing step. Once you preprocess your data (i.e., tokenize the raw text, POS tag it, and convert it to CoNLL format), you need to pass it to the parser (SyntaxNet in your case). Then you can do whatever extraction you want on the parser output. - Emersion
B
3

This may have been better as a comment, but I don't yet have the required reputation.

I haven't used the ConllCorpusReader before (would you consider uploading the file you are loading to a gist and providing a link? It would be much easier to test), but I wrote a blog post which may help with the tree parsing aspect: here.

In particular, you probably want to chunk each sentence. Chapter 7 of the NLTK book has some more information on this, but this is the example from my blog:

# This grammar is described in the paper by S. N. Kim,
# T. Baldwin, and M.-Y. Kan.
# Evaluating n-gram based evaluation metrics for automatic
# keyphrase extraction.
# Technical report, University of Melbourne, Melbourne 2010.
grammar = r"""
NBAR:
  # Nouns and Adjectives, terminated with Nouns
  {<NN.*|JJ>*<NN.*>}

NP:
  {<NBAR>}
    # Above, connected with in/of/etc...
  {<NBAR><IN><NBAR>}
"""

# postoks is a list of (word, tag) pairs, e.g. the output of nltk.pos_tag(tokens)
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(postoks)

Note: You could also use a Context Free Grammar (covered in Chapter 8).

Each chunked (or parsed) sentence (or in this example Noun Phrase, according to the grammar above) will be a subtree. To access these subtrees, we can use this function:

def leaves(tree):
  """Finds NP (noun phrase) leaf nodes of a chunk tree."""
  # NLTK 3 replaced the Tree.node attribute with the Tree.label() method
  for subtree in tree.subtrees(filter = lambda t: t.label() == 'NP'):
    yield subtree.leaves()

Each of the yielded objects will be a list of word-tag pairs. From there you can find the verb.

Next, you could play with the grammar above or the parser. Verbs split noun phrases (see this diagram in Chapter 7), so you can probably just access the first NP after a VBD.
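As a rough illustration of the "first NP after a VBD" idea, here is a minimal sketch that does not use the NLTK chunker at all. It scans a list of (word, tag) pairs and approximates a noun phrase as a run of DT, JJ, and NN* tags. The function name `first_np_after_verb` and the hand-tagged sample sentence are made up for this example, and the tag test is a simplification of the grammar above, not a replacement for it.

```python
def first_np_after_verb(tagged, verb_tag="VBD"):
    """Return the words of the first noun-phrase-like run after the first verb.

    `tagged` is a list of (word, tag) pairs. A "noun phrase" here is only an
    approximation: a contiguous run of DT, JJ and NN* tags.
    """
    # Find the position just after the first token tagged as the verb.
    try:
        start = next(i for i, (_, tag) in enumerate(tagged) if tag == verb_tag) + 1
    except StopIteration:
        return []  # no verb with that tag in the sentence

    phrase = []
    for word, tag in tagged[start:]:
        if tag in ("DT", "JJ") or tag.startswith("NN"):
            phrase.append(word)
        elif phrase:
            # The run has ended; stop at the first non-NP tag after it.
            break
    return phrase


# Hand-tagged sample (hypothetical input, not produced by a tagger here).
sentence = [("Wilbur", "NNP"), ("kicked", "VBD"), ("the", "DT"), ("ball", "NN")]
print(first_np_after_verb(sentence))
```

With a real chunk tree you would walk the children of the tree instead and pick the first NP subtree that follows the verb token.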

Sorry for the solution not being specific to your problem, but hopefully it is a helpful starting point. If you upload the file(s) I'll take another shot :)

Battlement answered 26/5, 2016 at 7:7 Comment(0)
C
2

What you are trying to do is find a dependency, namely dobj. I'm not yet familiar enough with SyntaxNet/Parsey to tell you exactly how to go about extracting that dependency from its output, but I believe this answer might help you. In short, you can configure Parsey to use CoNLL syntax for output, parse it into whatever you find easy to traverse, then look for the ROOT dependency to find the verb and the *obj dependencies to find its objects.
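A minimal sketch of that last step, assuming tab-separated CoNLL-X style rows where column 2 is the word form, column 7 the head index, and column 8 the dependency label. The sample rows are hand-written for illustration rather than real Parsey output, and all function names are hypothetical.

```python
from collections import defaultdict

# Hand-written CoNLL-style rows for illustration (not actual parser output).
SAMPLE = "\n".join([
    "1\tWilbur\t_\tNOUN\tNNP\t_\t2\tnsubj\t_\t_",
    "2\tkicked\t_\tVERB\tVBD\t_\t0\tROOT\t_\t_",
    "3\tthe\t_\tDET\tDT\t_\t4\tdet\t_\t_",
    "4\tball\t_\tNOUN\tNN\t_\t2\tdobj\t_\t_",
])


def read_sentence(conll_text):
    """Map token id -> dict with form, head id and dependency label."""
    tokens = {}
    for line in conll_text.splitlines():
        if not line.strip():
            continue
        cols = line.split("\t")
        tokens[int(cols[0])] = {
            "form": cols[1],
            "head": int(cols[6]),
            "deprel": cols[7],
        }
    return tokens


def children_of(tokens):
    """Map head id -> list of dependent ids (id 0 is the artificial root)."""
    children = defaultdict(list)
    for idx, tok in tokens.items():
        children[tok["head"]].append(idx)
    return children


def subtree(idx, children):
    """All token ids under idx (inclusive), in sentence order."""
    ids = [idx]
    for child in children.get(idx, []):
        ids.extend(subtree(child, children))
    return sorted(ids)


def root_and_objects(conll_text):
    """Return the root word and the full phrases of its *obj dependents."""
    tokens = read_sentence(conll_text)
    children = children_of(tokens)
    root = children[0][0]  # the token whose head is 0
    objects = [
        " ".join(tokens[i]["form"] for i in subtree(child, children))
        for child in children[root]
        if tokens[child]["deprel"].endswith("obj")
    ]
    return tokens[root]["form"], objects


print(root_and_objects(SAMPLE))
```

The children map is what makes the output traversable as a tree: start at the token whose head is 0 and follow its dependents downward.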

Chemise answered 22/5, 2016 at 13:10 Comment(1)
Thanks. I suppose the part I am stuck on is parsing the CoNLL output. As you can see in my example above, I've loaded it using the ConllCorpusReader, but I can't figure out how to traverse it as a tree from the root verb. - Subinfeudate
E
0

If you have parsed the raw text into CoNLL format with any parser, you can follow these steps to traverse the dependents of a node you are interested in:

  1. Build an adjacency matrix from the CoNLL output for the sentence.
  2. Look for the node you are interested in (the verb in your case) and extract its dependents (as indices) from the adjacency matrix.
  3. For each dependent, look up its dependency label in the 8th column of the CoNLL format.
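The steps above can be sketched roughly as follows. This assumes tab-separated rows with the token id in column 1, the head index in column 7, and the dependency label in column 8, listed in id order starting at 1. The sample rows are hand-written for illustration, and the helper names are hypothetical.

```python
# Hand-written CoNLL-style rows for illustration (not actual parser output).
SAMPLE_ROWS = [
    "1\tWilbur\t_\tNOUN\tNNP\t_\t2\tnsubj\t_\t_",
    "2\tkicked\t_\tVERB\tVBD\t_\t0\tROOT\t_\t_",
    "3\tthe\t_\tDET\tDT\t_\t4\tdet\t_\t_",
    "4\tball\t_\tNOUN\tNN\t_\t2\tdobj\t_\t_",
]


def parse_rows(lines):
    """Split each non-empty line into its tab-separated columns."""
    return [line.split("\t") for line in lines if line.strip()]


def adjacency_matrix(rows):
    """Step 1: adj[head][dependent] = 1; index 0 is the artificial root."""
    n = len(rows)
    adj = [[0] * (n + 1) for _ in range(n + 1)]
    for cols in rows:
        dependent, head = int(cols[0]), int(cols[6])
        adj[head][dependent] = 1
    return adj


def dependents(adj, rows, node):
    """Steps 2 and 3: dependents of `node`, each with its label (8th column)."""
    return [
        (rows[d - 1][1], rows[d - 1][7])  # rows are assumed to be in id order
        for d in range(1, len(adj))
        if adj[node][d]
    ]


rows = parse_rows(SAMPLE_ROWS)
adj = adjacency_matrix(rows)
# Pick the first token whose fine-grained POS tag (5th column) is a verb tag.
verb_id = next(int(cols[0]) for cols in rows if cols[4].startswith("VB"))
print(dependents(adj, rows, verb_id))
```

For long sentences a dictionary from head to dependents is lighter than a full matrix, but the matrix keeps the lookup in step 2 a simple row scan.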

PS: I can provide the code, but it would be better if you can code it yourself.

Emersion answered 26/5, 2016 at 14:46 Comment(0)
