How to find the shortest dependency path between two words in Python?
I am trying to find the dependency path between two words in Python, given the dependency tree of a sentence.

For the sentence

Robots in popular culture are there to remind us of the awesomeness of unbound human agency.

I used practnlptools (https://github.com/biplab-iitb/practNLPTools) to get the dependency parsing result like:

nsubj(are-5, Robots-1)
xsubj(remind-8, Robots-1)
amod(culture-4, popular-3)
prep_in(Robots-1, culture-4)
root(ROOT-0, are-5)
advmod(are-5, there-6)
aux(remind-8, to-7)
xcomp(are-5, remind-8)
dobj(remind-8, us-9)
det(awesomeness-12, the-11)
prep_of(remind-8, awesomeness-12)
amod(agency-16, unbound-14)
amod(agency-16, human-15)
prep_of(awesomeness-12, agency-16)

which can also be visualized as a dependency tree (picture taken from https://demos.explosion.ai/displacy/).

The path length between "robots" and "are" is 1; the path length between "robots" and "awesomeness" is 4.

My question is: given the dependency parse result above, how can I get the dependency path, or the dependency path length, between two words?

From my search results so far, would nltk's ParentedTree help?

Thanks!

Homocentric answered 29/9, 2015 at 3:39
Hi @Sean, how can I get a similar tree structure for a sentence using nltk? I currently don't have any grammar, and using the 'treebank' grammar gives me an error about some words not being present in the production list. Thank you. – Euphuism
@Euphuism Actually, getting such a nice presentation using NLTK is hard. I suggest using spaCy. – Homocentric
Your problem can easily be conceived as a graph problem where we have to find the shortest path between two nodes.

To convert your dependency parse into a graph, we first have to deal with the fact that it comes as a string. You want to get this:

'nsubj(are-5, Robots-1)\nxsubj(remind-8, Robots-1)\namod(culture-4, popular-3)\nprep_in(Robots-1, culture-4)\nroot(ROOT-0, are-5)\nadvmod(are-5, there-6)\naux(remind-8, to-7)\nxcomp(are-5, remind-8)\ndobj(remind-8, us-9)\ndet(awesomeness-12, the-11)\nprep_of(remind-8, awesomeness-12)\namod(agency-16, unbound-14)\namod(agency-16, human-15)\nprep_of(awesomeness-12, agency-16)'

to look like this:

[('are-5', 'Robots-1'), ('remind-8', 'Robots-1'), ('culture-4', 'popular-3'), ('Robots-1', 'culture-4'), ('ROOT-0', 'are-5'), ('are-5', 'there-6'), ('remind-8', 'to-7'), ('are-5', 'remind-8'), ('remind-8', 'us-9'), ('awesomeness-12', 'the-11'), ('remind-8', 'awesomeness-12'), ('agency-16', 'unbound-14'), ('agency-16', 'human-15'), ('awesomeness-12', 'agency-16')]

This way you can feed the tuple list to a graph constructor from the networkx module, which will build a graph for you and also provides a neat method that returns the length of the shortest path between two given nodes.

Necessary imports

import re
import networkx as nx
from practnlptools.tools import Annotator

How to get your string in the desired tuple list format

annotator = Annotator()
text = """Robots in popular culture are there to remind us of the awesomeness of unbound human agency."""
dep_parse = annotator.getAnnotations(text, dep_parse=True)['dep_parse']

dp_list = dep_parse.split('\n')
pattern = re.compile(r'.+?\((.+?), (.+?)\)')
edges = []
for dep in dp_list:
    m = pattern.search(dep)
    if m:  # skip any line that does not match the relation(head, dependent) pattern
        edges.append((m.group(1), m.group(2)))

How to build the graph

graph = nx.Graph(edges)  # Well that was easy

How to compute shortest path length

print(nx.shortest_path_length(graph, source='Robots-1', target='awesomeness-12'))

This script will reveal that the shortest path given the dependency parse is actually of length 2, since you can get from Robots-1 to awesomeness-12 by going through remind-8:

1. xsubj(remind-8, Robots-1) 
2. prep_of(remind-8, awesomeness-12)

If you don't like this result, you might want to think about filtering some dependencies, in this case not allowing the xsubj dependency to be added to the graph.
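For example, here is a minimal sketch of that filtering, reusing the parse string from above. The relation name is captured by the regex so it can be checked against a blacklist (`EXCLUDED_DEPS` is a name introduced here for illustration). Skipping xsubj forces the path through are-5, so the shortest path length becomes 3.

```python
import re
import networkx as nx

# The raw parse from above, with the relation name kept so we can filter on it.
dep_parse = (
    'nsubj(are-5, Robots-1)\nxsubj(remind-8, Robots-1)\n'
    'amod(culture-4, popular-3)\nprep_in(Robots-1, culture-4)\n'
    'root(ROOT-0, are-5)\nadvmod(are-5, there-6)\naux(remind-8, to-7)\n'
    'xcomp(are-5, remind-8)\ndobj(remind-8, us-9)\n'
    'det(awesomeness-12, the-11)\nprep_of(remind-8, awesomeness-12)\n'
    'amod(agency-16, unbound-14)\namod(agency-16, human-15)\n'
    'prep_of(awesomeness-12, agency-16)'
)

EXCLUDED_DEPS = {'xsubj'}  # relations we do not want to become graph edges

pattern = re.compile(r'(\w+)\((.+?), (.+?)\)')
edges = []
for line in dep_parse.split('\n'):
    m = pattern.search(line)
    if m and m.group(1) not in EXCLUDED_DEPS:
        edges.append((m.group(2), m.group(3)))

graph = nx.Graph(edges)
print(nx.shortest_path_length(graph, source='Robots-1', target='awesomeness-12'))  # 3
print(nx.shortest_path(graph, source='Robots-1', target='awesomeness-12'))
```

With the xsubj edge removed, the only shortest path is Robots-1 → are-5 → remind-8 → awesomeness-12.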

Documentary answered 1/10, 2015 at 19:13
HugoMailhot's answer is great. I'll write something similar for spacy users who want to find the shortest dependency path between two words (whereas HugoMailhot's answer relies on practNLPTools).

The sentence:

Robots in popular culture are there to remind us of the awesomeness of unbound human agency.

has the following dependency tree:

[displaCy dependency tree visualization]

Here is the code to find the shortest dependency path between two words:

import networkx as nx
import spacy
nlp = spacy.load('en')

# https://spacy.io/docs/usage/processing-text
document = nlp(u'Robots in popular culture are there to remind us of the awesomeness of unbound human agency.', parse=True)

print('document: {0}'.format(document))

# Load spacy's dependency tree into a networkx graph
edges = []
for token in document:
    # FYI https://spacy.io/docs/api/token
    for child in token.children:
        edges.append(('{0}-{1}'.format(token.lower_,token.i),
                      '{0}-{1}'.format(child.lower_,child.i)))

graph = nx.Graph(edges)

# https://networkx.github.io/documentation/networkx-1.10/reference/algorithms.shortest_paths.html
print(nx.shortest_path_length(graph, source='robots-0', target='awesomeness-11'))
print(nx.shortest_path(graph, source='robots-0', target='awesomeness-11'))
print(nx.shortest_path(graph, source='robots-0', target='agency-15'))

Output:

4
['robots-0', 'are-4', 'remind-7', 'of-9', 'awesomeness-11']
['robots-0', 'are-4', 'remind-7', 'of-9', 'awesomeness-11', 'of-12', 'agency-15']
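If you also want to see the dependency relation on each hop, you can store `child.dep_` as an edge attribute when building the graph and read it back along the path. Here is a minimal sketch with the relevant edges hardcoded (in a real run they would come from the spacy loop above via `(token, child, child.dep_)`; the label values shown are my reading of the tree pictured above, so treat them as an assumption):

```python
import networkx as nx

# (head, child, dependency label) triples, hardcoded for illustration;
# in a real run these come from spacy as (token, child, child.dep_).
labeled_edges = [
    ('are-4', 'robots-0', 'nsubj'),
    ('are-4', 'remind-7', 'xcomp'),
    ('remind-7', 'of-9', 'prep'),
    ('of-9', 'awesomeness-11', 'pobj'),
]

graph = nx.Graph()
for head, child, dep in labeled_edges:
    graph.add_edge(head, child, dep=dep)  # label stored as an edge attribute

path = nx.shortest_path(graph, source='robots-0', target='awesomeness-11')
for u, v in zip(path, path[1:]):
    print('{0} -[{1}]- {2}'.format(u, graph[u][v]['dep'], v))
```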

To install spacy and networkx:

sudo pip install networkx 
sudo pip install spacy
sudo python -m spacy.en.download parser # will take 0.5 GB

Some benchmarks regarding spacy's dependency parsing: https://spacy.io/docs/api/


Lohner answered 23/1, 2017 at 23:56
L
4

This answer relies on Stanford CoreNLP to obtain the dependency tree of a sentence. It borrows some of the networkx code from HugoMailhot's answer.

Before running the code, one needs to:

  1. sudo pip install pycorenlp (python interface for Stanford CoreNLP)
  2. Download Stanford CoreNLP
  3. Start a Stanford CoreNLP server as follows:

    java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 50000
    

Then one can run the following code to find the shortest dependency path between two words:

import networkx as nx
from pycorenlp import StanfordCoreNLP
from pprint import pprint

nlp = StanfordCoreNLP('http://localhost:{0}'.format(9000))
def get_stanford_annotations(text, port=9000,
                             annotators='tokenize,ssplit,pos,lemma,depparse,parse'):
    output = nlp.annotate(text, properties={
        "timeout": "10000",
        "ssplit.newlineIsSentenceBreak": "two",
        'annotators': annotators,
        'outputFormat': 'json'
    })
    return output

# The code expects the document to contain exactly one sentence.
document =  'Robots in popular culture are there to remind us of the awesomeness of '\
            'unbound human agency.'
print('document: {0}'.format(document))

# Parse the text
annotations = get_stanford_annotations(document, port=9000,
                                       annotators='tokenize,ssplit,pos,lemma,depparse')
tokens = annotations['sentences'][0]['tokens']

# Load Stanford CoreNLP's dependency tree into a networkx graph
edges = []
dependencies = {}
for edge in annotations['sentences'][0]['basic-dependencies']:
    edges.append((edge['governor'], edge['dependent']))
    dependencies[(min(edge['governor'], edge['dependent']),
                  max(edge['governor'], edge['dependent']))] = edge

graph = nx.Graph(edges)
#pprint(dependencies)
#print('edges: {0}'.format(edges))

# Find the shortest path
token1 = 'Robots'
token2 = 'awesomeness'
for token in tokens:
    if token1 == token['originalText']:
        token1_index = token['index']
    if token2 == token['originalText']:
        token2_index = token['index']

path = nx.shortest_path(graph, source=token1_index, target=token2_index)
print('path: {0}'.format(path))

for token_id in path:
    token = tokens[token_id-1]
    token_text = token['originalText']
    print('Node {0}\ttoken_text: {1}'.format(token_id,token_text))

The output is:

document: Robots in popular culture are there to remind us of the awesomeness of unbound human agency.
path: [1, 5, 8, 12]
Node 1  token_text: Robots
Node 5  token_text: are
Node 8  token_text: remind
Node 12 token_text: awesomeness
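Since the code above already stores each edge in the `dependencies` dict, keyed by the sorted (governor, dependent) pair, you can also print the relation name on each hop of the path. A minimal sketch with a few edges hardcoded in the shape of CoreNLP's basic-dependencies JSON (in a real run they come from `annotations['sentences'][0]['basic-dependencies']`; the particular `dep` label values here are illustrative assumptions):

```python
import networkx as nx

# A few edges in CoreNLP's basic-dependencies shape, hardcoded for illustration.
basic_dependencies = [
    {'dep': 'nsubj', 'governor': 5, 'dependent': 1},
    {'dep': 'xcomp', 'governor': 5, 'dependent': 8},
    {'dep': 'nmod', 'governor': 8, 'dependent': 12},
]

edges = []
dependencies = {}
for edge in basic_dependencies:
    edges.append((edge['governor'], edge['dependent']))
    # key on the sorted index pair so the lookup works in either direction
    dependencies[(min(edge['governor'], edge['dependent']),
                  max(edge['governor'], edge['dependent']))] = edge

graph = nx.Graph(edges)
path = nx.shortest_path(graph, source=1, target=12)
for u, v in zip(path, path[1:]):
    dep = dependencies[(min(u, v), max(u, v))]['dep']
    print('{0} -[{1}]- {2}'.format(u, dep, v))
```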

Note that Stanford CoreNLP can be tested online: http://nlp.stanford.edu:8080/parser/index.jsp

This answer was tested with Stanford CoreNLP 3.6.0, pycorenlp 0.3.0 and Python 3.5 x64 on Windows 7 SP1 x64 Ultimate.

Lohner answered 25/1, 2017 at 22:51
