Coreference resolution in Python NLTK using Stanford CoreNLP

Stanford CoreNLP provides coreference resolution, as mentioned here; this thread and this one also provide some insights about its implementation in Java.

However, I am using Python and NLTK, and I am not sure how I can use the coreference resolution functionality of CoreNLP in my Python code. I have been able to set up StanfordParser in NLTK; this is my code so far.

from nltk.parse.stanford import StanfordDependencyParser
stanford_parser_dir = 'stanford-parser/'
eng_model_path = stanford_parser_dir  + "stanford-parser-models/edu/stanford/nlp/models/lexparser/englishRNN.ser.gz"
my_path_to_models_jar = stanford_parser_dir  + "stanford-parser-3.5.2-models.jar"
my_path_to_jar = stanford_parser_dir  + "stanford-parser.jar"
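
For context, a minimal sketch of how these paths plug into NLTK's StanfordDependencyParser (this only gives me dependency parses, not coreference, which is the gap I'm asking about):

# Sketch: wiring the paths above into NLTK's Stanford wrapper.
# This produces dependency parses only; there is no coreference output here.
parser = StanfordDependencyParser(path_to_jar=my_path_to_jar,
                                  path_to_models_jar=my_path_to_models_jar,
                                  model_path=eng_model_path)
dep_graph = next(parser.raw_parse("Stanford CoreNLP provides coreference resolution."))
print(list(dep_graph.triples()))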

How can I use the coreference resolution functionality of CoreNLP in Python?

Harelda answered 9/9, 2016 at 11:12 Comment(0)
R
10

As mentioned by @Igor, you can try the Python wrapper implemented in this GitHub repo: https://github.com/dasmith/stanford-corenlp-python

This repo contains two main files: corenlp.py and client.py.

Make the following changes to get CoreNLP working:

  1. In corenlp.py, change the path of the CoreNLP folder: set it to where the CoreNLP folder lives on your local machine, at line 144 of corenlp.py:

    if not corenlp_path: corenlp_path = <path to the corenlp file>

  2. The jar file version numbers in corenlp.py may not match the CoreNLP version you have. Set them accordingly at line 135 of corenlp.py:

    jars = ["stanford-corenlp-3.4.1.jar", "stanford-corenlp-3.4.1-models.jar", "joda-time.jar", "xom.jar", "jollyday.jar"]

Here, replace 3.4.1 with the jar version that you downloaded.

  3. Run the command:

    python corenlp.py

This will start a server.

  4. Now run the main client program:

    python client.py

This provides a dictionary and you can access the coref using 'coref' as the key:

For example: John is a Computer Scientist. He likes coding.

{
     "coref": [[[["a Computer Scientist", 0, 4, 2, 5], ["John", 0, 0, 0, 1]], [["He", 1, 0, 0, 1], ["John", 0, 0, 0, 1]]]]
}
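
Each chain in the 'coref' list pairs a mention with its antecedent, and each entry appears to have the form [text, sentence index, head index, start index, end index]. As a small sketch, assuming result holds the dictionary shown above, you can walk that structure like this:

# Sketch: walking the 'coref' structure, assuming result is the dictionary
# returned for "John is a Computer Scientist. He likes coding."
for chain in result.get('coref', []):
    for mention, antecedent in chain:
        # each entry looks like [text, sentence index, head index, start index, end index]
        print("'%s' refers to '%s'" % (mention[0], antecedent[0]))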

I have tried this on Ubuntu 16.04. Use Java version 7 or 8.

Richela answered 16/3, 2017 at 13:9 Comment(1)
As of today, the latest commit of that wrapper is from 2014. - Zaratite

Stanford CoreNLP now has an official Python binding called StanfordNLP, as you can read on the StanfordNLP website.

The native API doesn't seem to support the coref processor yet, but you can use the CoreNLPClient interface to call the "standard" CoreNLP (the original Java software) from Python.

So, after following the instructions to set up the Python wrapper here, you can get the coreference chains like this:

from stanfordnlp.server import CoreNLPClient

text = 'Barack was born in Hawaii. His wife Michelle was born in Milan. He says that she is very smart.'
print(f"Input text: {text}")

# set up the client
client = CoreNLPClient(properties={'annotators': 'coref', 'coref.algorithm' : 'statistical'}, timeout=60000, memory='16G')

# submit the request to the server
ann = client.annotate(text)    

mychains = list()
chains = ann.corefChain
for chain in chains:
    mychain = list()
    # Loop through every mention of this chain
    for mention in chain.mention:
        # Get the sentence in which this mention is located, and get the words which are part of this mention
        # (we can have more than one word, for example, a mention can be a pronoun like "he", but also a compound noun like "His wife Michelle")
        words_list = ann.sentence[mention.sentenceIndex].token[mention.beginIndex:mention.endIndex]
        #build a string out of the words of this mention
        ment_word = ' '.join([x.word for x in words_list])
        mychain.append(ment_word)
    mychains.append(mychain)

for chain in mychains:
    print(' <-> '.join(chain))
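
As a side note, CoreNLPClient also works as a context manager, which starts the background Java server on entry and shuts it down on exit; a minimal variant of the setup above:

from stanfordnlp.server import CoreNLPClient

text = 'Barack was born in Hawaii. His wife Michelle was born in Milan. He says that she is very smart.'

# Same properties as above; the with-block starts the Java server on entry
# and stops it on exit, so no CoreNLP process keeps running afterwards.
with CoreNLPClient(properties={'annotators': 'coref', 'coref.algorithm': 'statistical'},
                   timeout=60000, memory='16G') as client:
    ann = client.annotate(text)
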
Zaratite answered 18/7, 2019 at 19:5 Comment(0)

stanfordcorenlp, a relatively new wrapper, may work for you.

Suppose the text is "Barack Obama was born in Hawaii. He is the president. Obama was elected in 2008."


The code:

# coding=utf-8

import json
from stanfordcorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP(r'G:\JavaLibraries\stanford-corenlp-full-2017-06-09', quiet=False)
props = {'annotators': 'coref', 'pipelineLanguage': 'en'}

text = 'Barack Obama was born in Hawaii.  He is the president. Obama was elected in 2008.'
result = json.loads(nlp.annotate(text, properties=props))

num, mentions = list(result['corefs'].items())[0]  # first coreference chain (items() is a view in Python 3)
for mention in mentions:
    print(mention)

Every "mention" above is a Python dict like this:

{
  "id": 0,
  "text": "Barack Obama",
  "type": "PROPER",
  "number": "SINGULAR",
  "gender": "MALE",
  "animacy": "ANIMATE",
  "startIndex": 1,
  "endIndex": 3,
  "headIndex": 2,
  "sentNum": 1,
  "position": [
    1,
    1
  ],
  "isRepresentativeMention": true
}
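
If you want plain text chains like the other answers produce, here is a small sketch, assuming result is the parsed JSON from the call above:

# Sketch: one list of mention texts per coreference chain,
# e.g. something like ['Barack Obama', 'He', 'Obama'] for the example sentence.
chains = [[mention['text'] for mention in mentions]
          for mentions in result['corefs'].values()]

for chain in chains:
    print(' <-> '.join(chain))
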
Razor answered 15/1, 2018 at 11:12 Comment(1)
How do we create { "coref": [[[["a Computer Scientist", 0, 4, 2, 5], ["John", 0, 0, 0, 1]], [["He", 1, 0, 0, 1], ["John", 0, 0, 0, 1]]]] } from your JSON output? - Shaftesbury

Maybe this works for you: https://github.com/dasmith/stanford-corenlp-python. If not, you can try to combine Python/NLTK and the Java CoreNLP yourself using http://www.jython.org/

Pettifogging answered 15/9, 2016 at 15:16 Comment(1)
Tried it, doesn't work. Jython is a Java implementation of Python (jython.org/docs/tutorial/indexprogress.html); I can't use it since many of my modules use NLTK, which I cannot reimplement in Java. Any other suggestion? - Harelda
