Java Command Fails in NLTK Stanford POS Tagger
I request your kind help and assistance in solving the "Java Command Fails" error, which keeps being thrown whenever I try to tag an Arabic corpus of about 2 megabytes. I have searched the web and the Stanford POS tagger mailing list, but I did not find a solution. I read some posts on similar problems, and it was suggested that memory had run out. I am not sure about that, since I still have 19 GB of free space. I tried every possible solution offered, but the same error keeps showing.

I have an average command of Python and a good command of Linux. I am using Linux Mint 17 KDE 64-bit, Python 3.4, an NLTK alpha release, and the Stanford POS tagger model for Arabic. This is my code:

import nltk
from nltk.tag.stanford import POSTagger

# Point the wrapper at the Arabic model and the Stanford POS tagger jar
arabic_postagger = POSTagger(
    "/home/mohammed/postagger/models/arabic.tagger",
    "/home/mohammed/postagger/stanford-postagger.jar",
    encoding='utf-8')

print("Executing tag_corpus.py...\n")

# Import the corpus file
print("Importing data...\n")

with open("test.txt", 'r', encoding='utf-8') as f:
    text = f.read().strip()

print("Tagging the corpus. Please wait...\n")

tagged_corpus = arabic_postagger.tag(nltk.word_tokenize(text))

IF THE CORPUS SIZE IS LESS THAN 1 MB (≈ 100,000 WORDS), THERE IS NO ERROR. BUT WHEN I TRY TO TAG A 2 MB CORPUS, THE FOLLOWING ERROR MESSAGE IS SHOWN:

Traceback (most recent call last):
  File "/home/mohammed/experiments/current/tag_corpus2.py", line 17, in <module>
    tagged_lst = arabic_postagger.tag(nltk.word_tokenize(text))
  File "/usr/local/lib/python3.4/dist-packages/nltk-3.0a3-py3.4.egg/nltk/tag/stanford.py", line 59, in tag
    return self.batch_tag([tokens])[0]
  File "/usr/local/lib/python3.4/dist-packages/nltk-3.0a3-py3.4.egg/nltk/tag/stanford.py", line 81, in batch_tag
    stdout=PIPE, stderr=PIPE)
  File "/usr/local/lib/python3.4/dist-packages/nltk-3.0a3-py3.4.egg/nltk/internals.py", line 171, in java
    raise OSError('Java command failed!')
OSError: Java command failed!

I intend to tag 300 million words for my Ph.D. research project. If I keep tagging 100 thousand words at a time, I will have to repeat the task 3,000 times. It will kill me!

I really appreciate your kind help.

Coston answered 25/11, 2014 at 0:2 Comment(2)
BTW, you're being too greedy by doing tokenization and POS tagging in the same call for 300 million sentences. I would have done it in batches. Try 1 million per call, and then you're running only 300 times; see the sketch after these comments. – Parchment
@Parchment Thank you for your comment. In fact, I will do what you suggested, but I forgot to state it clearly in the question. – Coston
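
A minimal sketch of the batching suggested above (the 1,000,000-word batch size follows the comment; the chunks helper and result handling are illustrative assumptions, not part of the original question):

import nltk
from nltk.tag.stanford import POSTagger

arabic_postagger = POSTagger(
    "/home/mohammed/postagger/models/arabic.tagger",
    "/home/mohammed/postagger/stanford-postagger.jar",
    encoding='utf-8')

def chunks(tokens, size):
    # Yield successive slices of `size` tokens from the full token list.
    for i in range(0, len(tokens), size):
        yield tokens[i:i + size]

with open("test.txt", 'r', encoding='utf-8') as f:
    tokens = nltk.word_tokenize(f.read().strip())

# Tokenize once, then tag in separate Java calls of 1,000,000 words each.
tagged_corpus = []
for batch in chunks(tokens, 1000000):
    tagged_corpus.extend(arabic_postagger.tag(batch))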

After your import lines, add this line:

nltk.internals.config_java(options='-Xmx2G')

This increases the maximum heap size that the JVM allows the Stanford POS Tagger to use. The '-Xmx2G' flag raises the maximum allowable RAM to 2 GB instead of the default 512 MB. (Note that JVM flags are case-sensitive: it is -Xmx, not -xmx.)

See "What are the Xms and Xmx parameters when starting JVMs?" for more information.
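
Applied to the code in the question, the fix looks like this (a minimal sketch using the same paths as above):

import nltk
from nltk.tag.stanford import POSTagger

# Raise the JVM heap limit before the tagger is used
nltk.internals.config_java(options='-Xmx2G')

arabic_postagger = POSTagger(
    "/home/mohammed/postagger/models/arabic.tagger",
    "/home/mohammed/postagger/stanford-postagger.jar",
    encoding='utf-8')

with open("test.txt", 'r', encoding='utf-8') as f:
    text = f.read().strip()

tagged_corpus = arabic_postagger.tag(nltk.word_tokenize(text))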


If you're interested in how to debug your code, read on.

So we see that the command fails when handling a large amount of data, and the first thing to look at is how Java is initialized in NLTK before the Stanford tagger is called. From https://github.com/nltk/nltk/blob/develop/nltk/tag/stanford.py#L19:

from nltk.internals import find_file, find_jar, config_java, java, _java_options

We see that the nltk.internals package handles the different Java configurations and parameters.

Then we take a look at https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L65 and see that no value for Java's memory allocation is set by default.
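
A quick way to check this hook yourself (a sketch; config_java is the public helper in nltk.internals that stores these flags):

from nltk.internals import config_java

# Stores '-Xmx2G' in nltk.internals._java_options; every subsequent
# java(...) call made by the Stanford wrapper is launched with it.
# verbose=True makes NLTK report the java binary it found.
config_java(options='-Xmx2G', verbose=True)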

Parchment answered 25/11, 2014 at 1:3 Comment(5)
@Parchment Thank you so much for your kind reply. I did what you suggested. Unfortunately, the same problem persists: after waiting for about a minute, the same error is thrown again. – Coston
How much RAM do you have on the machine? Have you tried a bigger heap size to see how many more tokens you can tag? And don't be greedy: don't tokenize and tag in the same call. – Parchment
@Parchment, I have 4 GB of RAM. When I tag 100,000 words, it works fine. When I try to tag 200,000 words, it throws the error. I am not trying to be greedy by tokenizing and tagging at the same time; I simply don't know how to do them separately. Could you please show me how? – Coston
First try to tag 200,000 words with -Xmx4G. You need to learn how to use the Java options if you want to use NLTK's wrapper for the Stanford tagger. Read #14763579 – Parchment
@Parchment, I tried to tag 200k words, but the Java command failed again. I tagged 100k words and it worked fine. I kept increasing by 10k words at a time: 140k words still worked, but 150k words made the Java command fail again. Now I will split the large files into small slices of 140k words and process 10 slices at a time using the glob module (see the sketch after these comments). Thank you for your kind help. Your suggestions were so insightful. – Coston
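
A sketch of that slicing workflow (the 140,000-word slice size comes from the comment above; the file names and the tagging loop are illustrative assumptions):

import glob
import nltk
from nltk.tag.stanford import POSTagger

arabic_postagger = POSTagger(
    "/home/mohammed/postagger/models/arabic.tagger",
    "/home/mohammed/postagger/stanford-postagger.jar",
    encoding='utf-8')

# Step 1: tokenize once and write slices of 140,000 words to disk.
with open("test.txt", 'r', encoding='utf-8') as f:
    tokens = nltk.word_tokenize(f.read().strip())

slice_size = 140000
for n, i in enumerate(range(0, len(tokens), slice_size)):
    with open("slice_%04d.txt" % n, 'w', encoding='utf-8') as out:
        out.write(' '.join(tokens[i:i + slice_size]))

# Step 2: tag each slice in its own Java call and save the result.
for path in sorted(glob.glob("slice_*.txt")):
    with open(path, 'r', encoding='utf-8') as f:
        tagged = arabic_postagger.tag(f.read().split())
    with open(path + ".tagged", 'w', encoding='utf-8') as out:
        out.write(' '.join('%s/%s' % (word, tag) for word, tag in tagged))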

As of version 3.9.2, the StanfordTagger class constructor accepts a java_options parameter that can be used to set the memory for the POSTagger and also the NERTagger.

E.g.:

pos_tagger = StanfordPOSTagger(
    'models/english-bidirectional-distsim.tagger',
    path_to_jar='stanford-postagger-3.9.2.jar',
    java_options='-mx3g')

I found that the answer by @Parchment did not work for me, because StanfordTagger was overriding my memory setting with its built-in default of 1000m. Perhaps calling nltk.internals.config_java after initializing StanfordPOSTagger would work, but I haven't tried that.

Keystone answered 12/3, 2019 at 20:58 Comment(0)
