Stanford POS tagger in Java usage
Asked Answered
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)
Mar 9, 2011 1:22:06 PM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: � (U+FFFD, decimal: 65533)

These are the errors I'm getting when I try to assign POS tags to sentences. I read the sentences from a file. Initially (for a few sentences) I don't get this error (i.e., "untokenizable"), but after reading some sentences the error arises. I'm using v2.0 (i.e., the 2009 release) of the POS tagger with the left3words model.

Brant answered 9/3, 2011 at 8:2 Comment(2)
It seems you are sharing internal APIs; please remove that and post your generic question and the relevant exception message, without class names, for security reasons.Blasphemy
Could you please post the solution to this?Frick

I agree with Yuval -- this is a character encoding problem -- but the commonest case is actually that the file is in a single-byte encoding such as ISO-8859-1 while the tagger is trying to read it as UTF-8. See the discussion of U+FFFD on Wikipedia.
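
One way to avoid that mismatch is to state the file's encoding explicitly when decoding it, rather than relying on the platform default. A minimal sketch, assuming the class name, method name, and file path are placeholders (not anything from the tagger's API):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class ReadWithCharset {
    // Read a whole file, decoding its bytes with the given charset as they
    // are read. If the file is really ISO-8859-1, pass
    // StandardCharsets.ISO_8859_1; if it is UTF-8, pass StandardCharsets.UTF_8.
    static String readAll(String path, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), charset))) {
            int ch;
            while ((ch = reader.read()) != -1) {
                sb.append((char) ch);
            }
        }
        return sb.toString();
    }
}
```

The key point is that the charset is given to the InputStreamReader, so bytes are decoded correctly at read time; once a byte sequence has been mis-decoded into U+FFFD, the original character is gone.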

Wende answered 10/3, 2011 at 4:39 Comment(4)
Actually, I'm not giving the file as a whole for tagging; I give sentences extracted from the file. The code I have used in my project is as follows: List<Sentence<? extends HasWord>> sentences = MaxentTagger.tokenizeText(new StringReader(string1)); for (Sentence<? extends HasWord> sentence : sentences) { Sentence<TaggedWord> tSentence = MaxentTagger.tagSentence(sentence); tag_s1_local = tSentence.toString(false); }Brant
But it looks like your input String has got U+FFFD characters in it, which just shouldn't normally happen, and seems to reflect an earlier problem with character encoding in whatever code produced that String. What do you get if you just print out the characters of the String one by one with charAt()?Wende
It prints the original characters for sentences that don't contain characters like '!', '"', etc., but when it encounters those characters the problem arises.Brant
If I'm interpreting your last comment correctly, this shows that the String contents are messed up before you even call the tagger. You need to fix that (read about character encodings).Wende

This looks like an encoding problem to me. Can you post the offending sentence? I couldn't find this in the documentation, but I would try checking if the file is in UTF-8 encoding.

Snowball answered 9/3, 2011 at 9:6 Comment(3)
I converted the sentences to UTF-8 after reading them from the file, and then tried to tag them. Initially there was no problem for a few sentences; only after a few sentences does the warning arise. The code is: String string1 = file_read.readLine(); byte[] utf81 = string1.getBytes("UTF-8"); string1 = new String(utf81, "UTF-8"); After this line, string1 is passed to the tagger as I have shown in the comment above.Brant
Reading your code and Christopher Manning's answer, I believe you are starting this the wrong way. Your input file should be in UTF-8 encoding to begin with. If it is in a single byte encoding, the tagger cannot recover the original characters.Snowball
Sometimes the easiest way is to just convert the input, but you don't need to. Any recognized encoding will work. But the way you're trying to deal with encodings looks completely wrong. In Java, if you give the encoding to an InputStreamReader, it will convert the data as it is read. You can't read the String with the default encoding (whatever that is...) and then try to convert it to what you want, since it'll be messed up when being read if the encodings don't match. You could read bytes via an InputStream and then convert to a Unicode String, but that is more painful than necessary.Wende
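
To see concretely why the getBytes round-trip in the earlier comment cannot help: encoding a String to UTF-8 bytes and immediately decoding those bytes back as UTF-8 is an identity operation, replacement characters included. A small illustration (the string literal is a made-up example of an already-corrupted input):

```java
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) {
        // A string that was already corrupted at read time: the original
        // character has been replaced by U+FFFD and is unrecoverable.
        String corrupted = "caf\uFFFD";

        // Encoding to UTF-8 and decoding back reproduces the same string;
        // it cannot restore the character that was lost earlier.
        byte[] utf8 = corrupted.getBytes(StandardCharsets.UTF_8);
        String roundTripped = new String(utf8, StandardCharsets.UTF_8);

        System.out.println(corrupted.equals(roundTripped)); // prints "true"
    }
}
```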

I ran into this issue as well. One way to test whether a character is tokenizable is to check Character.isIdentifierIgnorable(): an untokenizable character will return true, while tokenizable characters return false.
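
As a rough illustration of that check (whether it catches every character the tagger rejects will depend on your input; the two characters below are just examples I chose, not ones from the question's log):

```java
public class IgnorableCheck {
    public static void main(String[] args) {
        // Non-whitespace ISO control characters are identifier-ignorable...
        System.out.println(Character.isIdentifierIgnorable('\u0001')); // prints "true"
        // ...while ordinary letters are not.
        System.out.println(Character.isIdentifierIgnorable('a'));      // prints "false"
    }
}
```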

Radtke answered 11/7, 2014 at 21:55 Comment(0)

If you are reading content from DOC or PDF (Portable Document Format) files, use Apache Tika. It will extract your content, which might help.

Apache Tika

About tika

Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. It is written in Java, but includes a command-line version for use from other languages.

More information on Tika, the bug tracker, mailing lists, downloads and more are available at http://tika.apache.org/
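
A minimal sketch of extracting text with Tika's simple facade, assuming the Tika jars are on the classpath and "document.pdf" is a placeholder for your file:

```java
import java.io.File;
import org.apache.tika.Tika;

public class ExtractText {
    public static void main(String[] args) throws Exception {
        // Tika detects the document type (DOC, PDF, ...) and picks a parser.
        Tika tika = new Tika();
        String text = tika.parseToString(new File("document.pdf"));
        System.out.println(text);
    }
}
```

The extracted text is a plain Java String, which you can then feed to the tagger (paying attention to encodings, as discussed above).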

Grieg answered 1/8, 2013 at 6:49 Comment(0)
