Linguistic tagger incorrectly tagging as 'OtherWord'

I've been using NSLinguisticTagger with sentences and have been encountering a strange issue with sentences such as 'I am hungry' or 'I am drunk'. Whilst one would expect 'I' to be tagged as a pronoun, 'am' as a verb and 'hungry' as an adjective, they are not. Rather they are all tagged as OtherWord.

Is there something I'm doing incorrectly?

NSString *input = @"I am hungry";
NSLinguisticTaggerOptions options = NSLinguisticTaggerOmitWhitespace;
NSLinguisticTagger *tagger = [[NSLinguisticTagger alloc] initWithTagSchemes:[NSLinguisticTagger availableTagSchemesForLanguage:@"en"] options:options];
tagger.string = input;

[tagger enumerateTagsInRange:NSMakeRange(0, input.length) scheme:NSLinguisticTagSchemeNameTypeOrLexicalClass options:options usingBlock:^(NSString *tag, NSRange tokenRange, NSRange sentenceRange, BOOL *stop) {
    NSString *token = [input substringWithRange:tokenRange];
    NSString *lemma = [tagger tagAtIndex:tokenRange.location
                                  scheme:NSLinguisticTagSchemeLemma
                              tokenRange:NULL
                           sentenceRange:NULL];
    NSLog(@"%@ (%@) : %@\n", token, lemma, tag);
}];

And the output is:

I ((null)) : OtherWord
am ((null)) : OtherWord
hungry ((null)) : OtherWord
Bouzoun answered 27/3, 2015 at 22:33 Comment(7)
Very strange. I am playing around with the sentence: inserting "very" results in a correctly tagged sentence, inserting "not" does not, yet inserting "not very" does... yikes. I also tried German: some two-word sentences don't work, though single words seem to (if not always as well as expected). It is definitely a strange issue. - Feathers
And "we are hungry" works, "he is hungry" works, just "I am hungry" does not :/ Using the adjective "thirsty" works for all three. - Feathers
@Feathers Yeah, seeing the same thing. Seems that inserting an adjective/adverb causes it to tag correctly, and any other pronoun seems fine too. - Bouzoun
@Feathers Also, are you seeing that 'thirsty' is tagged as a number? - Bouzoun
No, that is correctly tagged on my side. But I just looked at the tag probabilities for "am" with "hungry" in contrast to "thirsty": it is 100% OtherWord, not even a slight chance for Verb... - Feathers
@Feathers For me it's tagged as 'Number' when using enumerateTagsInRange, but when checking for possible tags using possibleTagsAtIndex it's an 'Adjective'. Also seeing the same 100% probability for OtherWord. - Bouzoun
Let us continue this discussion in chat. - Feathers

After quite some time in chat we found the issue:

The sentence does not contain enough information to determine its language.

To fix this you can either:

Add a demo sentence in your language of choice after your actual sentence. That should ensure your preferred language gets detected.

OR

Tell the tagger what language to use: add the line

[tagger setOrthography:[NSOrthography orthographyWithDominantScript:@"Latn" languageMap:@{@"Latn" : @[@"en"]}] range:NSMakeRange(0, input.length)];

before the enumerate call. That way you explicitly tell the tagger what language you want the text to be in, in this case English (en), written in the Latin script (Latn).

If you don't know the language for sure, it may be useful to apply either of these methods only as a fallback when the words get tagged as OtherWord, meaning the language could not be detected.
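
A minimal Swift sketch of that fallback, assuming English is the language you want to force when detection fails. The helpers `allOtherWord` and `tagWords` are hypothetical names for illustration, not part of NSLinguisticTagger:

```swift
import Foundation

// Hypothetical helper: true when every tag is OtherWord, which is treated
// here as the sign that the language could not be detected.
func allOtherWord(_ tags: [String]) -> Bool {
    return !tags.isEmpty && tags.allSatisfy { $0 == NSLinguisticTag.otherWord.rawValue }
}

// Hypothetical helper: tags each word, optionally forcing a language first.
func tagWords(in text: String, forcing language: String? = nil) -> [String] {
    let options: NSLinguisticTagger.Options = [.omitWhitespace, .omitPunctuation]
    let tagger = NSLinguisticTagger(tagSchemes: [.nameTypeOrLexicalClass],
                                    options: Int(options.rawValue))
    tagger.string = text
    let range = NSRange(location: 0, length: text.utf16.count)
    if let language = language {
        // Same idea as the Objective-C line above: Latin script mapped to one language.
        let orthography = NSOrthography(dominantScript: "Latn",
                                        languageMap: ["Latn": [language]])
        tagger.setOrthography(orthography, range: range)
    }
    var tags: [String] = []
    tagger.enumerateTags(in: range, unit: .word, scheme: .nameTypeOrLexicalClass,
                         options: options) { tag, _, _ in
        if let tag = tag { tags.append(tag.rawValue) }
    }
    return tags
}

let input = "I am hungry"
var tags = tagWords(in: input)
if allOtherWord(tags) {
    // Automatic detection appears to have failed; retry with English set explicitly.
    tags = tagWords(in: input, forcing: "en")
}
print(tags)
```

This keeps automatic detection as the default and only overrides it when the result looks unusable.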

Feathers answered 27/3, 2015 at 23:53 Comment(8)
You are welcome, it was a fun brain teaser to look deep into new things ;) - Feathers
I can't seem to get either of these solutions to fix the issue. It still returns (OtherWord). I even tried it with the default sentence in the OP. - Naashom
@WillVonUllrich Are you using Swift or Objective-C? What is your sample input? What is your language? I just translated the code to Swift to be able to test it quicker, and it still works. - Feathers
Hmm... ObjC, "I have a cat", English, NSLinguisticTagSchemeNameTypeOrLexicalClass, and NSLinguisticTaggerOmitWhitespace. - Naashom
@WillVonUllrich Hmm, I cannot really help you. I tried the code just now, both with your input and the original one, and both times it gets correctly tagged. - Feathers
Hmm... ok, I'll see what else I can do. - Naashom
One last note: printing out the available schemes on my phone (6s running 11.2) shows ONLY the Language, Script, and TokenType schemes... no lexical... wtf - Naashom
@Feathers Do you have this successfully working on a physical device? I STILL cannot get the device to print anything other than (OtherWord), nor can I get it to list the Lexical scheme among the available schemes... I'm on the latest Xcode and an iPhone 6s with iOS 11.2. - Naashom

The API is a confusing mess, so keep in mind there are multiple ways to pass in schemes or options to try and express the parts-of-speech data you want.

Even beyond that, the schemes you pass into the tagger have to match the ones you pass into the enumeration method on the tagger or it'll return OtherWord or it won't enumerate at all.

If you specify .nameType as the scheme, then it will return OtherWord for all words that are not names of people, places, etc.

If you change it to .nameTypeOrLexicalClass then it will describe each word's part-of-speech that isn't a name as well as list the name types.

This also depends on the unit given: for the paragraph unit, for example, it will return Other, since a whole paragraph can't be described as a Noun or anything similar.
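
Since availability depends on both the unit and the language, it can help to ask the tagger which schemes it reports for each unit. A small sketch, assuming English ("en") as the language; the `units` list is only for illustration, and the output will vary by OS version and device:

```swift
import Foundation

// Pair each unit with a readable name so the printed output is easy to scan.
let units: [(name: String, unit: NSLinguisticTaggerUnit)] = [
    ("word", .word),
    ("sentence", .sentence),
    ("paragraph", .paragraph),
    ("document", .document)
]

for (name, unit) in units {
    // Ask which tag schemes are reported for this unit in English.
    let schemes = NSLinguisticTagger.availableTagSchemes(for: unit, language: "en")
    print("\(name): \(schemes.map { $0.rawValue })")
}
```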


Example 1

Let's say you instantiate your tagger like this:

let text = "Frank said 'Don't take the easy way out. Build it slowly and meaningfully. Work multiple jobs to pay the bills if you need to. Take breaks and then pick it back up and keep moving. This is what it is; don't dilute the experience.'"
let options: NSLinguisticTagger.Options = [.omitWhitespace, .omitPunctuation]
let tagger = NSLinguisticTagger(tagSchemes: [.nameType, .lexicalClass, .lemma], options: Int(options.rawValue))
tagger.string = text
let range = NSRange(location: 0, length: text.utf16.count) // NSRange counts UTF-16 code units
tagger.enumerateTags(in: range, unit: .word, scheme: .nameType, options: options) { tag, tokenRange, _ in
    guard let tag = tag else { return }
    let token = (text as NSString).substring(with: tokenRange)
    print("{token: \(token), tag: \(tag.rawValue), range: \(tokenRange)}")
}

That will print out something like

{token: Frank, tag: PersonalName, range: {0, 5}}
{token: said, tag: OtherWord, range: {6, 4}}
{token: Do, tag: OtherWord, range: {12, 2}}
{token: n't, tag: OtherWord, range: {14, 3}}
{token: take, tag: OtherWord, range: {18, 4}}

Even though your tagger was instantiated with the .lexicalClass scheme, your enumeration request was called only with the .nameType scheme. This means the tagger has the lexical class information available, but you aren't enumerating over it.

The API is a bit confusing in this way, but it's because the enumeration can be run over the same tagger in multiple ways with the tagger acting as the data set.

To make it even more confusing, if you tried to change the enumeration's scheme to .nameTypeOrLexicalClass, it wouldn't enumerate anything. This is because you didn't pass .nameTypeOrLexicalClass in the tagger's tag schemes and instead passed .nameType and .lexicalClass separately. Seriously, this is a terrible API design, but it is what it is.
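
One way to catch this mismatch early is to check the tagger's `tagSchemes` property before enumerating. A minimal sketch; the `canEnumerate` helper is a hypothetical name, not part of the API:

```swift
import Foundation

// Hypothetical helper: reports whether the tagger was created with the
// scheme you are about to enumerate.
func canEnumerate(_ scheme: NSLinguisticTagScheme, with tagger: NSLinguisticTagger) -> Bool {
    return tagger.tagSchemes.contains(scheme)
}

// Same schemes as in Example 1.
let tagger = NSLinguisticTagger(tagSchemes: [.nameType, .lexicalClass, .lemma], options: 0)

// .nameType was passed to the initializer; .nameTypeOrLexicalClass was not.
print(canEnumerate(.nameType, with: tagger))
print(canEnumerate(.nameTypeOrLexicalClass, with: tagger))
```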


Example 2

If you instead want both the names and the lexical classes, you need to explicitly use the .nameTypeOrLexicalClass scheme, both when creating the tagger and when enumerating:

let text = "Frank said 'Don't take the easy way out. Build it slowly and meaningfully. Work multiple jobs to pay the bills if you need to. Take breaks and then pick it back up and keep moving. This is what it is; don't dilute the experience.'"
let options: NSLinguisticTagger.Options = [.omitWhitespace, .omitPunctuation]
let tagger = NSLinguisticTagger(tagSchemes: [.nameTypeOrLexicalClass, .lemma], options: Int(options.rawValue))
tagger.string = text
let range = NSRange(location: 0, length: text.utf16.count) // NSRange counts UTF-16 code units
tagger.enumerateTags(in: range, unit: .word, scheme: .nameTypeOrLexicalClass, options: options) { tag, tokenRange, _ in
    guard let tag = tag else { return }
    let token = (text as NSString).substring(with: tokenRange)
    print("{token: \(token), tag: \(tag.rawValue), range: \(tokenRange)}")
}

That will print out something like

{token: Frank, tag: PersonalName, range: {0, 5}}
{token: said, tag: Verb, range: {6, 4}}
{token: Do, tag: Verb, range: {12, 2}}
{token: n't, tag: Adverb, range: {14, 3}}
{token: take, tag: Verb, range: {18, 4}}
{token: the, tag: Determiner, range: {23, 3}}


Language

For extra debugging, you can set the language as well or check if the scheme you want is even available at all in the language you're trying to tag (since it depends largely on the language):

// check that it determined the language correctly
print("dominant language is: \(tagger.dominantLanguage ?? "unknown")") // dominantLanguage is optional

// or set the language specifically
tagger.setOrthography(NSOrthography.defaultOrthography(forLanguage: "en-US"), range: range)

// or list which tag schemes are even available for this language
let tagSchemes = NSLinguisticTagger.availableTagSchemes(forLanguage: "en")
print(tagSchemes)
Tube answered 19/4, 2018 at 14:4 Comment(0)
