The API is a confusing mess, so keep in mind that there are multiple ways to pass in schemes or options to express the part-of-speech data you want.
Beyond that, the schemes you pass into the tagger have to match the ones you pass into the tagger's enumeration method, or it will return OtherWord or not enumerate at all.
If you specify .nameType as the scheme, it will return OtherWord for all words that are not names of people, places, etc.
If you change it to .nameTypeOrLexicalClass, it will describe the part of speech of each word that isn't a name, as well as list the name types.
This also depends on the unit given; e.g. for a paragraph unit it will show Other, since it can't describe a whole paragraph as a Noun or something similar.
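To see how the unit changes what the tagger can tell you, a minimal sketch like the one below lists the schemes Foundation reports for each unit. It assumes English ("en") and macOS 10.13 / iOS 11 or later; the `units` array is only an illustrative name, and the exact lists printed depend on the OS version.

```swift
import Foundation

// Illustrative list of the four tagger units (the array name is an assumption, not API).
let units: [(name: String, unit: NSLinguisticTaggerUnit)] = [
    ("word", .word),
    ("sentence", .sentence),
    ("paragraph", .paragraph),
    ("document", .document)
]

for (name, unit) in units {
    // Ask which tag schemes are available for this unit in English.
    let schemes = NSLinguisticTagger.availableTagSchemes(for: unit, language: "en")
    print("\(name): \(schemes.map { $0.rawValue })")
}
```

If a scheme such as lexical class is missing from the list for a larger unit, that would be consistent with the generic tags described above.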
Example 1
Let's say you instantiate your tagger like this:
let text = "Frank said 'Don't take the easy way out. Build it slowly and meaningfully. Work multiple jobs to pay the bills if you need to. Take breaks and then pick it back up and keep moving. This is what it is; don't dilute the experience.'"
let options: NSLinguisticTagger.Options = [.omitWhitespace, .omitPunctuation]
let tagger = NSLinguisticTagger(tagSchemes: [.nameType, .lexicalClass, .lemma], options: Int(options.rawValue))
tagger.string = text
// NSRange counts UTF-16 code units, so use utf16.count rather than utf8.count
let range = NSRange(location: 0, length: text.utf16.count)
tagger.enumerateTags(in: range, unit: .word, scheme: .nameType, options: options) { tag, tokenRange, _ in
    guard let tag = tag else { return }
    let token = (text as NSString).substring(with: tokenRange)
    print("{token: \(token), tag: \(tag.rawValue), range: \(tokenRange)}")
}
That will print out something like
{token: Frank, tag: PersonalName, range: {0, 5}}
{token: said, tag: OtherWord, range: {6, 4}}
{token: Do, tag: OtherWord, range: {12, 2}}
{token: n't, tag: OtherWord, range: {14, 3}}
{token: take, tag: OtherWord, range: {18, 4}}
Even though your tagger was instantiated with the .lexicalClass scheme, your enumeration request was called only with the .nameType scheme. This means the tagger has the lexical class information available, but you aren't enumerating over it.
The API is a bit confusing in this way, but it's because the enumeration can be run over the same tagger in multiple ways with the tagger acting as the data set.
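As a rough illustration of the tagger acting as the data set, the sketch below runs two enumerations over one tagger, once per scheme. `collectTags` is a hypothetical helper written for this example, not part of Foundation, and the sample text is shortened; the actual tags returned depend on the system's language models.

```swift
import Foundation

let text = "Frank said 'Don't take the easy way out.'"
let options: NSLinguisticTagger.Options = [.omitWhitespace, .omitPunctuation]
let tagger = NSLinguisticTagger(tagSchemes: [.nameType, .lexicalClass], options: Int(options.rawValue))
tagger.string = text
let range = NSRange(location: 0, length: text.utf16.count)

// Hypothetical helper: collects (token, tag) pairs for a single scheme.
func collectTags(_ scheme: NSLinguisticTagScheme) -> [(token: String, tag: String)] {
    var result: [(token: String, tag: String)] = []
    tagger.enumerateTags(in: range, unit: .word, scheme: scheme, options: options) { tag, tokenRange, _ in
        guard let tag = tag else { return }
        let token = (text as NSString).substring(with: tokenRange)
        result.append((token: token, tag: tag.rawValue))
    }
    return result
}

// Same tagger, two different views of the data.
let nameTags = collectTags(.nameType)
let lexicalTags = collectTags(.lexicalClass)
print(nameTags)
print(lexicalTags)
```

Both calls reuse the same configured tagger; only the scheme passed to the enumeration changes.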
To make it even more confusing, if you tried to change the enumeration's scheme to .nameTypeOrLexicalClass, it wouldn't enumerate anything. This is because you didn't pass .nameTypeOrLexicalClass in your tagger's tag schemes and instead passed .nameType and .lexicalClass separately. Seriously, this is a terrible API design, but it is what it is.
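One way to catch this mismatch early is to check the tagger's configured schemes before enumerating. This is a minimal sketch using the tagger's `tagSchemes` property; `isConfigured` is a hypothetical helper name, not part of Foundation.

```swift
import Foundation

// Hypothetical helper: true only if the tagger was created with exactly this scheme.
func isConfigured(_ tagger: NSLinguisticTagger, for scheme: NSLinguisticTagScheme) -> Bool {
    return tagger.tagSchemes.contains(scheme)
}

let tagger = NSLinguisticTagger(tagSchemes: [.nameType, .lexicalClass, .lemma], options: 0)

// .nameType was passed at creation; the combined scheme was not.
print(isConfigured(tagger, for: .nameType))
print(isConfigured(tagger, for: .nameTypeOrLexicalClass))
```

A guard like this before calling the enumeration method turns a silent empty result into something you can log or assert on.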
Example 2
If you instead want both the names and the lexical classes, you need to explicitly use the .nameTypeOrLexicalClass scheme, both when creating the tagger and when enumerating:
let text = "Frank said 'Don't take the easy way out. Build it slowly and meaningfully. Work multiple jobs to pay the bills if you need to. Take breaks and then pick it back up and keep moving. This is what it is; don't dilute the experience.'"
let options: NSLinguisticTagger.Options = [.omitWhitespace, .omitPunctuation]
let tagger = NSLinguisticTagger(tagSchemes: [.nameTypeOrLexicalClass, .lemma], options: Int(options.rawValue))
tagger.string = text
// NSRange counts UTF-16 code units, so use utf16.count rather than utf8.count
let range = NSRange(location: 0, length: text.utf16.count)
tagger.enumerateTags(in: range, unit: .word, scheme: .nameTypeOrLexicalClass, options: options) { tag, tokenRange, _ in
    guard let tag = tag else { return }
    let token = (text as NSString).substring(with: tokenRange)
    print("{token: \(token), tag: \(tag.rawValue), range: \(tokenRange)}")
}
That will print out something like
{token: Frank, tag: PersonalName, range: {0, 5}}
{token: said, tag: Verb, range: {6, 4}}
{token: Do, tag: Verb, range: {12, 2}}
{token: n't, tag: Adverb, range: {14, 3}}
{token: take, tag: Verb, range: {18, 4}}
{token: the, tag: Determiner, range: {23, 3}}
Language
For extra debugging, you can set the language as well or check if the scheme you want is even available at all in the language you're trying to tag (since it depends largely on the language):
// check that it determined the language correctly
print("dominant language is: \(tagger.dominantLanguage ?? "unknown")") // dominantLanguage is optional
// or set the language specifically
tagger.setOrthography(NSOrthography.defaultOrthography(forLanguage: "en-US"), range: range)
// or list which tag schemes are even available for this language
let tagSchemes = NSLinguisticTagger.availableTagSchemes(forLanguage: "en")
print(tagSchemes)
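The two checks above can also be combined before you even create a tagger. The sketch below assumes the class-level `NSLinguisticTagger.dominantLanguage(for:)` (macOS 10.13 / iOS 11 or later); `wanted` is just an illustrative name, and the detected language depends on the system's models.

```swift
import Foundation

let text = "Frank said 'Don't take the easy way out.'"
let wanted = NSLinguisticTagScheme.nameTypeOrLexicalClass

// Class-level language guess; returns nil when no language can be determined.
if let language = NSLinguisticTagger.dominantLanguage(for: text) {
    // Check whether the scheme we want is offered for the detected language.
    let available = NSLinguisticTagger.availableTagSchemes(forLanguage: language)
    print("language: \(language), \(wanted.rawValue) available: \(available.contains(wanted))")
} else {
    print("could not determine a dominant language")
}
```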
[...] "very" results in a valid sentence, inserting a "not" does not, inserting "not very" yet will... yikes... And I tried German: some two-word sentences don't work, though single words seem to work (yet sometimes not as perfectly as expected). But it is definitely a strange issue. – Feathers
"we are hungry" works, "he is hungry" works, just "i am hungry" does not :/ Using the adjective "thirsty" works for all three. – Feathers
"am" in case of "hungry" in contrast to "thirsty": it is 100% OtherWord, not even a slight chance for verb... – Feathers
[...] enumerateTagsInRange but when checking for possible tags using possibleTagsAtIndex it's an 'Adjective'. Also seeing the same 100% probability for OtherWord. – Bouzoun