What are the entity types for NLTK?

I've been trying to find the full list of entity types of NLTK. I was only able to find the most common ones on this page, but not the full list. Could you please share the full list of named entity types NLTK has?

Orthopedist answered 20/7, 2017 at 9:37 Comment(3)
Maybe https://mcmap.net/q/907066/-ne_chunk-without-pos_tag-in-nltk would help ;P - Scottie
That's nice @Scottie, but even after slogging through the explanations there and clicking through to the three types of NE classes, none of the three lists matches the actual labels reported by the chunker... and none includes "GSP". What's that about? - Forereach
@Forereach good point. I suspect ne_chunk is an arcane artifact of a Python port of Stanford's Java maxent tagger. Yeah, GSP looks like an ancient tag that people don't use nowadays; I would simply have tagged it as ORGANIZATION. Issue raised: github.com/nltk/nltk/issues/1783 - Scottie

That's a very good question; I've wondered the same myself. It doesn't seem to be documented anywhere, even in the nltk source, and of course it is determined by the corpus the chunker was trained on, which, it seems, is or was the ACE corpus, which is not distributed with nltk.

A little bit of digging around in the source turned up the answer:

>>> import nltk
>>> # Note: these are private names and may differ between NLTK versions.
>>> chunker = nltk.data.load(nltk.chunk._MULTICLASS_NE_CHUNKER)  # cf. nltk/chunk/__init__.py
>>> sorted(chunker._tagger._classifier.labels())
['B-FACILITY', 'B-GPE', 'B-GSP', 'B-LOCATION', 'B-ORGANIZATION', 'B-PERSON', 
 'I-FACILITY', 'I-GPE', 'I-GSP', 'I-LOCATION', 'I-ORGANIZATION', 'I-PERSON',
 'O']
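
The labels above are IOB tags: `B-` marks the beginning of an entity, `I-` a continuation, and `O` a token outside any entity. As a minimal sketch (plain Python, with the label list copied by hand from the output above rather than loaded from the chunker), the bare entity types can be recovered by stripping the prefixes:

```python
# Labels copied from the chunker output shown above.
labels = [
    'B-FACILITY', 'B-GPE', 'B-GSP', 'B-LOCATION', 'B-ORGANIZATION', 'B-PERSON',
    'I-FACILITY', 'I-GPE', 'I-GSP', 'I-LOCATION', 'I-ORGANIZATION', 'I-PERSON',
    'O',
]

# Drop the "outside" tag, strip the B-/I- prefix, and de-duplicate.
entity_types = sorted({label.split('-', 1)[1] for label in labels if label != 'O'})

print(entity_types)
```

This leaves six distinct types: FACILITY, GPE, GSP, LOCATION, ORGANIZATION and PERSON.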

Note that some of the "common" types mentioned in the book, including DATE and TIME, are not actually detected by this chunker. GPE stands for Geo-Political Entity; GSP stands for Geographical-Social-Political Entity, an older tag that was replaced by GPE in the ACE project. From their definitions (see links below), they seem to be pretty much equivalent.
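
If you accept that GSP and GPE are equivalent for your purposes, one option is to fold them together when post-processing the chunker's output. The following is only a sketch under that assumption: `LABEL_ALIASES`, `normalize_label` and the sample `entities` list are hypothetical names and made-up data for illustration, not part of nltk.

```python
# Hypothetical alias table: treat the older GSP label as GPE.
LABEL_ALIASES = {"GSP": "GPE"}


def normalize_label(label):
    """Return the canonical entity type, leaving unknown labels unchanged."""
    return LABEL_ALIASES.get(label, label)


# Made-up (text, label) pairs standing in for entities pulled from chunker output.
entities = [("Paris", "GSP"), ("France", "GPE"), ("Alice", "PERSON")]

normalized = [(text, normalize_label(label)) for text, label in entities]
print(normalized)
```

Keeping the mapping in a dictionary makes it easy to add further remappings later if your own category scheme differs from the chunker's.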

Edit, January 2019: Prompted by Daniel's question, I looked at the documentation of the ACE project myself in search of a description of these entities. Sure enough, this page links to documentation for each phase of the project. The entity names listed above, including the mysterious GSP but without the GPE entity, were used through phase 1 of the project. Starting with phase 2, GPE replaced GSP on the list. One has to wonder how the nltk chunker ended up being trained on both GPE and GSP, or how it decides between the two. My best guess is that it was trained on a combination of Phase 1 and Phase 2 materials.

Forereach answered 20/7, 2017 at 11:57 Comment(4)
Thanks, @alexis. I was planning to design my database around these types, but this seems even more confusing. This document also mentions 7 entities (PER, ORG, GPE, LOC, FAC, VEH, WEA), but only describes 5 of them. I have no idea what VEH and WEA are. - Orthopedist
Vehicle and Weapon, according to the document I linked to. Whoever trained nltk's chunker may have remapped some categories in the training data. Just choose categories that can be manually assigned reliably for your purposes, and for your text genre. - Forereach
What does GSP stand for? I wasn't able to find any information on it in any documentation. The only thing I can think of is geospatial entity, which seems like it would be the same thing as GPE. - Please
I took a look myself, and had better luck. See edited answer. - Forereach
