NLP: Is using a gazetteer cheating?


Asked 25/1, 2016 at 14:35 Answered 25/1, 2016 at 19:13

In NLP there is the concept of a gazetteer, which can be quite useful for creating annotations. As far as I understand:

A gazetteer consists of a set of lists containing names of entities such as cities, organisations, days of the week, etc. These lists are used to find occurrences of these names in text, e.g. for the task of named entity recognition.

So it is essentially a lookup. Isn't this kind of a cheat? If we use a gazetteer for detecting named entities, then there is not much natural language processing going on. Ideally, I would want to detect named entities using NLP techniques. Otherwise, how is it any better than a regex pattern matcher?

Does that make sense?

Jabber answered 25/1, 2016 at 14:35 Comment(5)

Yes. Once again, interesting question, but more suited for datascience.stackexchange.com =) Imagine this: if I have never seen an entity string before and I cannot guess from the context whether something in a test sentence is an entity, would I tag it as an entity? Now imagine that I cannot guess from the context whether something in a test sentence is an entity, but I know from my "knowledge base" or "gazetteer list" that this thing is an entity. Would I tag it as an entity? – Saberio 25/1, 2016 at 18:18

Thanks @Saberio. I guess what I am trying to ask is: how much of a performance gain can we get by using gazetteers as opposed to regex matching? I realize that it is impossible to create a regex that would match all possible organization names. But then why not create a lookup table for all such names, and keep adding to it as new names and feedback come in? – Jabber 25/1, 2016 at 18:38

I will post this on datascience.stackexchange.com as well – Jabber 25/1, 2016 at 18:39

Read up on the history of entity recognition, knowledge base population and slot filling. Hopefully you will get a sense of why a gazetteer is preferred over full-blown regex rules. – Saberio 25/1, 2016 at 18:51

same question on Data Science - StackExchange by @AbtPst. – Novara 30/3, 2017 at 5:10

It depends on how you build and use your gazetteer. If you are presenting experiments in a closed domain and you hand-picked your gazetteer, then yes, you are cheating. If you are using an openly available gazetteer and running experiments on a large dataset, or using it in an application in the wild where you don't control the input, then you are fine. We found ourselves in a similar situation: we partition our dataset and use the training data to automatically build our gazetteers. As long as you report your methodology, you should not feel like you are cheating (let the reviewers complain).
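To make the partitioning point concrete, here is a minimal Python sketch, not the exact setup described above. It assumes each annotated sentence is already reduced to a list of (entity_text, label) pairs; the `build_gazetteer` helper and the toy data are hypothetical.

```python
from collections import defaultdict

def build_gazetteer(annotated_sentences):
    """Collect entity strings per label from annotated training sentences.

    Each sentence is a list of (entity_text, label) pairs; this input
    format is an assumption made for the sketch.
    """
    gazetteer = defaultdict(set)
    for entities in annotated_sentences:
        for text, label in entities:
            gazetteer[label].add(text.lower())
    return gazetteer

# Toy annotated data (hypothetical).
data = [
    [("Acme Corp", "ORG"), ("Paris", "LOC")],
    [("Globex", "ORG")],
    [("Berlin", "LOC")],
    [("Initech", "ORG"), ("Paris", "LOC")],
]

# Partition first, then build the gazetteer from the training part only.
train, test = data[:3], data[3:]
gazetteer = build_gazetteer(train)

# Entities in the held-out part that the gazetteer has never seen.
unseen = [text for entities in test for text, label in entities
          if text.lower() not in gazetteer[label]]
print(unseen)
```

Because the gazetteer never sees the held-out part, any entity it recognizes there was learned from training data alone, and the entities it misses show how far pure lookup gets you.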

Quickstep answered 25/1, 2016 at 19:13 Comment(12)

How did you automatically build gazetteers? – Jabber 25/1, 2016 at 20:19

Should have said semi-automatic. First I extracted referring expressions (NP nodes in the parse tree), then clustered and classified them by hand. Then I use the annotations on the training set to build the gazetteers I will use in my tests. k-fold cross-validation. – Quickstep 25/1, 2016 at 23:41

;P i find this helpful python nltk_cli/senna.py --np file.txt for slot-filling candidates: github.com/alvations/nltk_cli – Saberio 26/1, 2016 at 3:26

I totally agree with Josep! Besides, what does "cheating" mean? A purist would argue that you shouldn't use syntax, morpho-syntax, or even any lexicon! Great challenge for a utopian :) On the other hand, making an extensive list of entities based on all available documents will undeniably produce biased results. IMHO you should make a list of entities for the domain, and also 1/ report your methodology without omitting how the gazetteer was fed, and 2/ clearly separate the train and test sets for evaluation, so that you don't know which entities are in the test part. – Postwar 27/1, 2016 at 11:52

Thanks for the helpful comments, everyone! I can now see the value of the Gazetteers. Now, @Postwar suggests creating training and test sets. Will the Gazetteer still be helpful in this scenario? If the Gazetteer is just a lookup, how will it aid in 'learning'? Is it possible to build a classifier that uses a Gazetteer for training but is still able to pick up entities similar to the ones in the Gazetteer? So if my Gazetteer has 1000 company names, can I build a classifier to detect company names not present in the Gazetteer at all? – Jabber 27/1, 2016 at 15:7

I love how the word Gazetteer appears totally in sync in the four lines above! It was totally unintentional on my part :) – Jabber 27/1, 2016 at 15:8

Going by the questions in your comment, you should think of a gazetteer as just another feature that can aid in building a classifier. It can still help in finding terms that aren't in your gazetteer, since it lets you pull out the context around gazetteer terms in your training data. So the answer is yes to all the questions in your comment. – Penzance 27/1, 2016 at 18:18
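A rough Python sketch of the "gazetteer as just another feature" idea follows. The `token_features` function, the feature names and the tiny company list are all hypothetical; a real system would feed such feature dictionaries to a sequence classifier of your choice.

```python
def token_features(tokens, i, gazetteer):
    """Features for tokens[i]; the gazetteer flag is one feature among several."""
    word = tokens[i]
    return {
        "in_gazetteer": word.lower() in gazetteer,
        "is_capitalized": word[:1].isupper(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

gazetteer = {"acme", "globex"}  # hypothetical company list

# A company the gazetteer knows, and one it does not, in similar contexts.
seen = token_features("She works at Acme today".split(), 3, gazetteer)
unseen = token_features("He works at Initech now".split(), 3, gazetteer)

print(seen)
print(unseen)
```

The two tokens differ on the gazetteer flag but share the capitalization and the preceding word "at". A classifier trained on examples like the first can pick up that context and, in principle, apply it to names like the second that the gazetteer has never listed.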

@Penzance that is precisely how I work with them. If you are interested here is one of my previous articles: aaai.org/ocs/index.php/INT/INT7/paper/viewFile/9253/9204 – Quickstep 27/1, 2016 at 22:29

Using a gazetteer provides non-token features. You may use it as: 1/ a binary feature (is the token in the gazetteer?) or 2/ a class feature (what category does the gazetteer give for that token? location, org, etc., or fine-grained, e.g. loc.town, loc.building, org.company). Besides, approximate matching (ignoring case and punctuation) will help detect variations of names. For names that are not in the gazetteer at all, you'll probably have to look for morphological patterns (common prefixes and suffixes). If none of these work, look at context. – Postwar 29/1, 2016 at 9:44
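As a hedged illustration of these feature types, here is a small Python sketch. The `GAZETTEER` entries, the `normalize` helper and the three-character affix length are assumptions chosen for the example, not part of any particular toolkit.

```python
import string

# Hypothetical gazetteer mapping normalized names to fine-grained classes.
GAZETTEER = {
    "new york": "loc.town",
    "acme corp": "org.company",
}

# Translation table that deletes ASCII punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize(name):
    """Lowercase, drop punctuation and collapse whitespace."""
    return " ".join(name.lower().translate(_PUNCT_TABLE).split())

def gazetteer_features(name):
    """Binary, class and simple affix features for a candidate name."""
    key = normalize(name)
    return {
        "in_gazetteer": key in GAZETTEER,               # 1/ binary feature
        "gazetteer_class": GAZETTEER.get(key, "none"),  # 2/ class feature
        "prefix3": key[:3],                             # crude morphological cues
        "suffix3": key[-3:],
    }

known = gazetteer_features("ACME Corp.")
unknown = gazetteer_features("Initech Inc.")
print(known)
print(unknown)
```

Normalizing before lookup lets "ACME Corp." match the entry "acme corp", while the affix features still give a classifier something to work with for names that are missing from the list.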

@josep-valls your paper looks interesting, you may also have a look at damien.nouvels.net/publications/… :) – Postwar 29/1, 2016 at 9:46

@Postwar thanks! In your paper you mention the Ester2 and Etape French corpora, but the examples are in English. I was not aware of those datasets; are they public? If so, it could be interesting for the OP to take a look with regard to extracting gazetteers. – Quickstep 29/1, 2016 at 16:30

@josep-valls you're welcome! Unfortunately, Ester2 and Etape are not free, but you may have a look at the Quaero corpus (see catalog.elra.info/search.php), which is free for research purposes. Also, as a side note: many gazetteers are extracted from Wikipedia. – Postwar 30/1, 2016 at 18:8
