NLP: Is a Gazetteer a cheat?
J

1

15

In NLP there is the concept of a gazetteer, which can be quite useful for creating annotations. As far as I understand:

A gazetteer consists of a set of lists containing names of entities such as cities, organisations, days of the week, etc. These lists are used to find occurrences of these names in text, e.g. for the task of named entity recognition.

So it is essentially a lookup. Isn't this kind of a cheat? If we use a gazetteer for detecting named entities, then there is not much natural language processing going on. Ideally, I would want to detect named entities using NLP techniques. Otherwise, how is it any better than a regex pattern matcher?

Does that make sense?

Jabber answered 25/1, 2016 at 14:35 Comment(5)
Yes. Once again, an interesting question, but more suited for datascience.stackexchange.com =) Imagine this: I have never seen an entity string before and I cannot guess from the context whether something in a test sentence is an entity. Would I tag it as an entity? Now imagine that I cannot guess from the context whether something in a test sentence is an entity, but I know from my "knowledge base" or "gazetteer list" that this thing is an entity. Would I tag it as an entity? Saberio
Thanks @Saberio. I guess what I am trying to ask is: how much of a performance gain can we get by using gazetteers as opposed to regex matching? I realize that it is impossible to create a regex that would match all possible organization names. But then why not create a lookup table for all such names, and keep adding to it as new names and feedback come in? Jabber
I will post this on datascience.stackexchange.com as well. Jabber
Read up on the history of entity recognition, knowledge base population and slot filling. Hopefully you will get a sense of why a gazetteer is preferred over full-blown regex rules. Saberio
same question on Data Science - StackExchange by @AbtPst.Novara
Q
7

It depends on how you built and how you use your gazetteer. If you are presenting experiments in a closed domain and you hand-picked your gazetteer, then yes, you are cheating. If you are using an openly available gazetteer and running experiments on a large dataset, or using it in an application in the wild where you don't control the input, then you are fine. We found ourselves in a similar situation: we partition our dataset and use the training data to automatically build our gazetteers. As long as you report your methodology, you should not feel like you are cheating (let the reviewers complain).

Quickstep answered 25/1, 2016 at 19:13 Comment(12)
How did you automatically build the gazetteers? Jabber
I should have said semi-automatic. First I extracted referring expressions (NP nodes in the parse tree), then clustered and classified them by hand. Then I use the annotations on the training set to build the gazetteers I will use in my tests, with k-fold cross-validation. Quickstep
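A minimal sketch of the fold-wise idea described above: in each fold, the gazetteer is built only from the training portion, so the test entities are never looked up "for free". The data layout and the helper names `build_gazetteer` and `k_fold_splits` are assumptions made for illustration, and the toy corpus is invented; this is not the commenter's actual pipeline.

```python
from collections import defaultdict


def build_gazetteer(annotated_sentences):
    """Collect lowercased entity strings per label from annotated sentences."""
    gazetteer = defaultdict(set)
    for sentence in annotated_sentences:
        for text, label in sentence["entities"]:
            gazetteer[label].add(text.lower())
    return dict(gazetteer)


def k_fold_splits(items, k):
    """Yield (train, test) pairs; fold i tests on every k-th item starting at i."""
    for i in range(k):
        test = [x for j, x in enumerate(items) if j % k == i]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test


# Hypothetical toy corpus with hand-made annotations.
corpus = [
    {"text": "Acme opened an office in Paris.",
     "entities": [("Acme", "ORG"), ("Paris", "LOC")]},
    {"text": "Globex moved to Berlin.",
     "entities": [("Globex", "ORG"), ("Berlin", "LOC")]},
    {"text": "Initech hired staff in Madrid.",
     "entities": [("Initech", "ORG"), ("Madrid", "LOC")]},
]

# One gazetteer per fold, built from that fold's training data only.
folds = []
for train, test in k_fold_splits(corpus, k=3):
    folds.append((build_gazetteer(train), test))
```

With this split, the first fold's gazetteer contains "globex" and "initech" but not "acme", which appears only in that fold's test sentence.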
;P I find this helpful for slot-filling candidates: python nltk_cli/senna.py --np file.txt, from github.com/alvations/nltk_cli Saberio
I totally agree with Josep! Besides, what does "cheating" mean? A purist would argue that you shouldn't use syntax, morpho-syntax, or even any lexicon! A great challenge for a utopian :) On the other hand, making an extensive list of entities based on all available documents will undeniably produce biased results. IMHO you should make a list of entities of the domain, and also 1/ report your methodology without omitting how the gazetteer was fed, and 2/ clearly separate the train and test sets for evaluation, so that you don't know which entities are in the test part. Postwar
Thanks for the helpful comments, everyone! I can see the value of gazetteers now. @Postwar suggests creating training and test sets. Will the gazetteer still be helpful in this scenario? If the gazetteer is just a lookup, how will it aid in 'learning'? Is it possible to build a classifier that uses a gazetteer for training but is still able to pick up entities similar to the ones in the gazetteer? So if my gazetteer has 1000 company names, can I build a classifier to detect company names not present in the gazetteer at all? Jabber
I love how the word Gazetteer appears totally in sync in the four lines above! It was totally unintentional on my part :) Jabber
From the questions in your comments, you should think of a gazetteer as just another feature that can aid in building a classifier. It can still be helpful in finding terms that aren't in your gazetteer, since it allows you to pull out the context around gazetteer terms in your training data. So the answer is yes to all the questions in your comment. Penzance
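One crude way to picture "pulling out the context around gazetteer terms" is sketched below. It counts the word immediately before each gazetteer hit in training sentences, then flags capitalized words outside the gazetteer that follow one of those context words. This heuristic, the helper names `context_counts` and `candidates`, and the toy sentences are all assumptions for illustration; a real system would more likely feed such context features to a trained sequence classifier.

```python
from collections import Counter


def context_counts(sentences, gazetteer):
    """Count the word immediately before each gazetteer hit."""
    left = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if i > 0 and tok.strip(".,").lower() in gazetteer:
                left[tokens[i - 1].lower()] += 1
    return left


def candidates(sentence, gazetteer, left_contexts):
    """Capitalized words not in the gazetteer that follow a known context word."""
    tokens = sentence.split()
    found = []
    for i in range(1, len(tokens)):
        word = tokens[i].strip(".,")
        if (word[:1].isupper()
                and word.lower() not in gazetteer
                and tokens[i - 1].lower() in left_contexts):
            found.append(word)
    return found


# Hypothetical gazetteer and training sentences.
gazetteer = {"acme", "globex"}
training = [
    "She works at Acme.",
    "He joined Globex last year.",
    "They work at Globex.",
]

left = context_counts(training, gazetteer)
new = candidates("My cousin works at Initech now.", gazetteer, left)
```

Here "Initech" is not in the gazetteer, but it follows "at", a context learned from the gazetteer hits, so it is proposed as a candidate company name.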
@Penzance that is precisely how I work with them. If you are interested, here is one of my previous articles: aaai.org/ocs/index.php/INT/INT7/paper/viewFile/9253/9204 Quickstep
Using a gazetteer provides non-token features. You may use it as: 1/ a binary feature (is the token in the gazetteer?) or 2/ a class feature (what category does the gazetteer give for that token? location, org, etc., or fine-grained, e.g. loc.town, loc.building, org.company). Besides, approximate matching (ignoring case and punctuation) will help detect variations of names. For names that are not in the gazetteer at all, you'll probably have to look for morphological patterns (common prefixes and suffixes). If none of these work, look at the context. Postwar
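The feature types listed above could be sketched roughly as follows: a binary membership flag, a category feature, approximate matching via case and punctuation normalization, and simple prefix/suffix features as a fallback. The gazetteer contents and the names `GAZETTEER`, `normalize` and `gazetteer_features` are hypothetical, and the three-character affix length is an arbitrary choice.

```python
import string

# Hypothetical gazetteer mapping a normalized name to a fine-grained category.
GAZETTEER = {
    "paris": "loc.town",
    "eiffel tower": "loc.building",
    "acme corp": "org.company",
}

# Translation table that deletes ASCII punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(name):
    """Lowercase, drop punctuation and collapse whitespace."""
    return " ".join(name.lower().translate(_PUNCT_TABLE).split())


def gazetteer_features(span):
    """Build a feature dict for a token or multi-word span."""
    category = GAZETTEER.get(normalize(span))
    return {
        "in_gazetteer": category is not None,  # 1/ binary feature
        "gaz_class": category or "none",       # 2/ class feature
        "prefix3": span[:3].lower(),           # morphological fallback
        "suffix3": span[-3:].lower(),
    }


known = gazetteer_features("ACME Corp.")
unknown = gazetteer_features("Initech")
```

In this sketch "ACME Corp." matches the "acme corp" entry despite the different case and the trailing period, while "Initech" gets no gazetteer match and only its affix features remain for a downstream classifier to use.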
@josep-valls your papers look interesting. You may also have a look at damien.nouvels.net/publications/… :) Postwar
@Postwar thanks! In your paper you mention the Ester2 and Etape French corpora, but the examples are in English. I was not aware of those datasets; are they public? If so, it could be interesting for the OP to take a look at them for extracting gazetteers. Quickstep
@josep-valls you're welcome! Unfortunately, Ester2 and Etape are not free, but you may have a look at the Quaero corpus (see catalog.elra.info/search.php), which is free for research purposes. Also, as a side note: many gazetteers are extracted from Wikipedia. Postwar
