Why does the Penn Treebank POS tagset have a separate tag for the word 'to'?

The Penn Treebank tagset has a separate tag, TO, for the word 'to', irrespective of whether it is used as a preposition (as in "I went to school") or as an infinitive marker (as in "I want to eat"). What purpose does this serve from an overall NLP perspective? Tagging only the infinitival 'to' separately would make intuitive sense, but I don't see the logic behind combining the infinitive marker and the preposition under a single tag.

Thanks, and apologies if this doesn't fit the Stack Overflow guidelines.

Baluchistan answered 29/9, 2013 at 15:5

Different corpora provide different levels of granularity. Compare this, for instance, to the British National Corpus, whose tagset includes three different tags for "to".

I believe this is more likely a by-product of the corpus tagging practice than a choice made for a specific NLP performance purpose. It may simply have been a design decision in the POS guidelines for the Penn Treebank Project. (Contacting the authors of this paper for further clarification.)

For the tagset to avoid a single dedicated tag for the word "to", each occurrence would have to be tagged either as a preposition or with a different tag for "infinitive marker." That means a human tagger would have had to disambiguate between the two roles of "to." Some tricky cases (which require grammaticality judgments) take extra human time to resolve, and given the size of the corpus, they could also lead to some mistagging. A single tag favors efficiency and correctness, which would be the better tradeoff if the information gained from disambiguating "to" was estimated to be small, or if the number of potential tagging errors was estimated to be too high.
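To give a rough idea of what that disambiguation involves, here is a minimal sketch that splits the single TO tag back into two roles by looking at the tag that follows. It assumes the input is already tagged with Penn Treebank tags. The function name `split_to_tag` and the labels `TO-INF` and `TO-PREP` are made up for this illustration and are not part of any tagset, and the rule itself is only a heuristic (the tricky cases mentioned above are exactly where it would fail).

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (word, Penn Treebank tag)


def split_to_tag(tagged: List[TaggedToken]) -> List[TaggedToken]:
    """Relabel each TO token as TO-INF or TO-PREP (hypothetical labels).

    Heuristic: 'to' is treated as an infinitive marker when the next
    token, skipping any adverbs (RB), is a base-form verb (VB).
    Everything else is treated as a preposition.
    """
    result = []
    for i, (word, tag) in enumerate(tagged):
        if tag == "TO":
            j = i + 1
            # Skip adverbs so that "to boldly go" still counts as infinitival.
            while j < len(tagged) and tagged[j][1] == "RB":
                j += 1
            is_infinitive = j < len(tagged) and tagged[j][1] == "VB"
            tag = "TO-INF" if is_infinitive else "TO-PREP"
        result.append((word, tag))
    return result


# The two sentences from the question, hand-tagged for this example.
went = [("I", "PRP"), ("went", "VBD"), ("to", "TO"), ("school", "NN")]
want = [("I", "PRP"), ("want", "VBP"), ("to", "TO"), ("eat", "VB")]

print(split_to_tag(went))
print(split_to_tag(want))
```

A rule this simple handling the easy cases suggests the information lost by merging the two uses may be fairly small, though that is speculation on my part and not something the guidelines are quoted as saying here.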

Theatrician answered 22/12, 2013 at 21:11
