Why does the Penn Treebank POS tagset have a separate tag for the word 'to'?

Different corpora tag at different levels of granularity. Compare the British National Corpus, for instance, whose tagset includes three different tags for "to".

I believe this is a by-product of how the corpus was tagged rather than a choice made for a specific NLP performance purpose. It is quite plausible that it was a design decision in the POS Guidelines for the Penn Treebank Project. (I am contacting the authors of this paper for further clarification.)

Without a separate tag for the word "to", the tagset would have to tag "to" sometimes as a preposition and sometimes as an infinitive marker. For that to happen, a human tagger would have had to disambiguate between the two roles of "to" at every occurrence. Some tricky cases (which require grammaticality judgments) would take extra human time to resolve, and given the size of the corpus, some of them would end up mistagged. The designers may have favored efficiency and correctness if the information gained by disambiguating "to" was estimated to be small, or if the potential tagging errors were estimated to be too many.
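To make the two roles concrete, here is a minimal sketch in Python. The sentence is annotated by hand, not by a real tagger, and the label "INF" for the infinitive marker is a hypothetical name chosen for illustration. The helper `collapse_to` is also hypothetical. It only shows what giving "to" a single tag amounts to: both roles end up under the Penn Treebank tag TO, so the annotator never has to choose.

```python
# Hand-annotated example, not the output of a real tagger.
# "INF" is a hypothetical label for the infinitive marker;
# "IN" is the Penn Treebank tag for prepositions.
fine_grained = [
    ("I", "PRP"),
    ("want", "VBP"),
    ("to", "INF"),   # infinitive marker: "to go"
    ("go", "VB"),
    ("to", "IN"),    # preposition: "to Paris"
    ("Paris", "NNP"),
]


def collapse_to(tagged):
    """Give every token "to" the single tag TO, whatever its role."""
    return [(word, "TO") if word.lower() == "to" else (word, tag)
            for word, tag in tagged]


penn_style = collapse_to(fine_grained)
print(penn_style)
```

Going from the fine-grained tags to the single tag is mechanical. Going the other way requires a judgment for each occurrence, and that is the human cost described above.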
