Defining vocabulary size in text classification

I have a question about defining the vocabulary set needed for feature extraction in text classification. In an experiment, there are two approaches I can think of:

1. Define the vocabulary using both the training data and the test data, so that no word from the test data is treated as 'unknown' during testing.

2. Define the vocabulary from the training data only, and treat every word in the test data that does not also appear in the training data as 'unknown'.

At first glance the second approach seems more scientific. However, it is worth noting that although there is no way to know the true vocabulary size in a practical system, there seems to be no problem with setting the vocabulary size somewhat larger than what appears in the training data, in order to cover words that may show up later. This is helpful in that it actually treats different unknown words as different, instead of lumping them all together as 'unknown'. Is there any reason why this is not practical?
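To make the two options concrete, here is a minimal sketch of option 2, with an `extra_slots` parameter standing in for the "slightly larger vocabulary" idea (the function names and the `<unk>` token are illustrative, not from any particular library):

```python
from collections import Counter

def build_vocab(train_docs, min_count=1, extra_slots=0):
    """Build a word->index vocabulary from training documents only (option 2).

    extra_slots > 0 reserves unused indices, mimicking the idea of making
    the vocabulary a little larger than what the training data shows.
    """
    counts = Counter(w for doc in train_docs for w in doc.split())
    vocab = {"<unk>": 0}  # index 0 is reserved for unknown words
    for word, c in counts.items():
        if c >= min_count:
            vocab[word] = len(vocab)
    return vocab, len(vocab) + extra_slots  # mapping and total feature count

def encode(doc, vocab):
    # Any test-time word missing from the training vocabulary maps to <unk>.
    return [vocab.get(w, vocab["<unk>"]) for w in doc.split()]

vocab, size = build_vocab(["the cat sat", "the dog ran"])
print(encode("the cat flew", vocab))  # "flew" -> index 0 (<unk>)
```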

New to machine learning. Help much appreciated.

Rochkind answered 2/7, 2016 at 2:44

If you include in your model (e.g. a classification model) the test-set words that do not occur in the training set, then, because they never occur in the training data, their weights in the trained model will be zero; they have no effect other than increasing the model size. So option 2 is better.
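You can check this claim yourself. A small sketch, assuming scikit-learn (the toy data and the word "zebra" are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["good movie", "bad movie", "good plot", "bad plot"]
labels = [1, 0, 1, 0]

# Fix the vocabulary up front, including "zebra", which never occurs in training.
vec = CountVectorizer(vocabulary=["good", "bad", "movie", "plot", "zebra"])
X = vec.transform(train_texts)

clf = LogisticRegression().fit(X, labels)
print(dict(zip(vec.get_feature_names_out(), clf.coef_[0])))
# The coefficient for "zebra" comes out 0.0: its feature column is all zeros,
# so the data term contributes no gradient and L2 regularization keeps it at 0.
```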

Having said that, to compensate for the changing nature of your test data, one solution is to re-train your model periodically. Another is to use word2vec to build word representations and a k-nearest-neighbour model that, for each unseen word in the test set, gives you the nearest word in the training set, so you can use that word in place of the unknown one.
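A rough sketch of that nearest-neighbour fallback, assuming pretrained word2vec vectors are already available as a dict `embeddings` mapping word to numpy array (how you load them is up to you; gensim is one option, and the function name here is hypothetical):

```python
import numpy as np

def nearest_training_word(unseen_word, embeddings, train_vocab):
    """Map an out-of-vocabulary word to its closest in-vocabulary neighbour."""
    if unseen_word not in embeddings:
        return None  # no vector either; fall back to a generic <unk>
    v = embeddings[unseen_word]
    best_word, best_sim = None, -1.0
    for w in train_vocab:
        if w in embeddings:
            u = embeddings[w]
            sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
            if sim > best_sim:
                best_word, best_sim = w, sim
    return best_word
```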

Separates answered 4/7, 2016 at 4:7

In the real world, an NLP system always needs to deal with unknown words.

If you use the test data as part of your vocabulary set, your model will never face that situation during testing. The metrics are then broken and cannot show you the real performance of your model.

This is an important part of both knowledge discovery and natural language processing. You can google "natural language processing unknown words" for details, theory, and the common methods models use to handle this situation.

If you just want a tool for handling unknown words, word2vec may be a good fit.
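For example, a minimal way to train word2vec vectors, assuming gensim 4.x (the toy corpus is invented):

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat"], ["the", "dog", "ran"]]  # toy corpus
model = Word2Vec(sentences, vector_size=50, min_count=1)
print(model.wv.most_similar("cat", topn=3))
```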

Clown answered 2/7, 2016 at 5:32
