I have a question about defining the vocabulary used for feature extraction in text classification. In an experiment, there are two approaches I can think of:
1. Define the vocabulary using both the training data and the test data, so that no word from the test data is treated as 'unknown' during testing.
2. Define the vocabulary from the training data only, and treat every word in the test data that does not also appear in the training data as 'unknown' (see the sketch below).
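
For concreteness, here is a minimal Python sketch of approach 2; the function names and the `'<unk>'` token are just placeholders I made up:

```python
def build_train_vocab(train_docs):
    """Approach 2: the vocabulary is indexed from the training data only."""
    vocab = {'<unk>': 0}
    for doc in train_docs:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def to_indices(doc, vocab):
    """Any test word not seen in training collapses onto the single '<unk>' index."""
    return [vocab.get(word, vocab['<unk>']) for word in doc.split()]
```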
At first glance the second approach is the more scientifically sound one. However, it is worth noting that although we can never know the true vocabulary size in a practical system, there seems to be nothing stopping us from setting the vocabulary size somewhat larger than what appears in the training data, in order to cover words that only show up later. This would be helpful because it treats different unseen words as distinct features, instead of lumping them all together under a single 'unknown'. Is there any reason why this is not practical? A rough sketch of what I mean follows.
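
Extending the sketch above: reserve some extra index slots beyond the training vocabulary, hand each new test word its own index while slots remain, and only fall back to the shared `'<unk>'` index once they run out (the `extra_slots` value is arbitrary, just for illustration):

```python
def build_vocab_with_slack(train_docs, extra_slots=1000):
    """Index training words; reserve extra_slots indices for words seen later."""
    vocab = {'<unk>': 0}
    for doc in train_docs:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    capacity = len(vocab) + extra_slots  # total vocabulary size the model is built for
    return vocab, capacity

def to_indices_with_slack(doc, vocab, capacity):
    """Give each new word its own index while capacity allows, else map it to '<unk>'."""
    indices = []
    for word in doc.split():
        if word not in vocab and len(vocab) < capacity:
            vocab[word] = len(vocab)  # a fresh, distinct index for this unseen word
        indices.append(vocab.get(word, vocab['<unk>']))
    return indices
```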
New to machine learning. Help much appreciated.