Preprocessing before or after train/test split

I am using this excellent article to learn machine learning.

https://stackabuse.com/python-for-nlp-multi-label-text-classification-with-keras/

The author tokenizes the X data after splitting it into train and test sets:

from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Fit the tokenizer on the training texts only
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)

X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

vocab_size = len(tokenizer.word_index) + 1

maxlen = 200

X_train = pad_sequences(X_train, padding="post", maxlen=maxlen)
X_test = pad_sequences(X_test, padding="post", maxlen=maxlen)

If I tokenize the data before calling train_test_split, I can save a few lines of code:

# Fit the tokenizer on the whole dataset before splitting
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X)

X_t = tokenizer.texts_to_sequences(X)
vocab_size = len(tokenizer.word_index) + 1
maxlen = 200

X = pad_sequences(X_t, padding="post", maxlen=maxlen)

I just want to confirm that my approach is correct and that I will not run into any surprises later in the script.

Patton asked 28/8, 2019 at 13:15
Not a programming question, hence arguably off-topic here; better suited for Cross Validated. – Emolument
The simple golden rule for ML preprocessing is to treat your test data as not available during the whole pipeline of model fitting. See the discussion here (although in a slightly different context). – Emolument

Both approaches will work in practice, but fitting the tokenizer on the train set and then applying it to both the train and test sets is better than fitting it on the whole dataset. With the first method you mimic the fact that words unseen by the model will appear at some point after you deploy it, so your model evaluation will be closer to what happens in a production environment.
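A minimal sketch of that effect (toy sentences of my own, not from the article, using the Keras Tokenizer): words that appear only in the test texts have no entry in the word index of a tokenizer fitted on the training texts, so texts_to_sequences() silently drops them, just as genuinely new words would be dropped in production.

from keras.preprocessing.text import Tokenizer

train_texts = ["the movie was great", "the plot was dull"]
test_texts = ["the acting was great"]   # "acting" never occurs in the training texts

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(train_texts)     # fit on the train split only

print(tokenizer.texts_to_sequences(test_texts))
# -> something like [[1, 2, 4]]: only three ids for four words, "acting" is skipped

If you would rather keep a placeholder for unknown words than drop them, constructing the tokenizer with Tokenizer(oov_token="<OOV>") reserves an index that all out-of-vocabulary words are mapped to.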

Illogical answered 28/8, 2019 at 13:25

Agreed with @desertnaut's comment that the question is better suited for Cross Validated; you'll get a better response there. But I'd still like to make a remark.

TL;DR: Don't do it. It's generally not a good idea to cross-contaminate your training and test sets, and it's not statistically correct to do so.

Tokenizer.fit_on_texts() builds the word index, i.e. a mapping from words to integers (the vector representation). The vocabulary difference between the training and test sets is usually not empty: some words in the test set will not be present in the word index if the Tokenizer object was fitted only on the train data. As a result, some test samples will produce different vectors than they would if you had fitted the tokenizer on the whole dataset.
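A toy comparison of the two vocabularies (illustrative data, not from the article): the tokenizer fitted on the full dataset knows more words and therefore encodes the same test sentence differently.

from keras.preprocessing.text import Tokenizer

X = ["cats purr loudly", "dogs bark loudly", "parrots talk a lot"]
X_train, X_test = X[:2], X[2:]               # pretend this is the train/test split

tok_train = Tokenizer()
tok_train.fit_on_texts(X_train)              # vocabulary from the train split only

tok_all = Tokenizer()
tok_all.fit_on_texts(X)                      # vocabulary from the whole dataset

print(len(tok_train.word_index))             # 5
print(len(tok_all.word_index))               # 9 - it has also seen the test words

print(tok_train.texts_to_sequences(X_test))  # [[]] - every test word is unknown and dropped
print(tok_all.texts_to_sequences(X_test))    # four ids - the test words leaked into the vocabulary

Note that vocab_size = len(tokenizer.word_index) + 1 changes accordingly, so an embedding layer sized from it will also differ between the two approaches.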

Since the test set in a learning problem is supposed to be held out, using it during any part of fitting the model (and that includes fitting the tokenizer) is statistically incorrect.

Rikkiriksdag answered 28/8, 2019 at 13:34

To add to Simon's post: I would say it is even forbidden to tokenize before splitting.

The tokenizer would then learn from data that is strictly reserved for testing the algorithm, which defeats the main point of keeping the train and test sets separate.

Revenge answered 28/8, 2019 at 13:28
