I am using this excellent article to learn machine learning:
https://stackabuse.com/python-for-nlp-multi-label-text-classification-with-keras/
The author splits the X and y data first and then tokenizes the X texts:
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Split first, then fit the tokenizer on the training texts only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)

# Convert each text into a sequence of integer word indices
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)

vocab_size = len(tokenizer.word_index) + 1  # +1 for the reserved 0 (padding) index

# Pad (or truncate) every sequence to a fixed length
maxlen = 200
X_train = pad_sequences(X_train, padding="post", maxlen=maxlen)
X_test = pad_sequences(X_test, padding="post", maxlen=maxlen)
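To make sure I understand what these two calls do, here is a minimal sketch on a toy corpus I made up (the sentences and printed values are for illustration only; word indices are frequency-ranked by the Tokenizer):

demo = Tokenizer(num_words=5000)
demo.fit_on_texts(["the cat sat", "the dog sat down"])
print(demo.texts_to_sequences(["the cat sat down"]))
# -> [[1, 3, 2, 5]]: each known word is replaced by its integer index
print(pad_sequences(demo.texts_to_sequences(["the cat sat down"]),
                    padding="post", maxlen=6))
# -> [[1 3 2 5 0 0]]: zero-padded on the right up to maxlen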
If I tokenize it before calling the train_test_split function, I can save a few lines of code:
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X)  # fitted on the whole corpus, before any split

X_t = tokenizer.texts_to_sequences(X)
vocab_size = len(tokenizer.word_index) + 1

maxlen = 200
X = pad_sequences(X_t, padding="post", maxlen=maxlen)
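The split would then come afterwards, on the already-padded array (a sketch, assuming the same y labels as above):

# Split after tokenizing and padding the whole corpus
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)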
I just want to confirm that my approach is correct and that there are no surprises waiting for me later in the script.