Difference between tokenize.fit_on_text, tokenize.text_to_sequence and word embeddings?
Tried to search on various platforms but didn't get a suitable answer.
Word embeddings are a way of representing words such that words with the same or similar meaning have a similar representation. Two commonly used algorithms that learn word embeddings are Word2Vec and GloVe.
Note that word embeddings can also be learned from scratch while training your neural network for text processing on your specific NLP problem. You can also use transfer learning; in that case, it means transferring word representations learned from huge datasets to your problem.
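To make "similar meaning → similar representation" concrete, here is a toy illustration with hand-made 3-dimensional vectors and cosine similarity (the vectors are invented for this example; real embeddings are learned by Word2Vec/GloVe/your network and typically have 50–300 dimensions):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity: close to 1.0 for vectors pointing the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made vectors, purely illustrative -- real embeddings are learned.
embedding = {
    "dog":   [0.9, 0.8, 0.1],
    "puppy": [0.85, 0.75, 0.2],  # similar meaning -> nearby vector
    "car":   [0.1, 0.2, 0.9],    # different meaning -> distant vector
}

print(cosine(embedding["dog"], embedding["puppy"]))  # ~0.996, very similar
print(cosine(embedding["dog"], embedding["car"]))    # ~0.303, dissimilar
```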
As for the tokenizer (I assume it's Keras we're speaking of), from the documentation:
tokenizer.fit_on_texts() (note the plural — there is no fit_on_text)
--> Creates the vocabulary index based on word frequency: the more frequent a word, the lower its index. Index 0 is reserved (e.g. for padding), so indexing starts at 1. For example, given the phrase "My dog is different from your dog, my dog is prettier", you get word_index["my"] = 1 and word_index["dog"] = 2 ("my" and "dog" each appear three times, with "my" seen first), and word_index["is"] = 3 ("is" appears twice).
tokenizer.texts_to_sequences() (again plural — there is no text_to_sequence)
--> Transforms each text into a sequence of integers. Basically, it replaces every word in your sentence with that word's integer from the vocabulary index. You can inspect tokenizer.word_index (an attribute holding a dictionary, not a method) to verify the integer assigned to each word.
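The lookup step can be sketched the same way. The word_index literal below is what fit_on_texts would produce for the example sentence above (assumed here so the snippet stands alone), and the regex tokenization is again an assumption approximating Keras' default filters:

```python
import re

def texts_to_sequences(texts, word_index):
    """Map each text to a list of integers via a fitted vocabulary.

    Sketch of Keras' Tokenizer.texts_to_sequences: words not in the
    vocabulary are silently dropped (unless an oov_token was configured).
    """
    sequences = []
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        sequences.append([word_index[w] for w in words if w in word_index])
    return sequences

# Vocabulary as built from "My dog is different from your dog, my dog is prettier"
word_index = {"my": 1, "dog": 2, "is": 3, "different": 4,
              "from": 5, "your": 6, "prettier": 7}

print(texts_to_sequences(["My dog is prettier"], word_index))
# [[1, 2, 3, 7]]
```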