Use tf-idf with FastText vectors

I'm interested in using tf-idf with the FastText library, but I haven't found a logical way to handle the n-grams. I have already used tf-idf with SpaCy vectors, for which I have found several examples like these:

But for the FastText library it is not that clear to me, since it has a granularity that isn't as intuitive, e.g.:

For a general word2vec approach I would have one vector for each word; I can count the term frequency of that word and weight its vector accordingly.

But with fastText the same word will have several n-grams:

"Listen to the latest news summary" will have n-grams generated by a sliding windows like:

lis ist ste ten tot het...

These n-grams are handled internally by the model, so when I try:

model["Listen to the latest news summary"] 

I get the final vector directly; hence, what I have thought of is to split the text into n-grams before feeding them to the model, like:

model['lis']
model['ist']
model['ten']

And build the tf-idf from there, but that seems like an inefficient approach. Is there a standard way to apply tf-idf to n-gram vectors like these?
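
For reference, the splitting I have in mind would look roughly like this (the pretrained-vectors file name below is just a placeholder, not my actual setup):

from gensim.models.fasttext import load_facebook_vectors

text = "Listen to the latest news summary"
n = 3
# Sliding-window character trigrams over the raw text: 'Lis', 'ist', 'ste', ...
trigrams = [text[i:i + n] for i in range(len(text) - n + 1)]

ft = load_facebook_vectors("cc.en.300.bin")  # placeholder pretrained vectors
vectors = [ft[g] for g in trigrams]          # one vector per trigram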

Unpleasantness answered 23/9, 2019 at 20:28 Comment(2)
What do you mean by "I can count the term frequency of that word and weight its vector accordingly"? Where do you find that frequency stored? Directly in SpaCy? Also, are you sure that FastText uses trigrams "in between" words, like the trigram 'tot' in your example?Cubiform
@Cubiform Not in SpaCy; I used gensim to count the words. I followed this tutorial for that: dsgeek.com/2018/02/19/tfidf_vectors.html. Regarding the n-grams, that is the information I have found; I have not looked into the source code.Unpleasantness

I would let FastText deal with the trigrams, but keep building the tf-idf-weighted embeddings at the word level.

That is, you send

model["Listen"]
model["to"]
model["the"]
...

to FastText, and then use your old code to get the tf-idf weights.
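
A minimal sketch of that combination, assuming gensim is used both for the tf-idf weights and for loading the FastText vectors (the .bin file name is only a placeholder):

import numpy as np
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.models.fasttext import load_facebook_vectors
from gensim.utils import simple_preprocess

docs = [
    "Listen to the latest news summary",
    "The weather report for tomorrow",
]
tokenized = [simple_preprocess(d) for d in docs]

# Word-level tf-idf, exactly as with word2vec or SpaCy vectors.
dictionary = Dictionary(tokenized)
tfidf = TfidfModel(dictionary=dictionary)

# Placeholder path to pretrained FastText vectors in Facebook's .bin format.
ft = load_facebook_vectors("cc.en.300.bin")

def doc_vector(tokens):
    """tf-idf-weighted average of FastText word vectors for one document."""
    weights = dict(tfidf[dictionary.doc2bow(tokens)])  # {word_id: tf-idf weight}
    vec = np.zeros(ft.vector_size)
    total = 0.0
    for word_id, weight in weights.items():
        # FastText builds each word vector from its char n-grams internally.
        vec += weight * ft[dictionary[word_id]]
        total += weight
    return vec / total if total else vec

print(doc_vector(tokenized[0]).shape)   # (300,)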

In any case, it would be good to know whether FastText itself considers the word construct when processing a sentence, or whether it truly treats it only as a sequence of trigrams (blending consecutive words). If the latter is true, then with FastText you would lose information by breaking the sentence into separate words.

Cubiform answered 29/9, 2019 at 15:50 Comment(0)

You are talking about the fastText tokenization step (not the fastText embeddings), which is a (3, 6) char-n-gram tokenization, compatible with tf-idf. The full step can be computed outside of fastText quite easily; see Calculate TF-IDF using sklearn for n-grams in Python.
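
A small sketch of that computation with sklearn (my own illustration; analyzer="char_wb" and the (3, 6) range are assumptions chosen to mimic fastText's per-word subword granularity):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Listen to the latest news summary",
    "Read the latest weather report",
]

# "char_wb" builds character n-grams inside word boundaries (padded with
# spaces), which is closer to fastText's per-word subwords than raw
# character n-grams spanning across words.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 6))
X = vectorizer.fit_transform(docs)

print(X.shape)
print(vectorizer.get_feature_names_out()[:10])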

Hewet answered 14/10, 2019 at 9:59 Comment(0)

From what I understood of your question, you are confusing the difference between word-embedding methods (such as word2vec and many others) and tf-idf:

  • Basically, word-embedding methods are unsupervised models for generating word vectors. The word vectors generated by these models are now very popular in NLP tasks, because a word-embedding representation of a word captures more information than a one-hot representation: the former captures the semantic similarity of that word to other words, whereas the latter is equidistant from all other words. FastText is another way to implement word embeddings (recently open-sourced by Facebook researchers).
  • Tf-idf, instead, is a scoring scheme for words, that is, a measure of how important a word is to a document.

From a practical usage standpoint, while tf-idf is a simple scoring scheme and that is its key advantage, word embeddings may be a better choice for most tasks where tf-idf is used, particularly when the task can benefit from the semantic similarity captured by word embeddings (e.g. in information retrieval tasks).

Unlike Word2Vec, which learns a vector representation of the entire word, FastText learns a representation for each n-gram of the word, as you have already seen. The overall word embedding is then the sum of these n-gram representations. Because the FastText model works at a finer granularity (number of n-grams > number of words), it performs better than Word2Vec and allows rare words to be represented appropriately.

From my standpoint, in general it does not make sense to use FastText (or any word-embedding method) together with tf-idf. But if you want to use tf-idf with FastText, you must sum all the n-grams that compose your word and use that representation when computing the tf-idf weighting.
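
As a toy illustration of that summation (my own sketch; the subword vectors below are random placeholders standing in for fastText's learned vectors):

import numpy as np

def fasttext_subwords(word, minn=3, maxn=6):
    """Character n-grams of a word, with fastText's '<' and '>' boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(marked) - n + 1)]

subwords = fasttext_subwords("listen")
print(subwords[:6])   # ['<li', 'lis', 'ist', 'ste', 'ten', 'en>']

# Random placeholder vectors; in fastText these are learned subword embeddings.
rng = np.random.default_rng(0)
subword_vectors = {sw: rng.normal(size=100) for sw in subwords}

# The word representation is the sum of its subword vectors
# (real fastText also adds the whole token '<listen>' to the set).
word_vector = np.sum([subword_vectors[sw] for sw in subwords], axis=0)
print(word_vector.shape)   # (100,)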

Radiomicrometer answered 30/9, 2019 at 8:8 Comment(2)
I'm not confusing them; if you look into the links I included, they have similar implementations, but with SpaCy instead of FastText.Unpleasantness
@FabioL: It does make sense to combine tf-idf and word embeddings to generate a BoWs representation for a given document, where instead of directly averaging all word embeddings, you take a weighted average (based on their tf-idf scores).Cubiform
